1.\"	$OpenBSD: flex.1,v 1.37 2014/03/23 16:28:29 jmc Exp $
2.\"
3.\" Copyright (c) 1990 The Regents of the University of California.
4.\" All rights reserved.
5.\"
6.\" This code is derived from software contributed to Berkeley by
7.\" Vern Paxson.
8.\"
9.\" The United States Government has rights in this work pursuant
10.\" to contract no. DE-AC03-76SF00098 between the United States
11.\" Department of Energy and the University of California.
12.\"
13.\" Redistribution and use in source and binary forms, with or without
14.\" modification, are permitted provided that the following conditions
15.\" are met:
16.\"
17.\" 1. Redistributions of source code must retain the above copyright
18.\"    notice, this list of conditions and the following disclaimer.
19.\" 2. Redistributions in binary form must reproduce the above copyright
20.\"    notice, this list of conditions and the following disclaimer in the
21.\"    documentation and/or other materials provided with the distribution.
22.\"
23.\" Neither the name of the University nor the names of its contributors
24.\" may be used to endorse or promote products derived from this software
25.\" without specific prior written permission.
26.\"
27.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
28.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
29.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
30.\" PURPOSE.
31.\"
32.Dd $Mdocdate: March 23 2014 $
33.Dt FLEX 1
34.Os
35.Sh NAME
36.Nm flex
37.Nd fast lexical analyzer generator
38.Sh SYNOPSIS
39.Nm
40.Bk -words
41.Op Fl 78BbdFfhIiLlnpsTtVvw+?
42.Op Fl C Ns Op Cm aeFfmr
43.Op Fl Fl help
44.Op Fl Fl version
45.Op Fl o Ns Ar output
46.Op Fl P Ns Ar prefix
47.Op Fl S Ns Ar skeleton
48.Op Ar
49.Ek
50.Sh DESCRIPTION
51.Nm
52is a tool for generating
53.Em scanners :
54programs which recognize lexical patterns in text.
55.Nm
56reads the given input files, or its standard input if no file names are given,
57for a description of a scanner to generate.
58The description is in the form of pairs of regular expressions and C code,
59called
60.Em rules .
61.Nm
62generates as output a C source file,
63.Pa lex.yy.c ,
64which defines a routine
65.Fn yylex .
66This file is compiled and linked with the
67.Fl lfl
68library to produce an executable.
69When the executable is run, it analyzes its input for occurrences
70of the regular expressions.
71Whenever it finds one, it executes the corresponding C code.
72.Pp
73The manual includes both tutorial and reference sections:
74.Bl -ohang
75.It Sy Some Simple Examples
76.It Sy Format of the Input File
77.It Sy Patterns
78The extended regular expressions used by
79.Nm .
80.It Sy How the Input is Matched
81The rules for determining what has been matched.
82.It Sy Actions
83How to specify what to do when a pattern is matched.
84.It Sy The Generated Scanner
85Details regarding the scanner that
86.Nm
87produces;
88how to control the input source.
89.It Sy Start Conditions
90Introducing context into scanners, and managing
91.Qq mini-scanners .
92.It Sy Multiple Input Buffers
93How to manipulate multiple input sources;
94how to scan from strings instead of files.
95.It Sy End-of-File Rules
96Special rules for matching the end of the input.
97.It Sy Miscellaneous Macros
98A summary of macros available to the actions.
99.It Sy Values Available to the User
100A summary of values available to the actions.
101.It Sy Interfacing with Yacc
102Connecting flex scanners together with
103.Xr yacc 1
104parsers.
105.It Sy Options
106.Nm
107command-line options, and the
108.Dq %option
109directive.
110.It Sy Performance Considerations
111How to make scanners go as fast as possible.
112.It Sy Generating C++ Scanners
113The
114.Pq experimental
115facility for generating C++ scanner classes.
116.It Sy Incompatibilities with Lex and POSIX
117How
118.Nm
119differs from
120.At
121.Nm lex
122and the
123.Tn POSIX
124.Nm lex
125standard.
126.It Sy Files
127Files used by
128.Nm .
129.It Sy Diagnostics
130Those error messages produced by
131.Nm
132.Pq or scanners it generates
133whose meanings might not be apparent.
134.It Sy See Also
135Other documentation, related tools.
136.It Sy Authors
137Includes contact information.
138.It Sy Bugs
139Known problems with
140.Nm .
141.El
142.Sh SOME SIMPLE EXAMPLES
143First some simple examples to get the flavor of how one uses
144.Nm .
145The following
146.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
150.Bd -literal -offset indent
151%%
152username    printf("%s", getlogin());
153.Ed
154.Pp
155By default, any text not matched by a
156.Nm
157scanner is copied to the output, so the net effect of this scanner is
158to copy its input file to its output with each occurrence of
159.Qq username
160expanded.
161In this input, there is just one rule.
162.Qq username
163is the
164.Em pattern
165and the
166.Qq printf
167is the
168.Em action .
169The
170.Qq %%
171marks the beginning of the rules.
172.Pp
173Here's another simple example:
174.Bd -literal -offset indent
175%{
176int num_lines = 0, num_chars = 0;
177%}
178
179%%
180\en      ++num_lines; ++num_chars;
181\&.       ++num_chars;
182
183%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
            num_lines, num_chars);
	return 0;
}
190.Ed
191.Pp
192This scanner counts the number of characters and the number
193of lines in its input
194(it produces no output other than the final report on the counts).
195The first line declares two globals,
196.Qq num_lines
197and
198.Qq num_chars ,
199which are accessible both inside
200.Fn yylex
201and in the
202.Fn main
203routine declared after the second
204.Qq %% .
205There are two rules, one which matches a newline
206.Pq \&"\en\&"
207and increments both the line count and the character count,
208and one which matches any character other than a newline
209(indicated by the
210.Qq \&.
211regular expression).
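.Pp
As a rough cross-check, the same tally can be written in plain C with no
scanner involved.
This is only a sketch of what the two rules compute;
.Fn count_text
and
.Vt struct counts
are hypothetical names used for illustration and are not part of
.Nm :

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two counters kept by the scanner. */
struct counts {
	int num_lines;
	int num_chars;
};

/* Tallies what the two rules above would count for a NUL-terminated text. */
static struct counts
count_text(const char *text)
{
	struct counts c = { 0, 0 };
	const char *p;

	assert(text != NULL);
	for (p = text; *p != '\0'; p++) {
		if (*p == '\n')
			++c.num_lines;	/* the newline rule */
		++c.num_chars;		/* both rules bump the char count */
	}
	return c;
}
```

For the text
.Qq ab\encd\en
this reports 2 lines and 6 characters, since a newline is counted as a
character as well as a line.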
212.Pp
213A somewhat more complicated example:
214.Bd -literal -offset indent
215/* scanner for a toy Pascal-like language */
216
217%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
220%}
221
222DIGIT    [0-9]
223ID       [a-z][a-z0-9]*
224
225%%
226
227{DIGIT}+ {
228        printf("An integer: %s (%d)\en", yytext,
229            atoi(yytext));
230}
231
232{DIGIT}+"."{DIGIT}* {
233        printf("A float: %s (%g)\en", yytext,
234            atof(yytext));
235}
236
237if|then|begin|end|procedure|function {
238        printf("A keyword: %s\en", yytext);
239}
240
241{ID}    printf("An identifier: %s\en", yytext);
242
243"+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);
244
245"{"[^}\en]*"}"     /* eat up one-line comments */
246
247[ \et\en]+          /* eat up whitespace */
248
249\&.       printf("Unrecognized character: %s\en", yytext);
250
251%%
252
int
main(int argc, char *argv[])
{
        ++argv; --argc;  /* skip over program name */
        if (argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
        return 0;
}
263.Ed
264.Pp
This is the beginning of a simple scanner for a language like Pascal.
266It identifies different types of
267.Em tokens
268and reports on what it has seen.
269.Pp
270The details of this example will be explained in the following sections.
271.Sh FORMAT OF THE INPUT FILE
272The
273.Nm
274input file consists of three sections, separated by a line with just
275.Qq %%
276in it:
277.Bd -unfilled -offset indent
278definitions
279%%
280rules
281%%
282user code
283.Ed
284.Pp
285The
286.Em definitions
287section contains declarations of simple
288.Em name
289definitions to simplify the scanner specification, and declarations of
290.Em start conditions ,
291which are explained in a later section.
292.Pp
293Name definitions have the form:
294.Pp
295.D1 name definition
296.Pp
297The
298.Qq name
299is a word beginning with a letter or an underscore
300.Pq Sq _
301followed by zero or more letters, digits,
302.Sq _ ,
303or
304.Sq -
305.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and to continue to the end of the line.
308The definition can subsequently be referred to using
309.Qq {name} ,
310which will expand to
311.Qq (definition) .
312For example:
313.Bd -literal -offset indent
314DIGIT    [0-9]
315ID       [a-z][a-z0-9]*
316.Ed
317.Pp
318This defines
319.Qq DIGIT
320to be a regular expression which matches a single digit, and
321.Qq ID
322to be a regular expression which matches a letter
323followed by zero-or-more letters-or-digits.
324A subsequent reference to
325.Pp
326.Dl {DIGIT}+"."{DIGIT}*
327.Pp
328is identical to
329.Pp
330.Dl ([0-9])+"."([0-9])*
331.Pp
332and matches one-or-more digits followed by a
333.Sq .\&
334followed by zero-or-more digits.
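.Pp
The substitution can be illustrated with a small C routine that textually
replaces each {name} with (definition).
This is a simplified sketch and not a description of how
.Nm
itself is implemented:
.Fn expand_names
and
.Vt struct name_def
are hypothetical, and the routine does not treat quoted strings,
character classes, or repeat counts such as r{2,5} specially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one line of the definitions section. */
struct name_def {
	const char *name;
	const char *definition;
};

/*
 * Expands every {name} in pattern to (definition), writing into out.
 * Returns 0 on success, -1 if out is too small, a brace is unbalanced,
 * or a name is unknown.
 */
static int
expand_names(const struct name_def *defs, size_t ndefs, const char *pattern,
    char *out, size_t outsize)
{
	size_t used = 0, i, nlen, dlen;
	const char *p = pattern, *close;

	assert(defs != NULL && pattern != NULL && out != NULL);
	while (*p != '\0') {
		if (*p != '{') {
			/* Ordinary character: copy it through. */
			if (used + 1 >= outsize)
				return -1;
			out[used++] = *p++;
			continue;
		}
		if ((close = strchr(p, '}')) == NULL)
			return -1;
		nlen = (size_t)(close - p - 1);
		for (i = 0; i < ndefs; i++)
			if (strlen(defs[i].name) == nlen &&
			    strncmp(defs[i].name, p + 1, nlen) == 0)
				break;
		if (i == ndefs)
			return -1;
		dlen = strlen(defs[i].definition);
		if (used + dlen + 2 >= outsize)
			return -1;
		/* The parentheses keep the definition a single unit. */
		out[used++] = '(';
		memcpy(out + used, defs[i].definition, dlen);
		used += dlen;
		out[used++] = ')';
		p = close + 1;
	}
	if (used >= outsize)
		return -1;
	out[used] = '\0';
	return 0;
}
```

The added parentheses are what make a trailing operator such as
.Sq +
apply to the whole definition rather than to its last element.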
335.Pp
336The
337.Em rules
338section of the
339.Nm
340input contains a series of rules of the form:
341.Pp
342.Dl pattern	action
343.Pp
344The pattern must be unindented and the action must begin
345on the same line.
346.Pp
347See below for a further description of patterns and actions.
348.Pp
349Finally, the user code section is simply copied to
350.Pa lex.yy.c
351verbatim.
352It is used for companion routines which call or are called by the scanner.
353The presence of this section is optional;
354if it is missing, the second
355.Qq %%
356in the input file may be skipped too.
357.Pp
358In the definitions and rules sections, any indented text or text enclosed in
359.Sq %{
360and
361.Sq %}
362is copied verbatim to the output
363.Pq with the %{}'s removed .
364The %{}'s must appear unindented on lines by themselves.
365.Pp
366In the rules section,
367any indented or %{} text appearing before the first rule may be used to
368declare variables which are local to the scanning routine and
369.Pq after the declarations
370code which is to be executed whenever the scanning routine is entered.
371Other indented or %{} text in the rule section is still copied to the output,
372but its meaning is not well-defined and it may well cause compile-time
373errors (this feature is present for
374.Tn POSIX
375compliance; see below for other such features).
376.Pp
377In the definitions section
378.Pq but not in the rules section ,
379an unindented comment
380(i.e., a line beginning with
381.Qq /* )
382is also copied verbatim to the output up to the next
383.Qq */ .
384.Sh PATTERNS
385The patterns in the input are written using an extended set of regular
386expressions.
387These are:
388.Bl -tag -width "XXXXXXXX"
389.It x
390Match the character
391.Sq x .
392.It .\&
393Any character
394.Pq byte
395except newline.
396.It [xyz]
397A
398.Qq character class ;
399in this case, the pattern matches either an
400.Sq x ,
401a
402.Sq y ,
403or a
404.Sq z .
405.It [abj-oZ]
406A
407.Qq character class
408with a range in it; matches an
409.Sq a ,
410a
411.Sq b ,
412any letter from
413.Sq j
414through
415.Sq o ,
416or a
417.Sq Z .
418.It [^A-Z]
419A
420.Qq negated character class ,
421i.e., any character but those in the class.
422In this case, any character EXCEPT an uppercase letter.
423.It [^A-Z\en]
424Any character EXCEPT an uppercase letter or a newline.
425.It r*
426Zero or more r's, where
427.Sq r
428is any regular expression.
429.It r+
430One or more r's.
431.It r?
432Zero or one r's (that is,
433.Qq an optional r ) .
434.It r{2,5}
435Anywhere from two to five r's.
436.It r{2,}
437Two or more r's.
438.It r{4}
439Exactly 4 r's.
440.It {name}
441The expansion of the
442.Qq name
443definition
444.Pq see above .
445.It \&"[xyz]\e\&"foo\&"
446The literal string: [xyz]"foo.
447.It \eX
448If
449.Sq X
450is an
451.Sq a ,
452.Sq b ,
453.Sq f ,
454.Sq n ,
455.Sq r ,
456.Sq t ,
457or
458.Sq v ,
459then the ANSI-C interpretation of
460.Sq \eX .
461Otherwise, a literal
462.Sq X
463(used to escape operators such as
464.Sq * ) .
465.It \e0
466A NUL character
467.Pq ASCII code 0 .
468.It \e123
469The character with octal value 123.
470.It \ex2a
471The character with hexadecimal value 2a.
472.It (r)
473Match an
474.Sq r ;
475parentheses are used to override precedence
476.Pq see below .
477.It rs
478The regular expression
479.Sq r
480followed by the regular expression
481.Sq s ;
482called
483.Qq concatenation .
484.It r|s
485Either an
486.Sq r
487or an
488.Sq s .
489.It r/s
490An
491.Sq r ,
492but only if it is followed by an
493.Sq s .
494The text matched by
495.Sq s
496is included when determining whether this rule is the
497.Qq longest match ,
498but is then returned to the input before the action is executed.
499So the action only sees the text matched by
500.Sq r .
501This type of pattern is called
502.Qq trailing context .
503(There are some combinations of r/s that
504.Nm
505cannot match correctly; see notes in the
506.Sx BUGS
507section below regarding
508.Qq dangerous trailing context . )
509.It ^r
510An
511.Sq r ,
512but only at the beginning of a line
513(i.e., just starting to scan, or right after a newline has been scanned).
514.It r$
515An
516.Sq r ,
517but only at the end of a line
518.Pq i.e., just before a newline .
519Equivalent to
520.Qq r/\en .
521.Pp
522Note that
523.Nm flex Ns 's
524notion of
525.Qq newline
526is exactly whatever the C compiler used to compile
527.Nm
528interprets
529.Sq \en
530as.
531.\" In particular, on some DOS systems you must either filter out \er's in the
532.\" input yourself, or explicitly use r/\er\en for
533.\" .Qq r$ .
534.It <s>r
535An
536.Sq r ,
537but only in start condition
538.Sq s
539.Pq see below for discussion of start conditions .
540.It <s1,s2,s3>r
541The same, but in any of start conditions s1, s2, or s3.
542.It <*>r
543An
544.Sq r
545in any start condition, even an exclusive one.
546.It <<EOF>>
547An end-of-file.
548.It <s1,s2><<EOF>>
549An end-of-file when in start condition s1 or s2.
550.El
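.Pp
Because the backslash escapes listed above follow the ANSI C interpretation,
a C compiler can be used to check which character a given escape denotes.
The sketch below assumes an ASCII execution character set;
.Fn check_escapes
is a hypothetical helper, not something
.Nm
provides.

```c
#include <assert.h>

/*
 * Confirms how the C compiler reads the NUL, octal, and hexadecimal
 * escapes shown in the pattern list (ASCII assumed).
 */
static void
check_escapes(void)
{
	assert('\0' == 0);	/* NUL character */
	assert('\123' == 'S');	/* octal 123 is decimal 83 */
	assert('\x2a' == '*');	/* hexadecimal 2a is decimal 42 */
}
```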
551.Pp
552Note that inside of a character class, all regular expression operators
553lose their special meaning except escape
554.Pq Sq \e
555and the character class operators,
556.Sq - ,
557.Sq ]\& ,
558and, at the beginning of the class,
559.Sq ^ .
560.Pp
561The regular expressions listed above are grouped according to
562precedence, from highest precedence at the top to lowest at the bottom.
563Those grouped together have equal precedence.
564For example,
565.Pp
566.D1 foo|bar*
567.Pp
568is the same as
569.Pp
570.D1 (foo)|(ba(r*))
571.Pp
572since the
573.Sq *
574operator has higher precedence than concatenation,
575and concatenation higher than alternation
576.Pq Sq |\& .
577This pattern therefore matches
578.Em either
579the string
580.Qq foo
581.Em or
582the string
583.Qq ba
584followed by zero-or-more r's.
585To match
586.Qq foo
587or zero-or-more "bar"'s,
588use:
589.Pp
590.D1 foo|(bar)*
591.Pp
592and to match zero-or-more "foo"'s-or-"bar"'s:
593.Pp
594.D1 (foo|bar)*
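.Pp
The three groupings can be tried out with the POSIX
.Xr regcomp 3
interface, whose extended regular expressions rank repetition,
concatenation, and alternation in the same order.
This is a stand-in for experimentation and not the matcher that
.Nm
generates;
.Fn full_match
is a hypothetical helper, and the patterns passed to it are anchored with
.Sq ^
and
.Sq $
so that the whole string must match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text matches the POSIX extended regex pattern, else 0. */
static int
full_match(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return 0;	/* treat a bad pattern as "no match" */
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With this helper,
.Qq ^(foo|bar*)$
accepts
.Qq barrr
but rejects
.Qq barbar ,
.Qq ^(foo|(bar)*)$
accepts
.Qq barbar
but rejects
.Qq foobar ,
and
.Qq ^(foo|bar)*$
accepts
.Qq foobar .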
595.Pp
596In addition to characters and ranges of characters, character classes
597can also contain character class
598.Em expressions .
599These are expressions enclosed inside
600.Sq [:
601and
602.Sq :]
603delimiters (which themselves must appear between the
604.Sq \&[
605and
606.Sq ]\&
607of the
608character class; other elements may occur inside the character class, too).
609The valid expressions are:
610.Bd -unfilled -offset indent
611[:alnum:] [:alpha:] [:blank:]
612[:cntrl:] [:digit:] [:graph:]
613[:lower:] [:print:] [:punct:]
614[:space:] [:upper:] [:xdigit:]
615.Ed
616.Pp
617These expressions all designate a set of characters equivalent to
618the corresponding standard C
619.Fn isXXX
620function.
621For example, [:alnum:] designates those characters for which
622.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric character.
624Some systems don't provide
625.Xr isblank 3 ,
626so
627.Nm
628defines [:blank:] as a blank or a tab.
629.Pp
630For example, the following character classes are all equivalent:
631.Bd -unfilled -offset indent
632[[:alnum:]]
633[[:alpha:][:digit:]]
634[[:alpha:]0-9]
635[a-zA-Z0-9]
636.Ed
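.Pp
The equivalence of the first and last forms can be checked from C.
The sketch below assumes the default
.Qq C
locale and an ASCII character set, and only looks at 7-bit characters;
.Fn in_explicit_alnum
and
.Fn check_alnum_equivalence
are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>

/* Hand-written membership test for the class [a-zA-Z0-9]. */
static int
in_explicit_alnum(int c)
{
	return (c >= 'a' && c <= 'z') ||
	    (c >= 'A' && c <= 'Z') ||
	    (c >= '0' && c <= '9');
}

/* Asserts that isalnum(3) and the explicit ranges agree on 0..127. */
static void
check_alnum_equivalence(void)
{
	int c;

	for (c = 0; c < 128; c++)
		assert((isalnum(c) != 0) == in_explicit_alnum(c));
}
```

In other locales
.Xr isalnum 3
may accept additional characters, in which case the explicit ranges and
[:alnum:] need not describe the same set.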
637.Pp
638If the scanner is case-insensitive (the
639.Fl i
640flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
641.Pp
642Some notes on patterns:
643.Bl -dash
644.It
645A negated character class such as the example
646.Qq [^A-Z]
647above will match a newline unless "\en"
648.Pq or an equivalent escape sequence
649is one of the characters explicitly present in the negated character class
650(e.g.,
651.Qq [^A-Z\en] ) .
652This is unlike how many other regular expression tools treat negated character
653classes, but unfortunately the inconsistency is historically entrenched.
654Matching newlines means that a pattern like
655.Qq [^"]*
656can match the entire input unless there's another quote in the input.
657.It
658A rule can have at most one instance of trailing context
659(the
660.Sq /
661operator or the
662.Sq $
663operator).
664The start condition,
665.Sq ^ ,
666and
667.Qq <<EOF>>
patterns can only occur at the beginning of a pattern and, like
669.Sq /
670and
671.Sq $ ,
672cannot be grouped inside parentheses.
673A
674.Sq ^
675which does not occur at the beginning of a rule or a
676.Sq $
677which does not occur at the end of a rule loses its special properties
678and is treated as a normal character.
679.It
680The following are illegal:
681.Bd -unfilled -offset indent
682foo/bar$
683<sc1>foo<sc2>bar
684.Ed
685.Pp
Note that the first of these can be written
687.Qq foo/bar\en .
688.It
689The following will result in
690.Sq $
691or
692.Sq ^
693being treated as a normal character:
694.Bd -unfilled -offset indent
695foo|(bar$)
696foo|^bar
697.Ed
698.Pp
699If what's wanted is a
700.Qq foo
701or a bar-followed-by-a-newline, the following could be used
702(the special
703.Sq |\&
704action is explained below):
705.Bd -unfilled -offset indent
706foo      |
707bar$     /* action goes here */
708.Ed
709.Pp
710A similar trick will work for matching a foo or a
711bar-at-the-beginning-of-a-line.
712.El
713.Sh HOW THE INPUT IS MATCHED
714When the generated scanner is run,
715it analyzes its input looking for strings which match any of its patterns.
716If it finds more than one match,
717it takes the one matching the most text
718(for trailing context rules, this includes the length of the trailing part,
719even though it will then be returned to the input).
720If it finds two or more matches of the same length,
721the rule listed first in the
722.Nm
723input file is chosen.
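.Pp
The selection rule (longest match first, earliest rule on a tie) can be
sketched in C by letting literal strings stand in for patterns.
This is an illustration of the rule only; the generated scanner uses
tables rather than a loop like this, and
.Fn pick_rule
is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns the index of the winning rule for the text at input, or -1 if
 * no rule matches.  Each "rule" is a literal string; *matched receives
 * the length of the winning match.
 */
static int
pick_rule(const char *const rules[], size_t nrules, const char *input,
    size_t *matched)
{
	size_t i, len, best_len = 0;
	int best = -1;

	assert(rules != NULL && input != NULL && matched != NULL);
	for (i = 0; i < nrules; i++) {
		len = strlen(rules[i]);
		if (len == 0 || strncmp(input, rules[i], len) != 0)
			continue;
		/* Strictly longer wins; on a tie the earlier rule stays. */
		if (len > best_len) {
			best_len = len;
			best = (int)i;
		}
	}
	*matched = best_len;
	return best;
}
```

Given the rules
.Qq if ,
.Qq iffy ,
and a second
.Qq if ,
the input
.Qq iffy x
selects the second rule with a match of length 4, while
.Qq if x
selects the first rule, never the third.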
724.Pp
725Once the match is determined, the text corresponding to the match
726(called the
727.Em token )
728is made available in the global character pointer
729.Fa yytext ,
730and its length in the global integer
731.Fa yyleng .
732The
733.Em action
734corresponding to the matched pattern is then executed
735.Pq a more detailed description of actions follows ,
736and then the remaining input is scanned for another match.
737.Pp
738If no match is found, then the default rule is executed:
739the next character in the input is considered matched and
740copied to the standard output.
741Thus, the simplest legal
742.Nm
743input is:
744.Pp
745.D1 %%
746.Pp
747which generates a scanner that simply copies its input
748.Pq one character at a time
749to its output.
750.Pp
751Note that
752.Fa yytext
753can be defined in two different ways:
754either as a character pointer or as a character array.
755Which definition
756.Nm
757uses can be controlled by including one of the special directives
758.Dq %pointer
759or
760.Dq %array
761in the first
762.Pq definitions
section of
.Nm
input.
764The default is
765.Dq %pointer ,
766unless the
767.Fl l
768.Nm lex
769compatibility option is used, in which case
770.Fa yytext
771will be an array.
772The advantage of using
773.Dq %pointer
774is substantially faster scanning and no buffer overflow when matching
775very large tokens
776.Pq unless not enough dynamic memory is available .
777The disadvantage is that actions are restricted in how they can modify
778.Fa yytext
779.Pq see the next section ,
780and calls to the
781.Fn unput
782function destroy the present contents of
783.Fa yytext ,
784which can be a considerable porting headache when moving between different
785.Nm lex
786versions.
787.Pp
788The advantage of
789.Dq %array
790is that
791.Fa yytext
792can be modified as much as wanted, and calls to
793.Fn unput
794do not destroy
795.Fa yytext
796.Pq see below .
797Furthermore, existing
798.Nm lex
799programs sometimes access
800.Fa yytext
801externally using declarations of the form:
802.Pp
803.D1 extern char yytext[];
804.Pp
805This definition is erroneous when used with
806.Dq %pointer ,
807but correct for
808.Dq %array .
809.Pp
810.Dq %array
811defines
812.Fa yytext
813to be an array of
814.Dv YYLMAX
815characters, which defaults to a fairly large value.
816The size can be changed by simply #define'ing
817.Dv YYLMAX
818to a different value in the first section of
819.Nm
820input.
821As mentioned above, with
822.Dq %pointer
.Fa yytext
grows dynamically to accommodate large tokens.
824While this means a
825.Dq %pointer
826scanner can accommodate very large tokens
827.Pq such as matching entire blocks of comments ,
828bear in mind that each time the scanner must resize
829.Fa yytext
830it also must rescan the entire token from the beginning, so matching such
831tokens can prove slow.
832.Fa yytext
833presently does not dynamically grow if a call to
834.Fn unput
835results in too much text being pushed back; instead, a run-time error results.
836.Pp
837Also note that
838.Dq %array
839cannot be used with C++ scanner classes
840.Pq the c++ option; see below .
841.Sh ACTIONS
842Each pattern in a rule has a corresponding action,
843which can be any arbitrary C statement.
844The pattern ends at the first non-escaped whitespace character;
845the remainder of the line is its action.
846If the action is empty,
847then when the pattern is matched the input token is simply discarded.
848For example, here is the specification for a program
849which deletes all occurrences of
850.Qq zap me
851from its input:
852.Bd -literal -offset indent
853%%
854"zap me"
855.Ed
856.Pp
857(It will copy all other characters in the input to the output since
858they will be matched by the default rule.)
859.Pp
860Here is a program which compresses multiple blanks and tabs down to
861a single blank, and throws away whitespace found at the end of a line:
862.Bd -literal -offset indent
863%%
864[ \et]+        putchar(' ');
865[ \et]+$       /* ignore this token */
866.Ed
867.Pp
868If the action contains a
869.Sq { ,
870then the action spans till the balancing
871.Sq }
872is found, and the action may cross multiple lines.
873.Nm
874knows about C strings and comments and won't be fooled by braces found
875within them, but also allows actions to begin with
876.Sq %{
877and will consider the action to be all the text up to the next
878.Sq %}
879.Pq regardless of ordinary braces inside the action .
880.Pp
881An action consisting solely of a vertical bar
882.Pq Sq |\&
883means
884.Qq same as the action for the next rule .
885See below for an illustration.
886.Pp
887Actions can include arbitrary C code,
888including return statements to return a value to whatever routine called
889.Fn yylex .
890Each time
891.Fn yylex
892is called, it continues processing tokens from where it last left off
893until it either reaches the end of the file or executes a return.
894.Pp
895Actions are free to modify
896.Fa yytext
897except for lengthening it
898(adding characters to its end \- these will overwrite later characters in the
899input stream).
900This, however, does not apply when using
901.Dq %array
902.Pq see above ;
903in that case,
904.Fa yytext
905may be freely modified in any way.
906.Pp
907Actions are free to modify
908.Fa yyleng
909except they should not do so if the action also includes use of
910.Fn yymore
911.Pq see below .
912.Pp
913There are a number of special directives which can be included within
914an action:
915.Bl -tag -width Ds
916.It ECHO
917Copies
918.Fa yytext
919to the scanner's output.
920.It BEGIN
921Followed by the name of a start condition, places the scanner in the
922corresponding start condition
923.Pq see below .
924.It REJECT
925Directs the scanner to proceed on to the
926.Qq second best
927rule which matched the input
928.Pq or a prefix of the input .
929The rule is chosen as described above in
930.Sx HOW THE INPUT IS MATCHED ,
931and
932.Fa yytext
933and
934.Fa yyleng
935set up appropriately.
936It may either be one which matched as much text
937as the originally chosen rule but came later in the
938.Nm
939input file, or one which matched less text.
940For example, the following will both count the
941words in the input and call the routine
942.Fn special
943whenever
944.Qq frob
945is seen:
946.Bd -literal -offset indent
        int word_count = 0;
948%%
949
950frob        special(); REJECT;
951[^ \et\en]+   ++word_count;
952.Ed
953.Pp
954Without the
955.Em REJECT ,
956any "frob"'s in the input would not be counted as words,
957since the scanner normally executes only one action per token.
958Multiple
959.Em REJECT Ns 's
960are allowed,
961each one finding the next best choice to the currently active rule.
962For example, when the following scanner scans the token
963.Qq abcd ,
964it will write
965.Qq abcdabcaba
966to the output:
967.Bd -literal -offset indent
968%%
969a        |
970ab       |
971abc      |
972abcd     ECHO; REJECT;
973\&.|\en     /* eat up any unmatched character */
974.Ed
975.Pp
976(The first three rules share the fourth's action since they use
977the special
978.Sq |\&
979action.)
980.Em REJECT
981is a particularly expensive feature in terms of scanner performance;
982if it is used in any of the scanner's actions it will slow down
983all of the scanner's matching.
984Furthermore,
985.Em REJECT
986cannot be used with the
987.Fl Cf
988or
989.Fl CF
990options
991.Pq see below .
992.Pp
993Note also that unlike the other special actions,
994.Em REJECT
995is a
996.Em branch ;
997code immediately following it in the action will not be executed.
998.It yymore()
999Tells the scanner that the next time it matches a rule, the corresponding
1000token should be appended onto the current value of
1001.Fa yytext
1002rather than replacing it.
1003For example, given the input
1004.Qq mega-kludge
1005the following will write
1006.Qq mega-mega-kludge
1007to the output:
1008.Bd -literal -offset indent
1009%%
1010mega-    ECHO; yymore();
1011kludge   ECHO;
1012.Ed
1013.Pp
1014First
1015.Qq mega-
1016is matched and echoed to the output.
1017Then
1018.Qq kludge
1019is matched, but the previous
1020.Qq mega-
1021is still hanging around at the beginning of
1022.Fa yytext
1023so the
1024.Em ECHO
1025for the
1026.Qq kludge
1027rule will actually write
1028.Qq mega-kludge .
1029.Pp
1030Two notes regarding use of
1031.Fn yymore :
1032First,
1033.Fn yymore
1034depends on the value of
1035.Fa yyleng
1036correctly reflecting the size of the current token, so
1037.Fa yyleng
1038must not be modified when using
1039.Fn yymore .
1040Second, the presence of
1041.Fn yymore
1042in the scanner's action entails a minor performance penalty in the
1043scanner's matching speed.
1044.It yyless(n)
1045Returns all but the first
1046.Ar n
1047characters of the current token back to the input stream, where they
1048will be rescanned when the scanner looks for the next match.
1049.Fa yytext
1050and
1051.Fa yyleng
1052are adjusted appropriately (e.g.,
1053.Fa yyleng
1054will now be equal to
1055.Ar n ) .
1056For example, on the input
1057.Qq foobar
1058the following will write out
1059.Qq foobarbar :
1060.Bd -literal -offset indent
1061%%
1062foobar    ECHO; yyless(3);
1063[a-z]+    ECHO;
1064.Ed
1065.Pp
1066An argument of 0 to
.Fn yyless
1068will cause the entire current input string to be scanned again.
Unless the way the scanner subsequently processes its input has been changed
1070(using
1071.Em BEGIN ,
1072for example),
1073this will result in an endless loop.
1074.Pp
1075Note that
.Fn yyless
1077is a macro and can only be used in the
1078.Nm
1079input file, not from other source files.
1080.It unput(c)
1081Puts the character
1082.Ar c
1083back into the input stream.
1084It will be the next character scanned.
1085The following action will take the current token and cause it
1086to be rescanned enclosed in parentheses.
1087.Bd -literal -offset indent
1088{
1089        int i;
1090        char *yycopy;
1091
1092        /* Copy yytext because unput() trashes yytext */
1093        if ((yycopy = strdup(yytext)) == NULL)
1094                err(1, NULL);
1095        unput(')');
1096        for (i = yyleng - 1; i >= 0; --i)
1097                unput(yycopy[i]);
1098        unput('(');
1099        free(yycopy);
1100}
1101.Ed
1102.Pp
1103Note that since each
1104.Fn unput
1105puts the given character back at the beginning of the input stream,
1106pushing back strings must be done back-to-front.
1107.Pp
1108An important potential problem when using
1109.Fn unput
1110is that if using
1111.Dq %pointer
1112.Pq the default ,
1113a call to
1114.Fn unput
1115destroys the contents of
1116.Fa yytext ,
1117starting with its rightmost character and devouring one character to
1118the left with each call.
1119If the value of
1120.Fa yytext
1121should be preserved after a call to
1122.Fn unput
1123.Pq as in the above example ,
1124it must either first be copied elsewhere, or the scanner must be built using
1125.Dq %array
1126instead (see
1127.Sx HOW THE INPUT IS MATCHED ) .
1128.Pp
Finally, note that
.Dv EOF
cannot be pushed back with
.Fn unput
in an attempt to mark the input stream with an end-of-file.
1131.It input()
1132Reads the next character from the input stream.
1133For example, the following is one way to eat up C comments:
1134.Bd -literal -offset indent
1135%%
1136"/*" {
1137        int c;
1138
1139        for (;;) {
1140                while ((c = input()) != '*' && c != EOF)
1141                        ; /* eat up text of comment */
1142
1143                if (c == '*') {
1144                        while ((c = input()) == '*')
1145                                ;
1146                        if (c == '/')
1147                                break; /* found the end */
1148                }
1149
1150                if (c == EOF) {
1151                        errx(1, "EOF in comment");
1152                        break;
1153                }
1154        }
1155}
1156.Ed
1157.Pp
1158(Note that if the scanner is compiled using C++, then
1159.Fn input
1160is instead referred to as
1161.Fn yyinput ,
1162in order to avoid a name clash with the C++ stream by the name of input.)
1163.It YY_FLUSH_BUFFER
1164Flushes the scanner's internal buffer
1165so that the next time the scanner attempts to match a token,
1166it will first refill the buffer using
1167.Dv YY_INPUT
1168(see
1169.Sx THE GENERATED SCANNER ,
1170below).
1171This action is a special case of the more general
1172.Fn yy_flush_buffer
1173function, described below in the section
1174.Sx MULTIPLE INPUT BUFFERS .
1175.It yyterminate()
1176Can be used in lieu of a return statement in an action.
1177It terminates the scanner and returns a 0 to the scanner's caller, indicating
1178.Qq all done .
1179By default,
1180.Fn yyterminate
1181is also called when an end-of-file is encountered.
1182It is a macro and may be redefined.
1183.El
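.Pp
The multiple-REJECT example in the list above, where scanning
.Qq abcd
writes
.Qq abcdabcaba ,
can be traced with a short C simulation.
It hard-codes the four literal rules and the catch-all rule from that
example and is only a model of the documented ordering, not of how the
generated scanner implements REJECT;
.Fn scan_with_reject
is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simulates the literal rules "a", "ab", "abc", "abcd" (all sharing the
 * action "ECHO; REJECT;") followed by a catch-all rule that discards one
 * character.  Writes what ECHO would print into out.
 */
static void
scan_with_reject(const char *input, char *out, size_t outsize)
{
	static const char *const rules[] = { "a", "ab", "abc", "abcd" };
	const size_t nrules = sizeof(rules) / sizeof(rules[0]);
	size_t pos = 0, used = 0, len, i;

	assert(input != NULL && out != NULL && outsize > 0);
	while (input[pos] != '\0') {
		/*
		 * Try candidates from longest to shortest; each one echoes
		 * and then rejects, handing control to the next best match.
		 */
		for (i = nrules; i-- > 0; ) {
			len = strlen(rules[i]);
			if (strncmp(input + pos, rules[i], len) != 0)
				continue;
			if (used + len < outsize) {
				/* ECHO the text this rule matched. */
				memcpy(out + used, input + pos, len);
				used += len;
			}
			/* REJECT: fall through to the next shorter rule. */
		}
		/* All literal rules rejected: the catch-all eats one char. */
		pos++;
	}
	out[used] = '\0';
}
```

After the four echoes at the first position, the catch-all rule consumes
.Sq a ,
and the remaining characters match no literal rule, so nothing more is
written.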
1184.Sh THE GENERATED SCANNER
1185The output of
1186.Nm
1187is the file
1188.Pa lex.yy.c ,
1189which contains the scanning routine
1190.Fn yylex ,
1191a number of tables used by it for matching tokens,
1192and a number of auxiliary routines and macros.
1193By default,
1194.Fn yylex
1195is declared as follows:
1196.Bd -unfilled -offset indent
1197int yylex()
1198{
1199    ... various definitions and the actions in here ...
1200}
1201.Ed
1202.Pp
1203(If the environment supports function prototypes, then it will
1204be "int yylex(void)".)
1205This definition may be changed by defining the
1206.Dv YY_DECL
1207macro.
1208For example:
1209.Bd -literal -offset indent
1210#define YY_DECL float lexscan(a, b) float a, b;
1211.Ed
1212.Pp
1213would give the scanning routine the name
1214.Em lexscan ,
1215returning a float, and taking two floats as arguments.
1216Note that if arguments are given to the scanning routine using a
1217K&R-style/non-prototyped function declaration,
1218the definition must be terminated with a semi-colon
1219.Pq Sq ;\& .
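.Pp
With a compiler that accepts function prototypes, the same routine can be
declared in prototype form, in which case no trailing semicolon is needed.
A minimal sketch, reusing the name
.Em lexscan
from the example above:
.Bd -literal -offset indent
%{
/* prototype-style declaration: no trailing semicolon */
#define YY_DECL float lexscan(float a, float b)
%}
.Ed
.Pp
Callers then invoke
.Fn lexscan
rather than
.Fn yylex .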
1220.Pp
1221Whenever
1222.Fn yylex
1223is called, it scans tokens from the global input file
1224.Pa yyin
1225.Pq which defaults to stdin .
1226It continues until it either reaches an end-of-file
1227.Pq at which point it returns the value 0
1228or one of its actions executes a
1229.Em return
1230statement.
1231.Pp
1232If the scanner reaches an end-of-file, subsequent calls are undefined
1233unless either
1234.Em yyin
1235is pointed at a new input file
1236.Pq in which case scanning continues from that file ,
1237or
1238.Fn yyrestart
1239is called.
1240.Fn yyrestart
1241takes one argument, a
1242.Fa FILE *
1243pointer (which can be nil, if
1244.Dv YY_INPUT
1245has been set up to scan from a source other than
1246.Em yyin ) ,
1247and initializes
1248.Em yyin
1249for scanning from that file.
1250Essentially there is no difference between just assigning
1251.Em yyin
1252to a new input file or using
1253.Fn yyrestart
1254to do so; the latter is available for compatibility with previous versions of
1255.Nm ,
1256and because it can be used to switch input files in the middle of scanning.
1257It can also be used to throw away the current input buffer,
1258by calling it with an argument of
1259.Em yyin ;
but it is better to use
1261.Dv YY_FLUSH_BUFFER
1262.Pq see above .
1263Note that
1264.Fn yyrestart
1265does not reset the start condition to
1266.Em INITIAL
1267(see
1268.Sx START CONDITIONS ,
1269below).
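.Pp
The following sketch shows one way to scan several files in turn by calling
.Fn yyrestart
between them.
The
.Fn main
function goes in the user code section of the scanner;
the file names are taken from the command line,
and error handling is kept deliberately minimal:
.Bd -literal -offset indent
%option noyywrap
%%
\&.|\en    ECHO;
%%
int
main(int argc, char *argv[])
{
        int i;

        for (i = 1; i < argc; i++) {
                FILE *fp = fopen(argv[i], "r");

                if (fp == NULL)
                        continue;       /* skip unreadable files */
                yyrestart(fp);          /* make fp the new yyin */
                while (yylex() != 0)
                        ;               /* 0 means end-of-file */
                fclose(fp);
        }
        return 0;
}
.Ed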
1270.Pp
1271If
1272.Fn yylex
1273stops scanning due to executing a
1274.Em return
1275statement in one of the actions, the scanner may then be called again and it
1276will resume scanning where it left off.
1277.Pp
1278By default
1279.Pq and for purposes of efficiency ,
1280the scanner uses block-reads rather than simple
1281.Xr getc 3
1282calls to read characters from
1283.Em yyin .
How it gets its input can be controlled by defining the
1285.Dv YY_INPUT
1286macro.
1287.Dv YY_INPUT Ns 's
1288calling sequence is
1289.Qq YY_INPUT(buf,result,max_size) .
1290Its action is to place up to
1291.Dv max_size
1292characters in the character array
1293.Em buf
1294and return in the integer variable
1295.Em result
1296either the number of characters read or the constant
1297.Dv YY_NULL
1298(0 on
1299.Ux
1300systems)
1301to indicate
1302.Dv EOF .
1303The default
1304.Dv YY_INPUT
1305reads from the global file-pointer
1306.Qq yyin .
1307.Pp
1308A sample definition of
1309.Dv YY_INPUT
1310.Pq in the definitions section of the input file :
1311.Bd -unfilled -offset indent
1312%{
1313#define YY_INPUT(buf,result,max_size) \e
1314{ \e
1315        int c = getchar(); \e
1316        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1317}
1318%}
1319.Ed
1320.Pp
1321This definition will change the input processing to occur
1322one character at a time.
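.Pp
.Dv YY_INPUT
need not read from a file at all.
As a hedged sketch, the following definition feeds the scanner from a block
of bytes held in memory.
The variables
.Va src
and
.Va src_left
are assumptions of this example; the program is expected to set them
before the first call to
.Fn yylex :
.Bd -literal -offset indent
%{
#include <string.h>

static const char *src;         /* hypothetical: next unread byte */
static size_t src_left;         /* hypothetical: bytes remaining */

/* Hand over at most max_size bytes; a result of 0 is YY_NULL. */
#define YY_INPUT(buf,result,max_size) \e
{ \e
        size_t n = src_left; \e
        if (n > (size_t)(max_size)) \e
                n = (size_t)(max_size); \e
        if (n > 0) { \e
                memcpy((buf), src, n); \e
                src += n; \e
                src_left -= n; \e
        } \e
        (result) = n; \e
}
%}
.Ed
.Pp
For most in-memory scanning the buffer routines described in
.Sx MULTIPLE INPUT BUFFERS
are likely to be more convenient; this form is mainly of interest when
the data arrives in pieces.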
1323.Pp
1324When the scanner receives an end-of-file indication from
1325.Dv YY_INPUT ,
1326it then checks the
1327.Fn yywrap
1328function.
1329If
1330.Fn yywrap
1331returns false
1332.Pq zero ,
1333then it is assumed that the function has gone ahead and set up
1334.Em yyin
1335to point to another input file, and scanning continues.
1336If it returns true
1337.Pq non-zero ,
1338then the scanner terminates, returning 0 to its caller.
1339Note that in either case, the start condition remains unchanged;
1340it does not revert to
1341.Em INITIAL .
1342.Pp
1343If you do not supply your own version of
1344.Fn yywrap ,
1345then you must either use
1346.Dq %option noyywrap
1347(in which case the scanner behaves as though
1348.Fn yywrap
1349returned 1), or you must link with
1350.Fl lfl
1351to obtain the default version of the routine, which always returns 1.
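.Pp
A user-supplied
.Fn yywrap
might look like the following sketch, which moves on to the next file
named in a list.
The variable
.Va filelist
is an assumption of this example: a NULL-terminated array of file names
set up by the program before scanning, whose current entry is the file
already open as
.Em yyin :
.Bd -literal -offset indent
%{
static char **filelist;         /* hypothetical; set up by main() */
%}
%%
\&...rules...
%%
int
yywrap(void)
{
        FILE *fp;

        while (*filelist != NULL && *++filelist != NULL) {
                if ((fp = fopen(*filelist, "r")) != NULL) {
                        fclose(yyin);   /* done with the old file */
                        yyin = fp;
                        return 0;       /* more input: keep scanning */
                }
        }
        return 1;                       /* no files left: terminate */
}
.Ed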
1352.Pp
1353Three routines are available for scanning from in-memory buffers rather
1354than files:
1355.Fn yy_scan_string ,
1356.Fn yy_scan_bytes ,
1357and
1358.Fn yy_scan_buffer .
1359See the discussion of them below in the section
1360.Sx MULTIPLE INPUT BUFFERS .
1361.Pp
1362The scanner writes its
1363.Em ECHO
1364output to the
1365.Em yyout
1366global
1367.Pq default, stdout ,
1368which may be redefined by the user simply by assigning it to some other
1369.Va FILE
1370pointer.
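.Pp
For instance, a program could send
.Em ECHO
output to a log file.
This sketch belongs in the user code section of the scanner;
the file name
.Pa scan.log
is hypothetical:
.Bd -literal -offset indent
int
main(void)
{
        yyout = fopen("scan.log", "w");
        if (yyout == NULL)
                yyout = stdout;         /* fall back to the default */
        yylex();
        return 0;
}
.Ed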
1371.Sh START CONDITIONS
1372.Nm
1373provides a mechanism for conditionally activating rules.
1374Any rule whose pattern is prefixed with
1375.Qq Aq sc
1376will only be active when the scanner is in the start condition named
1377.Qq sc .
1378For example,
1379.Bd -literal -offset indent
1380<STRING>[^"]* { /* eat up the string body ... */
1381        ...
1382}
1383.Ed
1384.Pp
1385will be active only when the scanner is in the
1386.Qq STRING
1387start condition, and
1388.Bd -literal -offset indent
1389<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1390        ...
1391}
1392.Ed
1393.Pp
1394will be active only when the current start condition is either
1395.Qq INITIAL ,
1396.Qq STRING ,
1397or
1398.Qq QUOTE .
1399.Pp
1400Start conditions are declared in the definitions
1401.Pq first
1402section of the input using unindented lines beginning with either
1403.Sq %s
1404or
1405.Sq %x
1406followed by a list of names.
1407The former declares
1408.Em inclusive
1409start conditions, the latter
1410.Em exclusive
1411start conditions.
1412A start condition is activated using the
1413.Em BEGIN
1414action.
1415Until the next
1416.Em BEGIN
1417action is executed, rules with the given start condition will be active and
1418rules with other start conditions will be inactive.
1419If the start condition is inclusive,
1420then rules with no start conditions at all will also be active.
1421If it is exclusive,
1422then only rules qualified with the start condition will be active.
1423A set of rules contingent on the same exclusive start condition
1424describe a scanner which is independent of any of the other rules in the
1425.Nm
1426input.
1427Because of this, exclusive start conditions make it easy to specify
1428.Qq mini-scanners
1429which scan portions of the input that are syntactically different
1430from the rest
1431.Pq e.g., comments .
1432.Pp
1433If the distinction between inclusive and exclusive start conditions
1434is still a little vague, here's a simple example illustrating the
1435connection between the two.
1436The set of rules:
1437.Bd -literal -offset indent
1438%s example
1439%%
1440
1441<example>foo   do_something();
1442
1443bar            something_else();
1444.Ed
1445.Pp
1446is equivalent to
1447.Bd -literal -offset indent
1448%x example
1449%%
1450
1451<example>foo   do_something();
1452
1453<INITIAL,example>bar    something_else();
1454.Ed
1455.Pp
1456Without the
1457.Aq INITIAL,example
1458qualifier, the
1459.Dq bar
1460pattern in the second example wouldn't be active
1461.Pq i.e., couldn't match
1462when in start condition
1463.Dq example .
1464If we just used
1465.Aq example
1466to qualify
1467.Dq bar ,
1468though, then it would only be active in
1469.Dq example
1470and not in
1471.Em INITIAL ,
1472while in the first example it's active in both,
1473because in the first example the
1474.Dq example
1475start condition is an inclusive
1476.Pq Sq %s
1477start condition.
1478.Pp
1479Also note that the special start-condition specifier
1480.Sq Aq *
1481matches every start condition.
1482Thus, the above example could also have been written:
1483.Bd -literal -offset indent
1484%x example
1485%%
1486
1487<example>foo   do_something();
1488
1489<*>bar         something_else();
1490.Ed
1491.Pp
1492The default rule (to
1493.Em ECHO
1494any unmatched character) remains active in start conditions.
1495It is equivalent to:
1496.Bd -literal -offset indent
1497<*>.|\en     ECHO;
1498.Ed
1499.Pp
1500.Dq BEGIN(0)
1501returns to the original state where only the rules with
1502no start conditions are active.
1503This state can also be referred to as the start-condition
1504.Em INITIAL ,
1505so
1506.Dq BEGIN(INITIAL)
1507is equivalent to
1508.Dq BEGIN(0) .
1509(The parentheses around the start condition name are not required but
1510are considered good style.)
1511.Pp
1512.Em BEGIN
1513actions can also be given as indented code at the beginning
1514of the rules section.
1515For example, the following will cause the scanner to enter the
1516.Qq SPECIAL
1517start condition whenever
1518.Fn yylex
1519is called and the global variable
1520.Fa enter_special
1521is true:
1522.Bd -literal -offset indent
        int enter_special;
1524
1525%x SPECIAL
1526%%
1527        if (enter_special)
1528                BEGIN(SPECIAL);
1529
1530<SPECIAL>blahblahblah
1531\&...more rules follow...
1532.Ed
1533.Pp
1534To illustrate the uses of start conditions,
1535here is a scanner which provides two different interpretations
1536of a string like
1537.Qq 123.456 .
1538By default it will treat it as three tokens: the integer
1539.Qq 123 ,
1540a dot
1541.Pq Sq .\& ,
1542and the integer
1543.Qq 456 .
1544But if the string is preceded earlier in the line by the string
1545.Qq expect-floats
1546it will treat it as a single token, the floating-point number 123.456:
1547.Bd -literal -offset indent
1548%{
#include <stdio.h>
#include <stdlib.h>
1550%}
1551%s expect
1552
1553%%
1554expect-floats        BEGIN(expect);
1555
1556<expect>[0-9]+"."[0-9]+ {
1557        printf("found a float, = %f\en",
1558            atof(yytext));
1559}
1560<expect>\en {
1561        /*
1562         * That's the end of the line, so
         * we need another "expect-floats"
1564         * before we'll recognize any more
1565         * numbers.
1566         */
1567        BEGIN(INITIAL);
1568}
1569
1570[0-9]+ {
1571        printf("found an integer, = %d\en",
1572            atoi(yytext));
1573}
1574
1575"."     printf("found a dot\en");
1576.Ed
1577.Pp
1578Here is a scanner which recognizes
1579.Pq and discards
1580C comments while maintaining a count of the current input line:
1581.Bd -literal -offset indent
1582%x comment
1583%%
        int line_num = 1;
1585
1586"/*"                    BEGIN(comment);
1587
1588<comment>[^*\en]*        /* eat anything that's not a '*' */
1589<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1590<comment>\en             ++line_num;
1591<comment>"*"+"/"        BEGIN(INITIAL);
1592.Ed
1593.Pp
1594This scanner goes to a bit of trouble to match as much
1595text as possible with each rule.
1596In general, when attempting to write a high-speed scanner
try to match as much as possible in each rule, since this improves speed considerably.
1598.Pp
1599Note that start-condition names are really integer values and
1600can be stored as such.
1601Thus, the above could be extended in the following fashion:
1602.Bd -literal -offset indent
1603%x comment foo
1604%%
        int line_num = 1;
        int comment_caller;
1607
1608"/*" {
1609        comment_caller = INITIAL;
1610        BEGIN(comment);
1611}
1612
1613\&...
1614
1615<foo>"/*" {
1616        comment_caller = foo;
1617        BEGIN(comment);
1618}
1619
1620<comment>[^*\en]*        /* eat anything that's not a '*' */
1621<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1622<comment>\en             ++line_num;
1623<comment>"*"+"/"        BEGIN(comment_caller);
1624.Ed
1625.Pp
1626Furthermore, the current start condition can be accessed by using
1627the integer-valued
1628.Dv YY_START
1629macro.
1630For example, the above assignments to
1631.Em comment_caller
1632could instead be written
1633.Pp
1634.Dl comment_caller = YY_START;
1635.Pp
1636Flex provides
1637.Dv YYSTATE
1638as an alias for
1639.Dv YY_START
1640(since that is what's used by
1641.At
1642.Nm lex ) .
1643.Pp
1644Note that start conditions do not have their own name-space;
1645%s's and %x's declare names in the same fashion as #define's.
1646.Pp
1647Finally, here's an example of how to match C-style quoted strings using
1648exclusive start conditions, including expanded escape sequences
1649(but not including checking for a string that's too long):
1650.Bd -literal -offset indent
1651%x str
1652
1653%%
        #define MAX_STR_CONST 1024
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
1657
1658\e"      string_buf_ptr = string_buf; BEGIN(str);
1659
1660<str>\e" { /* saw closing quote - all done */
1661        BEGIN(INITIAL);
1662        *string_buf_ptr = '\e0';
1663        /*
1664         * return string constant token type and
1665         * value to parser
1666         */
1667}
1668
1669<str>\en {
1670        /* error - unterminated string constant */
1671        /* generate error message */
1672}
1673
1674<str>\e\e[0-7]{1,3} {
1675        /* octal escape sequence */
        unsigned int result;

        (void) sscanf(yytext + 1, "%o", &result);

        if (result > 0xff) {
                /* error, constant is out-of-bounds */
        } else
                *string_buf_ptr++ = result;
1684}
1685
1686<str>\e\e[0-9]+ {
1687        /*
1688         * generate error - bad escape sequence; something
1689         * like '\e48' or '\e0777777'
1690         */
1691}
1692
1693<str>\e\en  *string_buf_ptr++ = '\en';
1694<str>\e\et  *string_buf_ptr++ = '\et';
1695<str>\e\er  *string_buf_ptr++ = '\er';
1696<str>\e\eb  *string_buf_ptr++ = '\eb';
1697<str>\e\ef  *string_buf_ptr++ = '\ef';
1698
1699<str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];
1700
1701<str>[^\e\e\en\e"]+ {
1702        char *yptr = yytext;
1703
1704        while (*yptr)
1705                *string_buf_ptr++ = *yptr++;
1706}
1707.Ed
1708.Pp
1709Often, such as in some of the examples above,
1710a whole bunch of rules are all preceded by the same start condition(s).
1711.Nm
1712makes this a little easier and cleaner by introducing a notion of
1713start condition
1714.Em scope .
1715A start condition scope is begun with:
1716.Pp
1717.Dl <SCs>{
1718.Pp
1719where
1720.Dq SCs
1721is a list of one or more start conditions.
1722Inside the start condition scope, every rule automatically has the prefix
1723.Aq SCs
1724applied to it, until a
1725.Sq }
1726which matches the initial
1727.Sq { .
1728So, for example,
1729.Bd -literal -offset indent
1730<ESC>{
1731    "\e\en"   return '\en';
1732    "\e\er"   return '\er';
1733    "\e\ef"   return '\ef';
1734    "\e\e0"   return '\e0';
1735}
1736.Ed
1737.Pp
1738is equivalent to:
1739.Bd -literal -offset indent
1740<ESC>"\e\en"  return '\en';
1741<ESC>"\e\er"  return '\er';
1742<ESC>"\e\ef"  return '\ef';
1743<ESC>"\e\e0"  return '\e0';
1744.Ed
1745.Pp
1746Start condition scopes may be nested.
1747.Pp
1748Three routines are available for manipulating stacks of start conditions:
1749.Bl -tag -width Ds
1750.It void yy_push_state(int new_state)
1751Pushes the current start condition onto the top of the start condition
1752stack and switches to
1753.Fa new_state
1754as though
1755.Dq BEGIN new_state
1756had been used
1757.Pq recall that start condition names are also integers .
1758.It void yy_pop_state()
1759Pops the top of the stack and switches to it via
1760.Em BEGIN .
1761.It int yy_top_state()
1762Returns the top of the stack without altering the stack's contents.
1763.El
1764.Pp
1765The start condition stack grows dynamically and so has no built-in
1766size limitation.
1767If memory is exhausted, program execution aborts.
1768.Pp
1769To use start condition stacks, scanners must include a
1770.Dq %option stack
1771directive (see
1772.Sx OPTIONS
1773below).
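.Pp
As an illustration, the comment scanner shown earlier could use the stack
so that the rule ending a comment need not know which start condition
was active when the comment began.
A minimal sketch, in which the start condition
.Dq code
is a hypothetical second context:
.Bd -literal -offset indent
%option stack
%x comment
%s code

%%
"/*"                    yy_push_state(comment);

<comment>{
    [^*\en]*             /* eat anything that's not a '*' */
    "*"+[^*/\en]*        /* eat up '*'s not followed by '/'s */
    \en                  /* newlines are part of the comment */
    "*"+"/"             yy_pop_state();
}
.Ed
.Pp
Because
.Dq comment
is exclusive, the unqualified rule for
.Dq /*
is active in
.Em INITIAL
and in the inclusive
.Dq code
condition, and
.Fn yy_pop_state
returns to whichever of the two was in effect.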
1774.Sh MULTIPLE INPUT BUFFERS
1775Some scanners
1776(such as those which support
1777.Qq include
1778files)
1779require reading from several input streams.
1780As
1781.Nm
1782scanners do a large amount of buffering, one cannot control
1783where the next input will be read from by simply writing a
1784.Dv YY_INPUT
1785which is sensitive to the scanning context.
1786.Dv YY_INPUT
1787is only called when the scanner reaches the end of its buffer, which
1788may be a long time after scanning a statement such as an
1789.Qq include
1790which requires switching the input source.
1791.Pp
1792To negotiate these sorts of problems,
1793.Nm
1794provides a mechanism for creating and switching between multiple
1795input buffers.
1796An input buffer is created by using:
1797.Pp
1798.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1799.Pp
1800which takes a
1801.Fa FILE
1802pointer and a
1803.Fa size
1804and creates a buffer associated with the given file and large enough to hold
1805.Fa size
1806characters (when in doubt, use
1807.Dv YY_BUF_SIZE
1808for the size).
1809It returns a
1810.Dv YY_BUFFER_STATE
1811handle, which may then be passed to other routines
1812.Pq see below .
1813The
1814.Dv YY_BUFFER_STATE
1815type is a pointer to an opaque
1816.Dq struct yy_buffer_state
1817structure, so
1818.Dv YY_BUFFER_STATE
1819variables may be safely initialized to
1820.Dq ((YY_BUFFER_STATE) 0)
1821if desired, and the opaque structure can also be referred to in order to
correctly declare input buffers in source files other than that of the scanner.
1823Note that the
1824.Fa FILE
1825pointer in the call to
1826.Fn yy_create_buffer
1827is only used as the value of
1828.Fa yyin
1829seen by
1830.Dv YY_INPUT ;
1831if
1832.Dv YY_INPUT
1833is redefined so that it no longer uses
1834.Fa yyin ,
1835then a nil
1836.Fa FILE
1837pointer can safely be passed to
1838.Fn yy_create_buffer .
1839To select a particular buffer to scan:
1840.Pp
1841.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1842.Pp
1843It switches the scanner's input buffer so subsequent tokens will
1844come from
1845.Fa new_buffer .
1846Note that
1847.Fn yy_switch_to_buffer
1848may be used by
1849.Fn yywrap
1850to set things up for continued scanning,
1851instead of opening a new file and pointing
1852.Fa yyin
1853at it.
1854Note also that switching input sources via either
1855.Fn yy_switch_to_buffer
1856or
1857.Fn yywrap
1858does not change the start condition.
1859.Pp
1860.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1861.Pp
1862is used to reclaim the storage associated with a buffer.
1863.Pf ( Fa buffer
1864can be nil, in which case the routine does nothing.)
1865To clear the current contents of a buffer:
1866.Pp
1867.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1868.Pp
1869This function discards the buffer's contents,
1870so the next time the scanner attempts to match a token from the buffer,
1871it will first fill the buffer anew using
1872.Dv YY_INPUT .
1873.Pp
1874.Fn yy_new_buffer
1875is an alias for
1876.Fn yy_create_buffer ,
1877provided for compatibility with the C++ use of
1878.Em new
1879and
1880.Em delete
1881for creating and destroying dynamic objects.
1882.Pp
1883Finally, the
1884.Dv YY_CURRENT_BUFFER
1885macro returns a
1886.Dv YY_BUFFER_STATE
1887handle to the current buffer.
1888.Pp
1889Here is an example of using these features for writing a scanner
1890which expands include files (the
1891.Aq Aq EOF
1892feature is discussed below):
1893.Bd -literal -offset indent
1894/*
1895 * the "incl" state is used for picking up the name
1896 * of an include file
1897 */
1898%x incl
1899
1900%{
1901#define MAX_INCLUDE_DEPTH 10
1902YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1903int include_stack_ptr = 0;
1904%}
1905
1906%%
1907include             BEGIN(incl);
1908
1909[a-z]+              ECHO;
1910[^a-z\en]*\en?        ECHO;
1911
1912<incl>[ \et]*        /* eat the whitespace */
1913<incl>[^ \et\en]+ {   /* got the include file name */
1914        if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1915                errx(1, "Includes nested too deeply");
1916
1917        include_stack[include_stack_ptr++] =
1918            YY_CURRENT_BUFFER;
1919
1920        yyin = fopen(yytext, "r");
1921
1922        if (yyin == NULL)
1923                err(1, NULL);
1924
1925        yy_switch_to_buffer(
1926            yy_create_buffer(yyin, YY_BUF_SIZE));
1927
1928        BEGIN(INITIAL);
1929}
1930
1931<<EOF>> {
1932        if (--include_stack_ptr < 0)
1933                yyterminate();
1934        else {
1935                yy_delete_buffer(YY_CURRENT_BUFFER);
1936                yy_switch_to_buffer(
1937                    include_stack[include_stack_ptr]);
        }
1939}
1940.Ed
1941.Pp
1942Three routines are available for setting up input buffers for
1943scanning in-memory strings instead of files.
1944All of them create a new input buffer for scanning the string,
1945and return a corresponding
1946.Dv YY_BUFFER_STATE
1947handle (which should be deleted afterwards using
1948.Fn yy_delete_buffer ) .
1949They also switch to the new buffer using
1950.Fn yy_switch_to_buffer ,
1951so the next call to
1952.Fn yylex
1953will start scanning the string.
1954.Bl -tag -width Ds
1955.It yy_scan_string(const char *str)
1956Scans a NUL-terminated string.
1957.It yy_scan_bytes(const char *bytes, int len)
1958Scans
1959.Fa len
1960bytes
1961.Pq including possibly NUL's
1962starting at location
1963.Fa bytes .
1964.El
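.Pp
A minimal sketch of scanning in-memory data with these two routines,
intended for the user code section of a scanner built with
.Dq %option noyywrap
(the data scanned is arbitrary sample text):
.Bd -literal -offset indent
int
main(void)
{
        static const char raw[] = { 'a', '\e0', 'b' };
        YY_BUFFER_STATE bs;

        bs = yy_scan_string("some text to scan");
        yylex();                /* scans a copy of the string */
        yy_delete_buffer(bs);

        bs = yy_scan_bytes(raw, (int)sizeof(raw));
        yylex();                /* the embedded NUL is scanned too */
        yy_delete_buffer(bs);
        return 0;
}
.Ed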
1965.Pp
1966Note that both of these functions create and scan a copy
1967of the string or bytes.
1968(This may be desirable, since
1969.Fn yylex
1970modifies the contents of the buffer it is scanning.)
1971The copy can be avoided by using:
1972.Bl -tag -width Ds
1973.It yy_scan_buffer(char *base, yy_size_t size)
1974Which scans the buffer starting at
1975.Fa base ,
1976consisting of
1977.Fa size
1978bytes, the last two bytes of which must be
1979.Dv YY_END_OF_BUFFER_CHAR
1980.Pq ASCII NUL .
1981These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-3], inclusive.
1983.Pp
1984If
1985.Fa base
1986is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
1990.Fn yy_scan_buffer
1991returns a nil pointer instead of creating a new input buffer.
1992.Pp
The type
.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
1997.El
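.Pp
A sketch of preparing a buffer for
.Fn yy_scan_buffer ,
again meant for the user code section of a scanner.
The helper name
.Fn scan_in_place
is hypothetical:
.Bd -literal -offset indent
#include <stdlib.h>
#include <string.h>

/* Scan len bytes of text from a buffer owned by the caller. */
int
scan_in_place(const char *text, size_t len)
{
        YY_BUFFER_STATE bs;
        char *base;

        /* room for the data plus the two required end markers */
        if ((base = malloc(len + 2)) == NULL)
                return -1;
        memcpy(base, text, len);
        base[len] = base[len + 1] = YY_END_OF_BUFFER_CHAR;

        if ((bs = yy_scan_buffer(base, len + 2)) == NULL) {
                free(base);
                return -1;
        }
        yylex();
        yy_delete_buffer(bs);   /* releases the handle, not base */
        free(base);
        return 0;
}
.Ed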
1998.Sh END-OF-FILE RULES
1999The special rule
2000.Qq Aq Aq EOF
2001indicates actions which are to be taken when an end-of-file is encountered and
2002.Fn yywrap
2003returns non-zero
2004.Pq i.e., indicates no further files to process .
2005The action must finish by doing one of four things:
2006.Bl -dash
2007.It
2008Assigning
2009.Em yyin
2010to a new input file
2011(in previous versions of
2012.Nm ,
2013after doing the assignment, it was necessary to call the special action
2014.Dv YY_NEW_FILE ;
2015this is no longer necessary).
2016.It
2017Executing a
2018.Em return
2019statement.
2020.It
2021Executing the special
2022.Fn yyterminate
2023action.
2024.It
2025Switching to a new buffer using
2026.Fn yy_switch_to_buffer
2027as shown in the example above.
2028.El
2029.Pp
2030.Aq Aq EOF
2031rules may not be used with other patterns;
2032they may only be qualified with a list of start conditions.
2033If an unqualified
2034.Aq Aq EOF
2035rule is given, it applies to all start conditions which do not already have
2036.Aq Aq EOF
2037actions.
2038To specify an
2039.Aq Aq EOF
2040rule for only the initial start condition, use
2041.Pp
2042.Dl <INITIAL><<EOF>>
2043.Pp
2044These rules are useful for catching things like unclosed comments.
2045An example:
2046.Bd -literal -offset indent
2047%x quote
2048%%
2049
2050\&...other rules for dealing with quotes...
2051
2052<quote><<EOF>> {
2053         error("unterminated quote");
2054         yyterminate();
2055}
2056<<EOF>> {
2057         if (*++filelist)
2058                 yyin = fopen(*filelist, "r");
2059         else
2060                 yyterminate();
2061}
2062.Ed
2063.Sh MISCELLANEOUS MACROS
2064The macro
2065.Dv YY_USER_ACTION
2066can be defined to provide an action
2067which is always executed prior to the matched rule's action.
2068For example,
2069it could be #define'd to call a routine to convert yytext to lower-case.
2070When
2071.Dv YY_USER_ACTION
2072is invoked, the variable
2073.Fa yy_act
2074gives the number of the matched rule
2075.Pq rules are numbered starting with 1 .
2076For example, to profile how often each rule is matched,
2077the following would do the trick:
2078.Pp
2079.Dl #define YY_USER_ACTION ++ctr[yy_act]
2080.Pp
2081where
2082.Fa ctr
2083is an array to hold the counts for the different rules.
2084Note that the macro
2085.Dv YY_NUM_RULES
2086gives the total number of rules
2087(including the default rule, even if
2088.Fl s
2089is used),
2090so a correct declaration for
2091.Fa ctr
2092is:
2093.Pp
2094.Dl int ctr[YY_NUM_RULES];
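.Pp
Another possible use is position tracking.
In the following sketch the counter
.Va column
is a hypothetical variable, not something
.Nm
provides; the macro carries its own semicolon so that it expands to a
complete statement:
.Bd -literal -offset indent
%{
static int column;      /* hypothetical: characters seen on this line */
#define YY_USER_ACTION column += yyleng;
%}
%option noyywrap
%%
\en      column = 0;     /* a newline starts a fresh line */
\&.      ;               /* every other character is just counted */
.Ed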
2095.Pp
2096The macro
2097.Dv YY_USER_INIT
2098may be defined to provide an action which is always executed before
2099the first scan
2100.Pq and before the scanner's internal initializations are done .
2101For example, it could be used to call a routine to read
2102in a data table or open a logging file.
2103.Pp
2104The macro
2105.Dv yy_set_interactive(is_interactive)
2106can be used to control whether the current buffer is considered
2107.Em interactive .
2108An interactive buffer is processed more slowly,
2109but must be used when the scanner's input source is indeed
2110interactive to avoid problems due to waiting to fill buffers
2111(see the discussion of the
2112.Fl I
2113flag below).
2114A non-zero value in the macro invocation marks the buffer as interactive,
2115a zero value as non-interactive.
2116Note that use of this macro overrides
2117.Dq %option always-interactive
2118or
2119.Dq %option never-interactive
2120(see
2121.Sx OPTIONS
2122below).
2123.Fn yy_set_interactive
2124must be invoked prior to beginning to scan the buffer that is
2125.Pq or is not
2126to be considered interactive.
2127.Pp
2128The macro
2129.Dv yy_set_bol(at_bol)
2130can be used to control whether the current buffer's scanning
2131context for the next token match is done as though at the
2132beginning of a line.
2133A non-zero macro argument makes rules anchored with
2134.Sq ^
2135active, while a zero argument makes
2136.Sq ^
2137rules inactive.
2138.Pp
2139The macro
2140.Dv YY_AT_BOL
2141returns true if the next token scanned from the current buffer will have
2142.Sq ^
2143rules active, false otherwise.
2144.Pp
2145In the generated scanner, the actions are all gathered in one large
2146switch statement and separated using
2147.Dv YY_BREAK ,
2148which may be redefined.
2149By default, it is simply a
2150.Qq break ,
2151to separate each rule's action from the following rules.
2152Redefining
2153.Dv YY_BREAK
2154allows, for example, C++ users to
2155.Dq #define YY_BREAK
2156to do nothing
2157(while being very careful that every rule ends with a
2158.Qq break
2159or a
2160.Qq return ! )
to avoid unreachable-statement warnings, which arise when a rule's
action ends with
.Dq return
and the
.Dv YY_BREAK
that follows it can never be reached.
2167.Sh VALUES AVAILABLE TO THE USER
2168This section summarizes the various values available to the user
2169in the rule actions.
2170.Bl -tag -width Ds
2171.It char *yytext
2172Holds the text of the current token.
2173It may be modified but not lengthened
2174.Pq characters cannot be appended to the end .
2175.Pp
2176If the special directive
2177.Dq %array
2178appears in the first section of the scanner description, then
2179.Fa yytext
2180is instead declared
2181.Dq char yytext[YYLMAX] ,
2182where
2183.Dv YYLMAX
2184is a macro definition that can be redefined in the first section
2185to change the default value
2186.Pq generally 8KB .
2187Using
2188.Dq %array
2189results in somewhat slower scanners, but the value of
2190.Fa yytext
2191becomes immune to calls to
2192.Fn input
2193and
2194.Fn unput ,
2195which potentially destroy its value when
2196.Fa yytext
2197is a character pointer.
2198The opposite of
2199.Dq %array
2200is
2201.Dq %pointer ,
2202which is the default.
2203.Pp
2204.Dq %array
2205cannot be used when generating C++ scanner classes
2206(the
2207.Fl +
2208flag).
2209.It int yyleng
2210Holds the length of the current token.
2211.It FILE *yyin
2212Is the file which by default
2213.Nm
2214reads from.
2215It may be redefined, but doing so only makes sense before
2216scanning begins or after an
2217.Dv EOF
2218has been encountered.
2219Changing it in the midst of scanning will have unexpected results since
2220.Nm
2221buffers its input; use
2222.Fn yyrestart
2223instead.
2224Once scanning terminates because an end-of-file
2225has been seen,
2226.Fa yyin
2227can be assigned as the new input file
2228and the scanner can be called again to continue scanning.
2229.It void yyrestart(FILE *new_file)
2230May be called to point
2231.Fa yyin
2232at the new input file.
2233The switch-over to the new file is immediate
2234.Pq any previously buffered-up input is lost .
2235Note that calling
2236.Fn yyrestart
2237with
2238.Fa yyin
2239as an argument thus throws away the current input buffer and continues
2240scanning the same input file.
2241.It FILE *yyout
2242Is the file to which
2243.Em ECHO
2244actions are done.
2245It can be reassigned by the user.
2246.It YY_CURRENT_BUFFER
2247Returns a
2248.Dv YY_BUFFER_STATE
2249handle to the current buffer.
2250.It YY_START
2251Returns an integer value corresponding to the current start condition.
2252This value can subsequently be used with
2253.Em BEGIN
2254to return to that start condition.
2255.El
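.Pp
The following sketch pulls several of these values together.
It reports each lower-case word it matches, along with the word's length
and the start condition in effect; the output format is arbitrary:
.Bd -literal -offset indent
%option noyywrap
%%
[a-z]+  {
        fprintf(yyout, "word '%s', length %d, state %d\en",
            yytext, (int)yyleng, YY_START);
}
\&.|\en    ;       /* ignore everything else */
.Ed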
2256.Sh INTERFACING WITH YACC
2257One of the main uses of
2258.Nm
2259is as a companion to the
2260.Xr yacc 1
2261parser-generator.
2262yacc parsers expect to call a routine named
2263.Fn yylex
2264to find the next input token.
2265The routine is supposed to return the type of the next token
2266as well as putting any associated value in the global
2267.Fa yylval ,
2268which is defined externally,
2269and can be a union or any other complex data structure.
2270To use
2271.Nm
2272with yacc, one specifies the
2273.Fl d
2274option to yacc to instruct it to generate the file
2275.Pa y.tab.h
2276containing definitions of all the
2277.Dq %tokens
2278appearing in the yacc input.
2279This file is then included in the
2280.Nm
2281scanner.
2282For example, if one of the tokens is
2283.Qq TOK_NUMBER ,
2284part of the scanner might look like:
2285.Bd -literal -offset indent
2286%{
2287#include "y.tab.h"
2288%}
2289
2290%%
2291
2292[0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
2293.Ed
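.Pp
When the yacc grammar declares a
.Dq %union ,
.Fa yylval
is a union and the scanner assigns to the appropriate member.
A sketch, assuming the grammar contains hypothetical declarations such as
.Dq %union { int ival; }
and
.Dq %token <ival> TOK_NUMBER :
.Bd -literal -offset indent
%{
#include <stdlib.h>
#include "y.tab.h"
%}

%%

[0-9]+        yylval.ival = atoi(yytext); return TOK_NUMBER;
.Ed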
2294.Sh OPTIONS
2295.Nm
2296has the following options:
2297.Bl -tag -width Ds
2298.It Fl 7
2299Instructs
2300.Nm
2301to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2302characters in its input.
2303The advantage of using
2304.Fl 7
2305is that the scanner's tables can be up to half the size of those generated
2306using the
2307.Fl 8
2308option
2309.Pq see below .
2310The disadvantage is that such scanners often hang
2311or crash if their input contains an 8-bit character.
2312.Pp
2313Note, however, that unless generating a scanner using the
2314.Fl Cf
2315or
2316.Fl CF
2317table compression options, use of
2318.Fl 7
2319will save only a small amount of table space,
2320and make the scanner considerably less portable.
2321.Nm flex Ns 's
2322default behavior is to generate an 8-bit scanner unless
2323.Fl Cf
2324or
2325.Fl CF
2326is specified, in which case
2327.Nm
2328defaults to generating 7-bit scanners unless it was
2329configured to generate 8-bit scanners
2330(as will often be the case with non-USA sites).
It is possible to tell whether
2332.Nm
2333generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2334.Fl v
2335output as described below.
2336.Pp
2337Note that if
2338.Fl Cfe
2339or
2340.Fl CFe
is used
2342(the table compression options, but also using equivalence classes as
2343discussed below),
2344.Nm
2345still defaults to generating an 8-bit scanner,
2346since usually with these compression options full 8-bit tables
2347are not much more expensive than 7-bit tables.
2348.It Fl 8
2349Instructs
2350.Nm
2351to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2352characters.
2353This flag is only needed for scanners generated using
2354.Fl Cf
2355or
2356.Fl CF ,
2357as otherwise
2358.Nm
2359defaults to generating an 8-bit scanner anyway.
2360.Pp
2361See the discussion of
2362.Fl 7
2363above for
2364.Nm flex Ns 's
2365default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2366.It Fl B
2367Instructs
2368.Nm
2369to generate a
2370.Em batch
2371scanner, the opposite of
2372.Em interactive
2373scanners generated by
2374.Fl I
2375.Pq see below .
2376In general,
2377.Fl B
2378is used when the scanner will never be used interactively,
2379and you want to squeeze a little more performance out of it.
2380If the aim is instead to squeeze out a lot more performance,
2381use the
2382.Fl Cf
2383or
2384.Fl CF
2385options
2386.Pq discussed below ,
2387which turn on
2388.Fl B
2389automatically anyway.
2390.It Fl b
2391Generate backing-up information to
2392.Pa lex.backup .
2393This is a list of scanner states which require backing up
2394and the input characters on which they do so.
2395By adding rules one can remove backing-up states.
2396If all backing-up states are eliminated and
2397.Fl Cf
2398or
2399.Fl CF
2400is used, the generated scanner will run faster (see the
2401.Fl p
2402flag).
2403Only users who wish to squeeze every last cycle out of their
2404scanners need worry about this option.
2405(See the section on
2406.Sx PERFORMANCE CONSIDERATIONS
2407below.)
2408.It Fl C Ns Op Cm aeFfmr
2409Controls the degree of table compression and, more generally, trade-offs
2410between small scanners and fast scanners.
2411.Bl -tag -width Ds
2412.It Fl Ca
2413Instructs
2414.Nm
2415to trade off larger tables in the generated scanner for faster performance
2416because the elements of the tables are better aligned for memory access
2417and computation.
2418On some
2419.Tn RISC
2420architectures, fetching and manipulating longwords is more efficient
2421than with smaller-sized units such as shortwords.
2422This option can double the size of the tables used by the scanner.
2423.It Fl Ce
2424Directs
2425.Nm
2426to construct
2427.Em equivalence classes ,
2428i.e., sets of characters which have identical lexical properties
2429(for example, if the only appearance of digits in the
2430.Nm
2431input is in the character class
2432.Qq [0-9]
2433then the digits
2434.Sq 0 ,
2435.Sq 1 ,
2436.Sq ... ,
2437.Sq 9
2438will all be put in the same equivalence class).
2439Equivalence classes usually give dramatic reductions in the final
2440table/object file sizes
2441.Pq typically a factor of 2\-5
2442and are pretty cheap performance-wise
2443.Pq one array look-up per character scanned .
2444.It Fl CF
2445Specifies that the alternate fast scanner representation
2446(described below under the
2447.Fl F
2448option)
2449should be used.
2450This option cannot be used with
2451.Fl + .
2452.It Fl Cf
2453Specifies that the
2454.Em full
2455scanner tables should be generated \-
2456.Nm
2457should not compress the tables by taking advantage of
2458similar transition functions for different states.
2459.It Fl \&Cm
2460Directs
2461.Nm
2462to construct
2463.Em meta-equivalence classes ,
2464which are sets of equivalence classes
2465(or characters, if equivalence classes are not being used)
2466that are commonly used together.
2467Meta-equivalence classes are often a big win when using compressed tables,
2468but they have a moderate performance impact
2469(one or two
2470.Qq if
2471tests and one array look-up per character scanned).
2472.It Fl Cr
2473Causes the generated scanner to
2474.Em bypass
2475use of the standard I/O library
2476.Pq stdio
2477for input.
2478Instead of calling
2479.Xr fread 3
2480or
2481.Xr getc 3 ,
2482the scanner will use the
2483.Xr read 2
2484system call,
2485resulting in a performance gain which varies from system to system,
2486but in general is probably negligible unless
2487.Fl Cf
2488or
2489.Fl CF
2490are being used.
2491Using
2492.Fl Cr
2493can cause strange behavior if, for example, reading from
2494.Fa yyin
2495using stdio prior to calling the scanner
2496(because the scanner will miss whatever text previous reads left
2497in the stdio input buffer).
2498.Pp
2499.Fl Cr
2500has no effect if
2501.Dv YY_INPUT
2502is defined
2503(see
2504.Sx THE GENERATED SCANNER
2505above).
2506.El
2507.Pp
2508A lone
2509.Fl C
2510specifies that the scanner tables should be compressed but neither
2511equivalence classes nor meta-equivalence classes should be used.
2512.Pp
2513The options
2514.Fl Cf
2515or
2516.Fl CF
2517and
2518.Fl \&Cm
2519do not make sense together \- there is no opportunity for meta-equivalence
2520classes if the table is not being compressed.
2521Otherwise the options may be freely mixed, and are cumulative.
2522.Pp
2523The default setting is
2524.Fl Cem
2525which specifies that
2526.Nm
2527should generate equivalence classes and meta-equivalence classes.
2528This setting provides the highest degree of table compression.
Faster-executing scanners can be obtained at the cost of
larger tables, with the following generally being true:
2531.Bd -unfilled -offset indent
2532slowest & smallest
2533      -Cem
2534      -Cm
2535      -Ce
2536      -C
2537      -C{f,F}e
2538      -C{f,F}
2539      -C{f,F}a
2540fastest & largest
2541.Ed
2542.Pp
Note that scanners with the smallest tables are usually generated and
compiled the quickest,
so during development the default, maximal compression, is usually best.
2547.Pp
2548.Fl Cfe
2549is often a good compromise between speed and size for production scanners.
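.Pp
As a rough sketch of how these settings might be chosen in practice,
the same specification can be built with different
.Fl C
options at different stages.
The file name
.Pa scan.l
and the rules are placeholders for illustration:
.Bd -literal -offset indent
/*
 * development: flex scan.l        (default, same as -Cem)
 * production:  flex -Cfe scan.l   (faster, larger tables)
 */
%option main
%%
[0-9]+      printf("number\en");
[a-z]+      printf("word\en");
\&.|\en      ;
.Ed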
2550.It Fl d
2551Makes the generated scanner run in debug mode.
2552Whenever a pattern is recognized and the global
2553.Fa yy_flex_debug
2554is non-zero
2555.Pq which is the default ,
2556the scanner will write to stderr a line of the form:
2557.Pp
2558.D1 --accepting rule at line 53 ("the matched text")
2559.Pp
2560The line number refers to the location of the rule in the file
2561defining the scanner
2562(i.e., the file that was fed to
2563.Nm ) .
2564Messages are also generated when the scanner backs up,
2565accepts the default rule,
2566reaches the end of its input buffer
2567(or encounters a NUL;
2568at this point, the two look the same as far as the scanner's concerned),
2569or reaches an end-of-file.
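.Pp
Because
.Fa yy_flex_debug
is an ordinary global, tracing can be switched at run time.
A minimal sketch, in which the use of a command-line argument to enable
tracing is an arbitrary choice made for the example:
.Bd -literal -offset indent
%option debug noyywrap
%%
[0-9]+      printf("number\en");
[a-z]+      printf("word\en");
\&.|\en      ;
%%
int
main(int argc, char *argv[])
{
        (void)argv;
        /* -d turns tracing on by default; keep it only if asked for */
        yy_flex_debug = (argc > 1);
        return yylex();
}
.Ed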
2570.It Fl F
2571Specifies that the fast scanner table representation should be used
2572.Pq and stdio bypassed .
2573This representation is about as fast as the full table representation
2574.Pq Fl f ,
2575and for some sets of patterns will be considerably smaller
2576.Pq and for others, larger .
2577In general, if the pattern set contains both
2578.Qq keywords
2579and a catch-all,
2580.Qq identifier
2581rule, such as in the set:
2582.Bd -unfilled -offset indent
2583"case"    return TOK_CASE;
2584"switch"  return TOK_SWITCH;
2585\&...
2586"default" return TOK_DEFAULT;
2587[a-z]+    return TOK_ID;
2588.Ed
2589.Pp
2590then it's better to use the full table representation.
2591If only the
2592.Qq identifier
2593rule is present and a hash table or some such is used to detect the keywords,
2594it's better to use
2595.Fl F .
2596.Pp
2597This option is equivalent to
2598.Fl CFr
2599.Pq see above .
2600It cannot be used with
2601.Fl + .
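.Pp
The second arrangement might look roughly like the following sketch, which
keeps only the
.Qq identifier
rule and finds keywords with
.Xr bsearch 3
over a sorted table.
The file name, the token values, the table, and the
.Fn lookup
helper are hypothetical and exist only for the example:
.Bd -literal -offset indent
/* build: flex -F ident.l */
%{
#include <stdlib.h>
#include <string.h>

/* hypothetical token codes; normally taken from y.tab.h */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

/* must stay sorted by name for bsearch(3) */
static const struct kw {
        const char *name;
        int tok;
} kwtab[] = {
        { "case",    TOK_CASE },
        { "default", TOK_DEFAULT },
        { "switch",  TOK_SWITCH },
};

static int
kwcmp(const void *key, const void *elem)
{
        return strcmp(key, ((const struct kw *)elem)->name);
}

static int
lookup(const char *s)
{
        const struct kw *k;

        k = bsearch(s, kwtab, sizeof(kwtab) / sizeof(kwtab[0]),
            sizeof(kwtab[0]), kwcmp);
        return k != NULL ? k->tok : TOK_ID;
}
%}
%option noyywrap
%%
[a-z]+      return lookup(yytext);
\&.|\en      ;
.Ed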
2602.It Fl f
2603Specifies
2604.Em fast scanner .
2605No table compression is done and stdio is bypassed.
2606The result is large but fast.
2607This option is equivalent to
2608.Fl Cfr
2609.Pq see above .
2610.It Fl h
2611Generates a help summary of
2612.Nm flex Ns 's
2613options to stdout and then exits.
2614.Fl ?\&
2615and
2616.Fl Fl help
2617are synonyms for
2618.Fl h .
2619.It Fl I
2620Instructs
2621.Nm
2622to generate an
2623.Em interactive
2624scanner.
2625An interactive scanner is one that only looks ahead to decide
2626what token has been matched if it absolutely must.
2627It turns out that always looking one extra character ahead,
2628even if the scanner has already seen enough text
2629to disambiguate the current token, is a bit faster than
2630only looking ahead when necessary.
2631But scanners that always look ahead give dreadful interactive performance;
2632for example, when a user types a newline,
2633it is not recognized as a newline token until they enter
2634.Em another
2635token, which often means typing in another whole line.
2636.Pp
2637.Nm
2638scanners default to
2639.Em interactive
2640unless
2641.Fl Cf
2642or
2643.Fl CF
2644table-compression options are specified
2645.Pq see above .
That's because if high performance is most important,
one of these options should be used,
so if neither was,
2649.Nm
2650assumes it is preferable to trade off a bit of run-time performance for
2651intuitive interactive behavior.
2652Note also that
2653.Fl I
2654cannot be used in conjunction with
2655.Fl Cf
2656or
2657.Fl CF .
2658Thus, this option is not really needed; it is on by default for all those
2659cases in which it is allowed.
2660.Pp
2661A scanner can be forced to not be interactive by using
2662.Fl B
2663.Pq see above .
2664.It Fl i
2665Instructs
2666.Nm
2667to generate a case-insensitive scanner.
2668The case of letters given in the
2669.Nm
2670input patterns will be ignored,
2671and tokens in the input will be matched regardless of case.
2672The matched text given in
2673.Fa yytext
2674will have the preserved case
2675.Pq i.e., it will not be folded .
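.Pp
A small sketch (the rules are illustrative), using the equivalent
.Dq %option case-insensitive :
.Bd -literal -offset indent
%option case-insensitive main
%%
select      printf("keyword: %s\en", yytext);
[a-z]+      printf("word: %s\en", yytext);
\&.|\en      ;
.Ed
.Pp
Given the input
.Qq SeLeCt ,
the first rule matches and
.Fa yytext
holds
.Qq SeLeCt
exactly as typed.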
2676.It Fl L
2677Instructs
2678.Nm
2679not to generate
2680.Dq #line
2681directives.
2682Without this option,
2683.Nm
2684peppers the generated scanner with #line directives so error messages
2685in the actions will be correctly located with respect to either the original
2686.Nm
2687input file
2688(if the errors are due to code in the input file),
2689or
2690.Pa lex.yy.c
2691(if the errors are
2692.Nm flex Ns 's
2693fault \- these sorts of errors should be reported to the email address
2694given below).
2695.It Fl l
2696Turns on maximum compatibility with the original
2697.At
2698.Nm lex
2699implementation.
2700Note that this does not mean full compatibility.
2701Use of this option costs a considerable amount of performance,
2702and it cannot be used with the
2703.Fl + , f , F , Cf ,
2704or
2705.Fl CF
2706options.
2707For details on the compatibilities it provides, see the section
2708.Sx INCOMPATIBILITIES WITH LEX AND POSIX
2709below.
2710This option also results in the name
2711.Dv YY_FLEX_LEX_COMPAT
2712being #define'd in the generated scanner.
2713.It Fl n
2714Another do-nothing, deprecated option included only for
2715.Tn POSIX
2716compliance.
2717.It Fl o Ns Ar output
2718Directs
2719.Nm
2720to write the scanner to the file
2721.Ar output
2722instead of
2723.Pa lex.yy.c .
2724If
2725.Fl o
2726is combined with the
2727.Fl t
2728option, then the scanner is written to stdout but its
2729.Dq #line
2730directives
2731(see the
2732.Fl L
2733option above)
2734refer to the file
2735.Ar output .
2736.It Fl P Ns Ar prefix
2737Changes the default
2738.Qq yy
2739prefix used by
2740.Nm
2741for all globally visible variable and function names to instead be
2742.Ar prefix .
2743For example,
2744.Fl P Ns Ar foo
2745changes the name of
2746.Fa yytext
2747to
2748.Fa footext .
2749It also changes the name of the default output file from
2750.Pa lex.yy.c
2751to
2752.Pa lex.foo.c .
2753Here are all of the names affected:
2754.Bd -unfilled -offset indent
2755yy_create_buffer
2756yy_delete_buffer
2757yy_flex_debug
2758yy_init_buffer
2759yy_flush_buffer
2760yy_load_buffer_state
2761yy_switch_to_buffer
2762yyin
2763yyleng
2764yylex
2765yylineno
2766yyout
2767yyrestart
2768yytext
2769yywrap
2770.Ed
2771.Pp
2772(If using a C++ scanner, then only
2773.Fa yywrap
2774and
2775.Fa yyFlexLexer
2776are affected.)
2777Within the scanner itself, it is still possible to refer to the global variables
2778and functions using either version of their name; but externally, they
2779have the modified name.
2780.Pp
2781This option allows multiple
2782.Nm
2783programs to be easily linked together into the same executable.
2784Note, though, that using this option also renames
2785.Fn yywrap ,
2786so now either an
2787.Pq appropriately named
2788version of the routine for the scanner must be supplied, or
2789.Dq %option noyywrap
2790must be used, as linking with
2791.Fl lfl
2792no longer provides one by default.
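.Pp
A sketch of how a second scanner might be given its own names.
The prefix
.Qq cfg ,
the file name
.Pa cfg.l ,
and the helper
.Fn cfg_scan
are hypothetical:
.Bd -literal -offset indent
/* cfg.l: "flex cfg.l" should write lex.cfg.c */
%option prefix="cfg" noyywrap
%%
[a-z]+      return 1;
\&.|\en      ;
%%
/*
 * Inside this file yyin and yylex() still work;
 * other files must use cfgin and cfglex().
 */
int
cfg_scan(FILE *fp)
{
        yyin = fp;
        return yylex();
}
.Ed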
2793.It Fl p
2794Generates a performance report to stderr.
2795The report consists of comments regarding features of the
2796.Nm
2797input file which will cause a serious loss of performance in the resulting
2798scanner.
2799If the flag is specified twice,
2800comments regarding features that lead to minor performance losses
will also be reported.
2802.Pp
2803Note that the use of
2804.Em REJECT ,
2805.Dq %option yylineno ,
2806and variable trailing context
2807(see the
2808.Sx BUGS
2809section below)
2810entails a substantial performance penalty; use of
2811.Fn yymore ,
2812the
2813.Sq ^
2814operator, and the
2815.Fl I
2816flag entail minor performance penalties.
2817.It Fl S Ns Ar skeleton
2818Overrides the default skeleton file from which
2819.Nm
2820constructs its scanners.
2821This option is needed only for
2822.Nm
2823maintenance or development.
2824.It Fl s
2825Causes the default rule
2826.Pq that unmatched scanner input is echoed to stdout
2827to be suppressed.
2828If the scanner encounters input that does not
2829match any of its rules, it aborts with an error.
2830This option is useful for finding holes in a scanner's rule set.
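.Pp
A minimal sketch of the effect, using the equivalent
.Dq %option nodefault
(the rules are illustrative only):
.Bd -literal -offset indent
%option nodefault main
%%
[0-9]+      printf("number\en");
[ \et\en]+    ;
.Ed
.Pp
With input such as
.Qq 12 ab ,
this scanner prints
.Qq number
and then aborts on reaching the
.Sq a ,
which no rule matches.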
2831.It Fl T
2832Makes
2833.Nm
2834run in
2835.Em trace
2836mode.
2837It will generate a lot of messages to stderr concerning
2838the form of the input and the resultant non-deterministic and deterministic
2839finite automata.
2840This option is mostly for use in maintaining
2841.Nm .
2842.It Fl t
2843Instructs
2844.Nm
2845to write the scanner it generates to standard output instead of
2846.Pa lex.yy.c .
2847.It Fl V
2848Prints the version number to stdout and exits.
2849.Fl Fl version
2850is a synonym for
2851.Fl V .
2852.It Fl v
2853Specifies that
2854.Nm
2855should write to stderr
2856a summary of statistics regarding the scanner it generates.
2857Most of the statistics are meaningless to the casual
2858.Nm
2859user, but the first line identifies the version of
2860.Nm
2861(same as reported by
2862.Fl V ) ,
2863and the next line the flags used when generating the scanner,
2864including those that are on by default.
2865.It Fl w
2866Suppresses warning messages.
2867.It Fl +
2868Specifies that
2869.Nm
2870should generate a C++ scanner class.
2871See the section on
2872.Sx GENERATING C++ SCANNERS
2873below for details.
2874.El
2875.Pp
2876.Nm
2877also provides a mechanism for controlling options within the
2878scanner specification itself, rather than from the
2879.Nm
2880command line.
2881This is done by including
2882.Dq %option
2883directives in the first section of the scanner specification.
2884Multiple options can be specified with a single
2885.Dq %option
directive, and multiple directives may appear in the first section of the
2887.Nm
2888input file.
2889.Pp
2890Most options are given simply as names, optionally preceded by the word
2891.Qq no
2892.Pq with no intervening whitespace
2893to negate their meaning.
2894A number are equivalent to
2895.Nm
2896flags or their negation:
2897.Bd -unfilled -offset indent
28987bit            -7 option
28998bit            -8 option
2900align           -Ca option
2901backup          -b option
2902batch           -B option
2903c++             -+ option
2904
2905caseful or
2906case-sensitive  opposite of -i (default)
2907
2908case-insensitive or
2909caseless        -i option
2910
2911debug           -d option
2912default         opposite of -s option
2913ecs             -Ce option
2914fast            -F option
2915full            -f option
2916interactive     -I option
2917lex-compat      -l option
2918meta-ecs        -Cm option
2919perf-report     -p option
2920read            -Cr option
2921stdout          -t option
2922verbose         -v option
2923warn            opposite of -w option
2924                (use "%option nowarn" for -w)
2925
2926array           equivalent to "%array"
2927pointer         equivalent to "%pointer" (default)
2928.Ed
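.Pp
As an illustration (the selection of options and the rules are arbitrary),
the following sketch spreads its options over several directives and uses the
.Qq no
prefix twice:
.Bd -literal -offset indent
/* same as: flex -8 -Cem -s */
%option 8bit ecs meta-ecs
%option nodefault
/* no command-line equivalent */
%option noyywrap
%%
[a-z]+      printf("word\en");
[^a-z]+     ;
%%
int
main(void)
{
        return yylex();
}
.Ed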
2929.Pp
2930Some %option's provide features otherwise not available:
2931.Bl -tag -width Ds
2932.It always-interactive
2933Instructs
2934.Nm
2935to generate a scanner which always considers its input
2936.Qq interactive .
2937Normally, on each new input file the scanner calls
2938.Fn isatty
2939in an attempt to determine whether the scanner's input source is interactive
2940and thus should be read a character at a time.
2941When this option is used, however, no such call is made.
2942.It main
2943Directs
2944.Nm
2945to provide a default
2946.Fn main
2947program for the scanner, which simply calls
2948.Fn yylex .
2949This option implies
2950.Dq noyywrap
2951.Pq see below .
2952.It never-interactive
2953Instructs
2954.Nm
2955to generate a scanner which never considers its input
2956.Qq interactive
2957(again, no call made to
2958.Fn isatty ) .
2959This is the opposite of
2960.Dq always-interactive .
2961.It stack
2962Enables the use of start condition stacks
2963(see
2964.Sx START CONDITIONS
2965above).
2966.It stdinit
2967If set (i.e.,
2968.Dq %option stdinit ) ,
2969initializes
2970.Fa yyin
2971and
2972.Fa yyout
2973to stdin and stdout, instead of the default of
2974.Dq nil .
2975Some existing
2976.Nm lex
2977programs depend on this behavior, even though it is not compliant with ANSI C,
2978which does not require stdin and stdout to be compile-time constant.
2979.It yylineno
2980Directs
2981.Nm
2982to generate a scanner that maintains the number of the current line
2983read from its input in the global variable
2984.Fa yylineno .
2985This option is implied by
2986.Dq %option lex-compat .
2987.It yywrap
2988If unset (i.e.,
2989.Dq %option noyywrap ) ,
2990makes the scanner not call
2991.Fn yywrap
2992upon an end-of-file, but simply assume that there are no more files to scan
2993(until the user points
2994.Fa yyin
2995at a new file and calls
2996.Fn yylex
2997again).
2998.El
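.Pp
Several of these options can be combined.
For example, the following sketch (the pattern is arbitrary) is a complete
program that reports the line number of every
.Qq TODO
in its input:
.Bd -literal -offset indent
%option main yylineno
%%
TODO        printf("line %d: TODO\en", yylineno);
\&.|\en      ;
.Ed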
2999.Pp
3000.Nm
3001scans rule actions to determine whether the
3002.Em REJECT
3003or
3004.Fn yymore
3005features are being used.
3006The
3007.Dq reject
3008and
3009.Dq yymore
3010options are available to override its decision as to whether to use the
3011options, either by setting them (e.g.,
3012.Dq %option reject )
3013to indicate the feature is indeed used,
3014or unsetting them to indicate it actually is not used
3015(e.g.,
3016.Dq %option noyymore ) .
3017.Pp
3018Three options take string-delimited values, offset with
3019.Sq = :
3020.Pp
3021.D1 %option outfile="ABC"
3022.Pp
3023is equivalent to
3024.Fl o Ns Ar ABC ,
3025and
3026.Pp
3027.D1 %option prefix="XYZ"
3028.Pp
3029is equivalent to
3030.Fl P Ns Ar XYZ .
3031Finally,
3032.Pp
3033.D1 %option yyclass="foo"
3034.Pp
3035only applies when generating a C++ scanner
3036.Pf ( Fl +
3037option).
3038It informs
3039.Nm
3040that
3041.Dq foo
3042has been derived as a subclass of yyFlexLexer, so
3043.Nm
3044will place actions in the member function
3045.Dq foo::yylex()
3046instead of
3047.Dq yyFlexLexer::yylex() .
3048It also generates a
3049.Dq yyFlexLexer::yylex()
3050member function that emits a run-time error (by invoking
3051.Dq yyFlexLexer::LexerError() )
3052if called.
3053See
3054.Sx GENERATING C++ SCANNERS ,
3055below, for additional information.
3056.Pp
3057A number of options are available for
3058lint
3059purists who want to suppress the appearance of unneeded routines
3060in the generated scanner.
3061Each of the following, if unset
3062(e.g.,
3063.Dq %option nounput ) ,
3064results in the corresponding routine not appearing in the generated scanner:
3065.Bd -unfilled -offset indent
3066input, unput
3067yy_push_state, yy_pop_state, yy_top_state
3068yy_scan_buffer, yy_scan_bytes, yy_scan_string
3069.Ed
3070.Pp
3071(though
3072.Fn yy_push_state
3073and friends won't appear anyway unless
3074.Dq %option stack
3075is being used).
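.Pp
For example, a scanner that uses neither
.Fn input
nor
.Fn unput
might start as in this sketch (the rules are placeholders):
.Bd -literal -offset indent
%option noyywrap nounput noinput
%%
[a-z]+      printf("word\en");
\&.|\en      ;
%%
int
main(void)
{
        return yylex();
}
.Ed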
3076.Sh PERFORMANCE CONSIDERATIONS
3077The main design goal of
3078.Nm
3079is that it generate high-performance scanners.
3080It has been optimized for dealing well with large sets of rules.
3081Aside from the effects on scanner speed of the table compression
3082.Fl C
3083options outlined above,
3084there are a number of options/actions which degrade performance.
3085These are, from most expensive to least:
3086.Bd -unfilled -offset indent
3087REJECT
3088%option yylineno
3089arbitrary trailing context
3090
3091pattern sets that require backing up
3092%array
3093%option interactive
3094%option always-interactive
3095
3096\&'^' beginning-of-line operator
3097yymore()
3098.Ed
3099.Pp
3100with the first three all being quite expensive
3101and the last two being quite cheap.
3102Note also that
3103.Fn unput
3104is implemented as a routine call that potentially does quite a bit of work,
3105while
3106.Fn yyless
3107is a quite-cheap macro; so if just putting back some excess text,
3108use
3109.Fn yyless .
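.Pp
A minimal sketch of putting back excess text with
.Fn yyless
(the rules are invented for the example):
.Bd -literal -offset indent
%option main
%%
[0-9]+[a-z]  {
                /* keep the digits, return the letter to the input */
                yyless(yyleng - 1);
                printf("number %s\en", yytext);
        }
[a-z]+       printf("word %s\en", yytext);
\&.|\en       ;
.Ed
.Pp
On the input
.Qq 12ab ,
the first rule initially matches
.Qq 12a ;
after
.Fn yyless
the action sees
.Qq 12
in
.Fa yytext ,
and the second rule then matches
.Qq ab .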
3110.Pp
3111.Em REJECT
3112should be avoided at all costs when performance is important.
3113It is a particularly expensive option.
3114.Pp
3115Getting rid of backing up is messy and often may be an enormous
3116amount of work for a complicated scanner.
In principle, one begins by using the
3118.Fl b
3119flag to generate a
3120.Pa lex.backup
3121file.
3122For example, on the input
3123.Bd -literal -offset indent
3124%%
3125foo        return TOK_KEYWORD;
3126foobar     return TOK_KEYWORD;
3127.Ed
3128.Pp
3129the file looks like:
3130.Bd -literal -offset indent
3131State #6 is non-accepting -
3132 associated rule line numbers:
3133       2       3
3134 out-transitions: [ o ]
3135 jam-transitions: EOF [ \e001-n  p-\e177 ]
3136
3137State #8 is non-accepting -
3138 associated rule line numbers:
3139       3
3140 out-transitions: [ a ]
3141 jam-transitions: EOF [ \e001-`  b-\e177 ]
3142
3143State #9 is non-accepting -
3144 associated rule line numbers:
3145       3
3146 out-transitions: [ r ]
3147 jam-transitions: EOF [ \e001-q  s-\e177 ]
3148
3149Compressed tables always back up.
3150.Ed
3151.Pp
3152The first few lines tell us that there's a scanner state in
3153which it can make a transition on an
3154.Sq o
3155but not on any other character,
3156and that in that state the currently scanned text does not match any rule.
3157The state occurs when trying to match the rules found
3158at lines 2 and 3 in the input file.
3159If the scanner is in that state and then reads something other than an
3160.Sq o ,
3161it will have to back up to find a rule which is matched.
3162With a bit of headscratching one can see that this must be the
3163state it's in when it has seen
3164.Sq fo .
3165When this has happened, if anything other than another
3166.Sq o
3167is seen, the scanner will have to back up to simply match the
3168.Sq f
3169.Pq by the default rule .
3170.Pp
3171The comment regarding State #8 indicates there's a problem when
3172.Qq foob
3173has been scanned.
3174Indeed, on any character other than an
3175.Sq a ,
3176the scanner will have to back up to accept
3177.Qq foo .
3178Similarly, the comment for State #9 concerns when
3179.Qq fooba
3180has been scanned and an
3181.Sq r
3182does not follow.
3183.Pp
3184The final comment reminds us that there's no point going to
3185all the trouble of removing backing up from the rules unless we're using
3186.Fl Cf
3187or
3188.Fl CF ,
3189since there's no performance gain doing so with compressed scanners.
3190.Pp
3191The way to remove the backing up is to add
3192.Qq error
3193rules:
3194.Bd -literal -offset indent
3195%%
3196foo    return TOK_KEYWORD;
3197foobar return TOK_KEYWORD;
3198
3199fooba  |
3200foob   |
3201fo {
3202        /* false alarm, not really a keyword */
3203        return TOK_ID;
3204}
3205.Ed
3206.Pp
3207Eliminating backing up among a list of keywords can also be done using a
3208.Qq catch-all
3209rule:
3210.Bd -literal -offset indent
3211%%
3212foo    return TOK_KEYWORD;
3213foobar return TOK_KEYWORD;
3214
3215[a-z]+ return TOK_ID;
3216.Ed
3217.Pp
3218This is usually the best solution when appropriate.
3219.Pp
3220Backing up messages tend to cascade.
3221With a complicated set of rules it's not uncommon to get hundreds of messages.
3222If one can decipher them, though,
3223it often only takes a dozen or so rules to eliminate the backing up
3224(though it's easy to make a mistake and have an error rule accidentally match
3225a valid token; a possible future
3226.Nm
3227feature will be to automatically add rules to eliminate backing up).
3228.Pp
3229It's important to keep in mind that the benefits of eliminating
3230backing up are gained only if
3231.Em every
3232instance of backing up is eliminated.
3233Leaving just one gains nothing.
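.Pp
As a sketch of how one might check this, the catch-all variant can be built
with
.Fl b
and a full-table option, and
.Pa lex.backup
inspected afterwards.
The file name
.Pa kw.l
and the token values are hypothetical:
.Bd -literal -offset indent
/* build: flex -b -Cf kw.l, then read lex.backup */
%{
enum { TOK_KEYWORD = 258, TOK_ID };  /* hypothetical token codes */
%}
%option noyywrap
%%
foo         return TOK_KEYWORD;
foobar      return TOK_KEYWORD;
[a-z]+      return TOK_ID;
\&.|\en      ;
.Ed
.Pp
If any state is still listed in
.Pa lex.backup ,
a rule is missing somewhere and none of the speed-up is obtained.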
3234.Pp
3235.Em Variable
3236trailing context
3237(where both the leading and trailing parts do not have a fixed length)
3238entails almost the same performance loss as
3239.Em REJECT
3240.Pq i.e., substantial .
3241So when possible a rule like:
3242.Bd -literal -offset indent
3243%%
3244mouse|rat/(cat|dog)   run();
3245.Ed
3246.Pp
3247is better written:
3248.Bd -literal -offset indent
3249%%
3250mouse/cat|dog         run();
3251rat/cat|dog           run();
3252.Ed
3253.Pp
3254or as
3255.Bd -literal -offset indent
3256%%
3257mouse|rat/cat         run();
3258mouse|rat/dog         run();
3259.Ed
3260.Pp
3261Note that here the special
3262.Sq |\&
3263action does not provide any savings, and can even make things worse (see
3264.Sx BUGS
3265below).
3266.Pp
3267Another area where the user can increase a scanner's performance
3268.Pq and one that's easier to implement
3269arises from the fact that the longer the tokens matched,
3270the faster the scanner will run.
3271This is because with long tokens the processing of most input
3272characters takes place in the
3273.Pq short
3274inner scanning loop, and does not often have to go through the additional work
3275of setting up the scanning environment (e.g.,
3276.Fa yytext )
3277for the action.
3278Recall the scanner for C comments:
3279.Bd -literal -offset indent
3280%x comment
3281%%
        int line_num = 1;
3283
3284"/*"                    BEGIN(comment);
3285
3286<comment>[^*\en]*
3287<comment>"*"+[^*/\en]*
3288<comment>\en             ++line_num;
3289<comment>"*"+"/"        BEGIN(INITIAL);
3290.Ed
3291.Pp
3292This could be sped up by writing it as:
3293.Bd -literal -offset indent
3294%x comment
3295%%
        int line_num = 1;
3297
3298"/*"                    BEGIN(comment);
3299
3300<comment>[^*\en]*
3301<comment>[^*\en]*\en      ++line_num;
3302<comment>"*"+[^*/\en]*
3303<comment>"*"+[^*/\en]*\en ++line_num;
3304<comment>"*"+"/"        BEGIN(INITIAL);
3305.Ed
3306.Pp
3307Now instead of each newline requiring the processing of another action,
3308recognizing the newlines is
3309.Qq distributed
3310over the other rules to keep the matched text as long as possible.
3311Note that adding rules does
3312.Em not
3313slow down the scanner!
3314The speed of the scanner is independent of the number of rules or
3315(modulo the considerations given at the beginning of this section)
3316how complicated the rules are with regard to operators such as
3317.Sq *
3318and
3319.Sq |\& .
3320.Pp
3321A final example in speeding up a scanner:
3322scan through a file containing identifiers and keywords, one per line
3323and with no other extraneous characters, and recognize all the keywords.
3324A natural first approach is:
3325.Bd -literal -offset indent
3326%%
3327asm      |
3328auto     |
3329break    |
3330\&... etc ...
3331volatile |
3332while    /* it's a keyword */
3333
3334\&.|\en     /* it's not a keyword */
3335.Ed
3336.Pp
3337To eliminate the back-tracking, introduce a catch-all rule:
3338.Bd -literal -offset indent
3339%%
3340asm      |
3341auto     |
3342break    |
3343\&... etc ...
3344volatile |
3345while    /* it's a keyword */
3346
3347[a-z]+   |
3348\&.|\en     /* it's not a keyword */
3349.Ed
3350.Pp
3351Now, if it's guaranteed that there's exactly one word per line,
3352then we can reduce the total number of matches by a half by
3353merging in the recognition of newlines with that of the other tokens:
3354.Bd -literal -offset indent
3355%%
3356asm\en      |
3357auto\en     |
3358break\en    |
3359\&... etc ...
3360volatile\en |
3361while\en    /* it's a keyword */
3362
3363[a-z]+\en   |
3364\&.|\en       /* it's not a keyword */
3365.Ed
3366.Pp
3367One has to be careful here,
3368as we have now reintroduced backing up into the scanner.
3369In particular, while we know that there will never be any characters
3370in the input stream other than letters or newlines,
3371.Nm
3372can't figure this out, and it will plan for possibly needing to back up
3373when it has scanned a token like
3374.Qq auto
3375and then the next character is something other than a newline or a letter.
3376Previously it would then just match the
3377.Qq auto
3378rule and be done, but now it has no
3379.Qq auto
3380rule, only an
3381.Qq auto\en
3382rule.
3383To eliminate the possibility of backing up,
3384we could either duplicate all rules but without final newlines, or,
3385since we never expect to encounter such an input and therefore don't
3386how it's classified, we can introduce one more catch-all rule,
3387this one which doesn't include a newline:
3388.Bd -literal -offset indent
3389%%
3390asm\en      |
3391auto\en     |
3392break\en    |
3393\&... etc ...
3394volatile\en |
3395while\en    /* it's a keyword */
3396
3397[a-z]+\en   |
3398[a-z]+     |
3399\&.|\en       /* it's not a keyword */
3400.Ed
3401.Pp
3402Compiled with
3403.Fl Cf ,
3404this is about as fast as one can get a
3405.Nm
3406scanner to go for this particular problem.
3407.Pp
3408A final note:
3409.Nm
3410is slow when matching NUL's,
3411particularly when a token contains multiple NUL's.
3412It's best to write rules which match short
3413amounts of text if it's anticipated that the text will often include NUL's.
3414.Pp
3415Another final note regarding performance: as mentioned above in the section
3416.Sx HOW THE INPUT IS MATCHED ,
3417dynamically resizing
3418.Fa yytext
3419to accommodate huge tokens is a slow process because it presently requires that
3420the
3421.Pq huge
3422token be rescanned from the beginning.
3423Thus if performance is vital, it is better to attempt to match
3424.Qq large
3425quantities of text but not
3426.Qq huge
3427quantities, where the cutoff between the two is at about 8K characters/token.
3428.Sh GENERATING C++ SCANNERS
3429.Nm
3430provides two different ways to generate scanners for use with C++.
3431The first way is to simply compile a scanner generated by
3432.Nm
3433using a C++ compiler instead of a C compiler.
3434This should not generate any compilation errors
3435(please report any found to the email address given in the
3436.Sx AUTHORS
3437section below).
3438C++ code can then be used in rule actions instead of C code.
3439Note that the default input source for scanners remains
3440.Fa yyin ,
3441and default echoing is still done to
3442.Fa yyout .
3443Both of these remain
3444.Fa FILE *
3445variables and not C++ streams.
3446.Pp
3447.Nm
3448can also be used to generate a C++ scanner class, using the
3449.Fl +
3450option (or, equivalently,
3451.Dq %option c++ ) ,
3452which is automatically specified if the name of the flex executable ends in a
3453.Sq + ,
3454such as
3455.Nm flex++ .
3456When using this option,
3457.Nm
3458defaults to generating the scanner to the file
3459.Pa lex.yy.cc
3460instead of
3461.Pa lex.yy.c .
3462The generated scanner includes the header file
3463.Aq Pa g++/FlexLexer.h ,
3464which defines the interface to two C++ classes.
3465.Pp
3466The first class,
3467.Em FlexLexer ,
3468provides an abstract base class defining the general scanner class interface.
3469It provides the following member functions:
3470.Bl -tag -width Ds
3471.It const char* YYText()
3472Returns the text of the most recently matched token, the equivalent of
3473.Fa yytext .
3474.It int YYLeng()
3475Returns the length of the most recently matched token, the equivalent of
3476.Fa yyleng .
3477.It int lineno() const
3478Returns the current input line number
3479(see
3480.Dq %option yylineno ) ,
3481or 1 if
3482.Dq %option yylineno
3483was not used.
3484.It void set_debug(int flag)
3485Sets the debugging flag for the scanner, equivalent to assigning to
3486.Fa yy_flex_debug
3487(see the
3488.Sx OPTIONS
3489section above).
3490Note that the scanner must be built using
3491.Dq %option debug
3492to include debugging information in it.
3493.It int debug() const
3494Returns the current setting of the debugging flag.
3495.El
3496.Pp
3497Also provided are member functions equivalent to
3498.Fn yy_switch_to_buffer ,
3499.Fn yy_create_buffer
3500(though the first argument is an
3501.Fa std::istream*
3502object pointer and not a
3503.Fa FILE* ) ,
3504.Fn yy_flush_buffer ,
3505.Fn yy_delete_buffer ,
3506and
3507.Fn yyrestart
3508(again, the first argument is an
3509.Fa std::istream*
3510object pointer).
3511.Pp
3512The second class defined in
3513.Aq Pa g++/FlexLexer.h
3514is
3515.Fa yyFlexLexer ,
3516which is derived from
3517.Fa FlexLexer .
3518It defines the following additional member functions:
3519.Bl -tag -width Ds
3520.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
3521Constructs a
3522.Fa yyFlexLexer
3523object using the given streams for input and output.
3524If not specified, the streams default to
3525.Fa cin
3526and
3527.Fa cout ,
3528respectively.
3529.It virtual int yylex()
3530Performs the same role as
3531.Fn yylex
3532does for ordinary flex scanners: it scans the input stream, consuming
3533tokens, until a rule's action returns a value.
3534If subclass
3535.Sq S
3536is derived from
3537.Fa yyFlexLexer ,
3538in order to access the member functions and variables of
3539.Sq S
3540inside
3541.Fn yylex ,
3542use
3543.Dq %option yyclass="S"
3544to inform
3545.Nm
3546that the
3547.Sq S
3548subclass will be used instead of
3549.Fa yyFlexLexer .
3550In this case, rather than generating
3551.Dq yyFlexLexer::yylex() ,
3552.Nm
3553generates
3554.Dq S::yylex()
3555(and also generates a dummy
3556.Dq yyFlexLexer::yylex()
3557that calls
3558.Dq yyFlexLexer::LexerError()
3559if called).
3560.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
3561Reassigns
3562.Fa yyin
3563to
3564.Fa new_in
3565.Pq if non-nil
3566and
3567.Fa yyout
3568to
3569.Fa new_out
3570.Pq ditto ,
3571deleting the previous input buffer if
3572.Fa yyin
3573is reassigned.
3574.It int yylex(std::istream* new_in, std::ostream* new_out = 0)
3575First switches the input streams via
3576.Dq switch_streams(new_in, new_out)
3577and then returns the value of
3578.Fn yylex .
3579.El
3580.Pp
3581In addition,
3582.Fa yyFlexLexer
3583defines the following protected virtual functions which can be redefined
3584in derived classes to tailor the scanner:
3585.Bl -tag -width Ds
3586.It virtual int LexerInput(char* buf, int max_size)
3587Reads up to
3588.Fa max_size
3589characters into
3590.Fa buf
3591and returns the number of characters read.
3592To indicate end-of-input, return 0 characters.
3593Note that
3594.Qq interactive
3595scanners (see the
3596.Fl B
3597and
3598.Fl I
3599flags) define the macro
3600.Dv YY_INTERACTIVE .
3601If
3602.Fn LexerInput
3603has been redefined, and it's necessary to take different actions depending on
3604whether or not the scanner might be scanning an interactive input source,
3605it's possible to test for the presence of this name via
3606.Dq #ifdef .
3607.It virtual void LexerOutput(const char* buf, int size)
3608Writes out
3609.Fa size
3610characters from the buffer
3611.Fa buf ,
3612which, while NUL-terminated, may also contain
3613.Qq internal
3614NUL's if the scanner's rules can match text with NUL's in them.
3615.It virtual void LexerError(const char* msg)
3616Reports a fatal error message.
3617The default version of this function writes the message to the stream
3618.Fa cerr
3619and exits.
3620.El
3621.Pp
3622Note that a
3623.Fa yyFlexLexer
3624object contains its entire scanning state.
3625Thus such objects can be used to create reentrant scanners.
3626Multiple instances of the same
3627.Fa yyFlexLexer
3628class can be instantiated, and multiple C++ scanner classes can be combined
3629in the same program using the
3630.Fl P
3631option discussed above.
3632.Pp
3633Finally, note that the
3634.Dq %array
3635feature is not available to C++ scanner classes;
3636.Dq %pointer
3637must be used
3638.Pq the default .
3639.Pp
3640Here is an example of a simple C++ scanner:
3641.Bd -literal -offset indent
%{
// An example of using the flex C++ scanner class.
#include <iostream>
using std::cout;

int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        while ((c = yyinput()) != 0) {
                if(c == '\en')
                    ++mylineno;
                else if(c == '*') {
                    if ((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                }
        }
}

{number}  cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    cout << "name " << YYText() << '\en';

{string}  cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
	FlexLexer* lexer = new yyFlexLexer;
	while(lexer->yylex() != 0)
	    ;
	delete lexer;
	return 0;
}
3696.Ed
3697.Pp
3698To create multiple
3699.Pq different
3700lexer classes, use the
3701.Fl P
3702flag
3703(or the
3704.Dq prefix=
3705option)
3706to rename each
3707.Fa yyFlexLexer
3708to some other
3709.Fa xxFlexLexer .
3710.Aq Pa g++/FlexLexer.h
3711can then be included in other sources once per lexer class, first renaming
3712.Fa yyFlexLexer
3713as follows:
3714.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
3722.Ed
3723.Pp
This assumes, for example, that
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
3729.Pp
3730.Sy IMPORTANT :
3731the present form of the scanning class is experimental
3732and may change considerably between major releases.
3733.Sh INCOMPATIBILITIES WITH LEX AND POSIX
3734.Nm
3735is a rewrite of the
3736.At
3737.Nm lex
3738tool
3739(the two implementations do not share any code, though),
3740with some extensions and incompatibilities, both of which are of concern
3741to those who wish to write scanners acceptable to either implementation.
3742.Nm
3743is fully compliant with the
3744.Tn POSIX
3745.Nm lex
3746specification, except that when using
3747.Dq %pointer
3748.Pq the default ,
3749a call to
3750.Fn unput
3751destroys the contents of
3752.Fa yytext ,
3753which is counter to the
3754.Tn POSIX
3755specification.
3756.Pp
3757In this section we discuss all of the known areas of incompatibility between
3758.Nm ,
3759.At
3760.Nm lex ,
3761and the
3762.Tn POSIX
3763specification.
3764.Pp
3765.Nm flex Ns 's
3766.Fl l
3767option turns on maximum compatibility with the original
3768.At
3769.Nm lex
3770implementation, at the cost of a major loss in the generated scanner's
3771performance.
3772We note below which incompatibilities can be overcome using the
3773.Fl l
3774option.
3775.Pp
3776.Nm
3777is fully compatible with
3778.Nm lex
3779with the following exceptions:
3780.Bl -dash
3781.It
3782The undocumented
3783.Nm lex
3784scanner internal variable
3785.Fa yylineno
3786is not supported unless
3787.Fl l
3788or
3789.Dq %option yylineno
3790is used.
3791.Pp
3792.Fa yylineno
3793should be maintained on a per-buffer basis, rather than a per-scanner
3794.Pq single global variable
3795basis.
3796.Pp
3797.Fa yylineno
3798is not part of the
3799.Tn POSIX
3800specification.
3801.It
3802The
3803.Fn input
3804routine is not redefinable, though it may be called to read characters
3805following whatever has been matched by a rule.
3806If
3807.Fn input
3808encounters an end-of-file, the normal
3809.Fn yywrap
3810processing is done.
3811A
3812.Dq real
3813end-of-file is returned by
3814.Fn input
3815as
3816.Dv EOF .
3817.Pp
3818Input is instead controlled by defining the
3819.Dv YY_INPUT
3820macro.
3821.Pp
3822The
3823.Nm
3824restriction that
3825.Fn input
3826cannot be redefined is in accordance with the
3827.Tn POSIX
3828specification, which simply does not specify any way of controlling the
3829scanner's input other than by making an initial assignment to
3830.Fa yyin .
3831.It
3832The
3833.Fn unput
3834routine is not redefinable.
3835This restriction is in accordance with
3836.Tn POSIX .
3837.It
3838.Nm
3839scanners are not as reentrant as
3840.Nm lex
3841scanners.
3842In particular, if a scanner is interactive and
3843an interrupt handler long-jumps out of the scanner,
3844and the scanner is subsequently called again,
3845the following error message may be displayed:
3846.Pp
3847.D1 fatal flex scanner internal error--end of buffer missed
3848.Pp
3849To reenter the scanner, first use
3850.Pp
3851.Dl yyrestart(yyin);
3852.Pp
3853Note that this call will throw away any buffered input;
3854usually this isn't a problem with an interactive scanner.
3855.Pp
3856Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
3858See
3859.Sx GENERATING C++ SCANNERS
3860above for details.
3861.It
3862.Fn output
3863is not supported.
3864Output from the
3865.Em ECHO
3866macro is done to the file-pointer
3867.Fa yyout
3868.Pq default stdout .
3869.Pp
3870.Fn output
3871is not part of the
3872.Tn POSIX
3873specification.
3874.It
3875.Nm lex
3876does not support exclusive start conditions
3877.Pq %x ,
3878though they are in the
3879.Tn POSIX
3880specification.
3881.It
3882When definitions are expanded,
3883.Nm
3884encloses them in parentheses.
3885With
3886.Nm lex ,
3887the following:
3888.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
3893.Ed
3894.Pp
3895will not match the string
3896.Qq foo
3897because when the macro is expanded the rule is equivalent to
3898.Qq foo[A-Z][A-Z0-9]*?
3899and the precedence is such that the
3900.Sq ?\&
3901is associated with
3902.Qq [A-Z0-9]* .
3903With
3904.Nm ,
3905the rule will be expanded to
3906.Qq foo([A-Z][A-Z0-9]*)?
3907and so the string
3908.Qq foo
3909will match.
3910.Pp
3911Note that if the definition begins with
3912.Sq ^
3913or ends with
3914.Sq $
3915then it is not expanded with parentheses, to allow these operators to appear in
3916definitions without losing their special meanings.
3917But the
3918.Sq Aq s ,
3919.Sq / ,
3920and
3921.Aq Aq EOF
3922operators cannot be used in a
3923.Nm
3924definition.
3925.Pp
3926Using
3927.Fl l
3928results in the
3929.Nm lex
3930behavior of no parentheses around the definition.
3931.Pp
3932The
3933.Tn POSIX
3934specification is that the definition be enclosed in parentheses.
3935.It
3936Some implementations of
3937.Nm lex
3938allow a rule's action to begin on a separate line,
3939if the rule's pattern has trailing whitespace:
3940.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
3944.Ed
3945.Pp
3946.Nm
3947does not support this feature.
3948.It
3949The
3950.Nm lex
3951.Sq %r
3952.Pq generate a Ratfor scanner
3953option is not supported.
3954It is not part of the
3955.Tn POSIX
3956specification.
3957.It
3958After a call to
3959.Fn unput ,
3960.Fa yytext
3961is undefined until the next token is matched,
3962unless the scanner was built using
3963.Dq %array .
3964This is not the case with
3965.Nm lex
3966or the
3967.Tn POSIX
3968specification.
3969The
3970.Fl l
3971option does away with this incompatibility.
3972.It
3973The precedence of the
3974.Sq {}
3975.Pq numeric range
3976operator is different.
3977.Nm lex
3978interprets
3979.Qq abc{1,3}
3980as match one, two, or three occurrences of
3981.Sq abc ,
3982whereas
3983.Nm
3984interprets it as match
3985.Sq ab
3986followed by one, two, or three occurrences of
3987.Sq c .
3988The latter is in agreement with the
3989.Tn POSIX
3990specification.
3991.It
3992The precedence of the
3993.Sq ^
3994operator is different.
3995.Nm lex
3996interprets
3997.Qq ^foo|bar
3998as match either
3999.Sq foo
4000at the beginning of a line, or
4001.Sq bar
4002anywhere, whereas
4003.Nm
4004interprets it as match either
4005.Sq foo
4006or
4007.Sq bar
4008if they come at the beginning of a line.
4009The latter is in agreement with the
4010.Tn POSIX
4011specification.
4012.It
4013The special table-size declarations such as
4014.Sq %a
4015supported by
4016.Nm lex
4017are not required by
4018.Nm
4019scanners;
4020.Nm
4021ignores them.
4022.It
4023The name
4024.Dv FLEX_SCANNER
4025is #define'd so scanners may be written for use with either
4026.Nm
4027or
4028.Nm lex .
4029Scanners also include
4030.Dv YY_FLEX_MAJOR_VERSION
4031and
4032.Dv YY_FLEX_MINOR_VERSION
4033indicating which version of
4034.Nm
4035generated the scanner
4036(for example, for the 2.5 release, these defines would be 2 and 5,
4037respectively).
4038.El
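.Pp
The
.Sq {}
precedence and the parenthesized expansion of definitions described above
both follow POSIX, so they can be checked outside of
.Nm
with POSIX extended regular expressions, in which an interval likewise applies
only to the preceding character and parentheses group a subexpression.
The following C program is only a sketch of that check using
.Xr regcomp 3 ;
it does not run
.Nm ,
the helper name
.Fn matches
is hypothetical, and the unparenthesized
.Nm lex
expansion is written with explicit grouping so that its meaning is unambiguous:
.Bd -literal -offset indent
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: 1 if the extended regular expression matches text. */
static int
matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return 0;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}

int
main(void)
{
	/* The interval applies to 'c' only: "ab", then one to three 'c'. */
	puts(matches("^abc{1,3}$", "abccc") ? "match" : "no match");
	puts(matches("^abc{1,3}$", "abcabc") ? "match" : "no match");

	/* NAME is [A-Z][A-Z0-9]*; flex expands foo{NAME}? with parentheses. */
	puts(matches("^foo([A-Z][A-Z0-9]*)?$", "foo") ? "match" : "no match");
	/* Textual expansion: the '?' reaches only the trailing [A-Z0-9]*. */
	puts(matches("^foo[A-Z]([A-Z0-9]*)?$", "foo") ? "match" : "no match");
	return 0;
}
.Ed
.Pp
Under these assumptions the first and third calls report a match and the
second and fourth do not.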
4039.Pp
4040The following
4041.Nm
4042features are not included in
4043.Nm lex
4044or the
4045.Tn POSIX
4046specification:
4047.Bd -unfilled -offset indent
C++ scanners
%option
start condition scopes
start condition stacks
interactive/non-interactive scanners
yy_scan_string() and friends
yyterminate()
yy_set_interactive()
yy_set_bol()
YY_AT_BOL()
<<EOF>>
<*>
YY_DECL
YY_START
YY_USER_ACTION
YY_USER_INIT
#line directives
%{}'s around actions
multiple actions on a line
4067.Ed
4068.Pp
4069plus almost all of the
4070.Nm
4071flags.
4072The last feature in the list refers to the fact that with
4073.Nm
4074multiple actions can be placed on the same line,
4075separated with semi-colons, while with
4076.Nm lex ,
4077the following
4078.Pp
4079.Dl foo    handle_foo(); ++num_foos_seen;
4080.Pp
4081is
4082.Pq rather surprisingly
4083truncated to
4084.Pp
4085.Dl foo    handle_foo();
4086.Pp
4087.Nm
4088does not truncate the action.
4089Actions that are not enclosed in braces
4090are simply terminated at the end of the line.
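.Pp
Because
.Dv FLEX_SCANNER
is defined only in scanners generated by
.Nm ,
code shared between
.Nm
and
.Nm lex
can be made conditional on it.
The following C fragment is a minimal sketch; the function name
.Fn scanner_generator
is hypothetical, and when the fragment is compiled on its own rather than as
part of a generated scanner the macro is simply undefined:
.Bd -literal -offset indent
#include <stdio.h>

/* Hypothetical helper: name the tool that generated this scanner. */
static const char *
scanner_generator(void)
{
#ifdef FLEX_SCANNER
	return "flex";
#else
	return "lex";
#endif
}

int
main(void)
{
	puts(scanner_generator());
#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION)
	/* Only meaningful inside a scanner generated by flex. */
	printf("%d.%d", YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION);
	puts("");
#endif
	return 0;
}
.Ed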
4091.Sh FILES
4092.Bl -tag -width "<g++/FlexLexer.h>"
4093.It flex.skl
4094Skeleton scanner.
4095This file is only used when building flex, not when
4096.Nm
4097executes.
4098.It lex.backup
4099Backing-up information for the
4100.Fl b
4101flag (called
4102.Pa lex.bck
4103on some systems).
4104.It lex.yy.c
4105Generated scanner
4106(called
4107.Pa lexyy.c
4108on some systems).
4109.It lex.yy.cc
4110Generated C++ scanner class, when using
4111.Fl + .
4112.It Aq g++/FlexLexer.h
4113Header file defining the C++ scanner base class,
4114.Fa FlexLexer ,
4115and its derived class,
4116.Fa yyFlexLexer .
4117.It /usr/lib/libl.*
4118.Nm
4119libraries.
4120The
4121.Pa /usr/lib/libfl.*\&
4122libraries are links to these.
4123Scanners must be linked using either
4124.Fl \&ll
4125or
4126.Fl lfl .
4127.El
4128.Sh EXIT STATUS
4129.Ex -std flex
4130.Sh DIAGNOSTICS
4131.Bl -diag
4132.It warning, rule cannot be matched
4133Indicates that the given rule cannot be matched because it follows other rules
4134that will always match the same text as it.
4135For example, in the following
4136.Dq foo
4137cannot be matched because it comes after an identifier
4138.Qq catch-all
4139rule:
4140.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
4143.Ed
4144.Pp
4145Using
4146.Em REJECT
4147in a scanner suppresses this warning.
4148.It "warning, \-s option given but default rule can be matched"
4149Means that it is possible
4150.Pq perhaps only in a particular start condition
4151that the default rule
4152.Pq match any single character
4153is the only one that will match a particular input.
4154Since
4155.Fl s
4156was given, presumably this is not intended.
4157.It reject_used_but_not_detected undefined
4158.It yymore_used_but_not_detected undefined
4159These errors can occur at compile time.
4160They indicate that the scanner uses
4161.Em REJECT
4162or
4163.Fn yymore
4164but that
4165.Nm
4166failed to notice the fact, meaning that
4167.Nm
4168scanned the first two sections looking for occurrences of these actions
4169and failed to find any, but somehow they snuck in
4170.Pq via an #include file, for example .
4171Use
4172.Dq %option reject
4173or
4174.Dq %option yymore
4175to indicate to
4176.Nm
4177that these features are really needed.
4178.It flex scanner jammed
4179A scanner compiled with
4180.Fl s
4181has encountered an input string which wasn't matched by any of its rules.
4182This error can also occur due to internal problems.
4183.It token too large, exceeds YYLMAX
4184The scanner uses
4185.Dq %array
4186and one of its rules matched a string longer than the
4187.Dv YYLMAX
4188constant
4189.Pq 8K bytes by default .
4190The value can be increased by #define'ing
4191.Dv YYLMAX
4192in the definitions section of
4193.Nm
4194input.
4195.It "scanner requires \-8 flag to use the character 'x'"
The scanner specification includes recognizing the 8-bit character
.Sq x ,
but the
.Fl 8
flag was not specified and the scanner defaulted to 7-bit because the
4201.Fl Cf
4202or
4203.Fl CF
4204table compression options were used.
4205See the discussion of the
4206.Fl 7
4207flag for details.
4208.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
4210could not hold both the pushed-back text and the current token in
4211.Fa yytext .
4212Ideally the scanner should dynamically resize the buffer in this case,
4213but at present it does not.
4214.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4215The scanner was working on matching an extremely large token and needed
4216to expand the input buffer.
4217This doesn't work with scanners that use
4218.Em REJECT .
4219.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner that is reentered after a long-jump
4221has jumped out
4222.Pq or over
4223the scanner's activation frame.
4224Before reentering the scanner, use:
4225.Pp
4226.Dl yyrestart(yyin);
4227.Pp
4228or, as noted above, switch to using the C++ scanner class.
4229.It "too many start conditions in <> construct!"
4230More start conditions than exist were listed in a <> construct
4231(so at least one of them must have been listed twice).
4232.El
4233.Sh SEE ALSO
4234.Xr awk 1 ,
4235.Xr sed 1 ,
4236.Xr yacc 1
4237.Rs
4238.%A John Levine
4239.%A Tony Mason
4240.%A Doug Brown
4241.%B Lex & Yacc
4242.%I O'Reilly and Associates
4243.%N 2nd edition
4244.Re
4245.Rs
4246.%A Alfred Aho
4247.%A Ravi Sethi
4248.%A Jeffrey Ullman
4249.%B Compilers: Principles, Techniques and Tools
4250.%I Addison-Wesley
4251.%D 1986
4252.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4253.Re
4254.Sh STANDARDS
4255The
4256.Nm lex
4257utility is compliant with the
4258.St -p1003.1-2008
4259specification,
4260though its presence is optional.
4261.Pp
4262The flags
4263.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
4264.Op Fl -help ,
4265and
4266.Op Fl -version
4267are extensions to that specification.
4268.Pp
4269See also the
4270.Sx INCOMPATIBILITIES WITH LEX AND POSIX
4271section, above.
4272.Sh AUTHORS
4273Vern Paxson, with the help of many ideas and much inspiration from
4274Van Jacobson.
4275Original version by Jef Poskanzer.
4276The fast table representation is a partial implementation of a design done by
4277Van Jacobson.
4278The implementation was done by Kevin Gong and Vern Paxson.
4279.Pp
4280Thanks to the many
4281.Nm
4282beta-testers, feedbackers, and contributors, especially Francois Pinard,
4283Casey Leedom,
4284Robert Abramovitz,
4285Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4286Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4287Karl Berry, Peter A. Bigot, Simon Blanchard,
4288Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4289Brian Clapper, J.T. Conklin,
4290Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4291Daniels, Chris G. Demetriou, Theo de Raadt,
4292Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4293Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4294Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4295Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4296Jan Hajic, Charles Hemphill, NORO Hideo,
4297Jarkko Hietaniemi, Scott Hofmann,
4298Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4299Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4300Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4301Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4302Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4303Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4304David Loffredo, Mike Long,
4305Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4306Bengt Martensson, Chris Metcalf,
4307Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4308G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4309Richard Ohnemus, Karsten Pahnke,
4310Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4311Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4312Frederic Raimbault, Pat Rankin, Rick Richardson,
4313Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4314Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4315Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4316Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
4317Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4318Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4319Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4320Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4321and those whose names have slipped my marginal mail-archiving skills
4322but whose contributions are appreciated all the
4323same.
4324.Pp
4325Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4326John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4327Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4328distribution headaches.
4329.Pp
4330Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4331to Benson Margulies and Fred Burke for C++ support;
4332to Kent Williams and Tom Epperly for C++ class support;
4333to Ove Ewerlid for support of NUL's;
4334and to Eric Hughes for support of multiple buffers.
4335.Pp
4336This work was primarily done when I was with the Real Time Systems Group
4337at the Lawrence Berkeley Laboratory in Berkeley, CA.
4338Many thanks to all there for the support I received.
4339.Pp
4340Send comments to
4341.Aq Mt vern@ee.lbl.gov .
4342.Sh BUGS
4343Some trailing context patterns cannot be properly matched and generate
4344warning messages
4345.Pq "dangerous trailing context" .
4346These are patterns where the ending of the first part of the rule
4347matches the beginning of the second part, such as
4348.Qq zx*/xy* ,
4349where the
4350.Sq x*
4351matches the
4352.Sq x
4353at the beginning of the trailing context.
4354(Note that the POSIX draft states that the text matched by such patterns
4355is undefined.)
4356.Pp
4357For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above-mentioned performance loss.
4359In particular, parts using
4360.Sq |\&
4361or
4362.Sq {n}
4363(such as
4364.Qq foo{3} )
4365are always considered variable-length.
4366.Pp
4367Combining trailing context with the special
4368.Sq |\&
4369action can result in fixed trailing context being turned into
4370the more expensive variable trailing context.
This happens, for example, in the following:
4372.Bd -literal -offset indent
%%
abc      |
xyz/def
4376.Ed
4377.Pp
4378Use of
4379.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
4381.Dq %array
4382directive
4383or the
4384.Fl l
4385option has been used.
4386.Pp
4387Pattern-matching of NUL's is substantially slower than matching other
4388characters.
4389.Pp
4390Dynamic resizing of the input buffer is slow, as it entails rescanning
4391all the text matched so far by the current
4392.Pq generally huge
4393token.
4394.Pp
4395Due to both buffering of input and read-ahead,
4396it is not possible to intermix calls to
4397.Aq Pa stdio.h
routines, such as
4399.Fn getchar ,
4400with
4401.Nm
4402rules and expect it to work.
4403Call
4404.Fn input
4405instead.
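.Pp
The effect of read-ahead can be imitated without
.Nm .
The following C sketch assumes a reader that fills its buffer a block at a
time, much as a buffering scanner would; the function name
.Fn getc_after_block_read
and the block size are assumptions made for illustration:
.Bd -literal -offset indent
#include <stdio.h>

#define BLOCK_SIZE 8192		/* assumed buffer size for this sketch */

/*
 * Store input in a temporary file, read one block from it as a buffering
 * scanner would, then return what getc() sees next (-2 on setup failure).
 */
static int
getc_after_block_read(const char *input)
{
	char buf[BLOCK_SIZE];
	FILE *fp = tmpfile();
	size_t n;
	int c;

	if (fp == NULL)
		return -2;
	fputs(input, fp);
	rewind(fp);
	/* The reader takes the whole block, not just the first token. */
	n = fread(buf, 1, sizeof(buf), fp);
	c = (n > 0) ? getc(fp) : -2;
	fclose(fp);
	return c;
}

int
main(void)
{
	/* Everything was read ahead, so stdio has nothing left to give. */
	puts(getc_after_block_read("token rest") == EOF ?
	    "getc: end of file" : "getc: got a character");
	return 0;
}
.Ed
.Pp
Here the second word is already in the reader's buffer, so
.Fn getc
reports end-of-file even if only the first token had been matched so far.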
4406.Pp
The total number of table entries listed by the
4408.Fl v
4409flag excludes the number of table entries needed to determine
4410what rule has been matched.
4411The number of entries is equal to the number of DFA states
4412if the scanner does not use
4413.Em REJECT ,
4414and somewhat greater than the number of states if it does.
4415.Pp
4416.Em REJECT
4417cannot be used with the
4418.Fl f
4419or
4420.Fl F
4421options.
4422.Pp
4423The
4424.Nm
4425internal algorithms need documentation.
4426