.\"	$OpenBSD: flex.1,v 1.33 2013/01/18 21:48:43 jmc Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: January 18 2013 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
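.Pp
As a rough sketch of that build sequence, assuming a scanner description
in a hypothetical file named
.Pa scanner.l
and a C compiler installed as
.Xr cc 1 :
.Bd -literal -offset indent
$ flex scanner.l                 # writes lex.yy.c
$ cc -o scanner lex.yy.c -lfl    # link against the flex library
$ ./scanner < input.txt          # scan a file given on stdin
.Ed
.Pp
The names
.Pa scanner.l ,
.Pa scanner ,
and
.Pa input.txt
are placeholders chosen for this illustration.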
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from AT&T lex and the
.Tn POSIX
lex standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.Nm .
The following
.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en      ++num_lines; ++num_chars;
\&.       ++num_chars;

%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
            num_lines, num_chars);
	return 0;
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
        printf("An integer: %s (%d)\en", yytext,
            atoi(yytext));
}

{DIGIT}+"."{DIGIT}* {
        printf("A float: %s (%g)\en", yytext,
            atof(yytext));
}

if|then|begin|end|procedure|function {
        printf("A keyword: %s\en", yytext);
}

{ID}    printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);

"{"[^}\en]*"}"     /* eat up one-line comments */

[ \et\en]+          /* eat up whitespace */

\&.       printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
        ++argv; --argc;  /* skip over program name */
        if (argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
        return 0;
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and to continue to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
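.Pp
Because the expansion is parenthesized, a definition containing an
alternation can be combined with other operators safely.
A minimal sketch, using a hypothetical definition named
.Qq SIGN :
.Bd -literal -offset indent
SIGN     "+"|"-"
DIGIT    [0-9]

%%
{SIGN}?{DIGIT}+    printf("signed integer: %s\en", yytext);
.Ed
.Pp
Here
.Qq {SIGN}?
expands to
.Pp
.Dl ("+"|"-")?
.Pp
so the
.Sq ?\&
applies to the whole alternation rather than to its last alternative only.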
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.D1 pattern	action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
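.Pp
The pieces described in this section fit together roughly as follows.
This is only a sketch; the variable names
.Qq total
and
.Qq seen
are invented for illustration:
.Bd -literal -offset indent
%{
/* copied verbatim to lex.yy.c; visible to yylex() and main() */
#include <stdio.h>
int total = 0;
%}

%%
        /* indented text before the first rule: local to yylex() */
        int seen = 0;

[0-9]+  { ++seen; ++total; }
\&.|\en    /* discard everything else */

%%
/* user code section: copied verbatim */
int
main(void)
{
        yylex();
        printf("%d numbers\en", total);
        return 0;
}
.Ed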
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
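.Pp
As an illustration, rules for C-style identifiers and whitespace could be
written with character class expressions instead of explicit ranges
(a sketch, not taken from any particular scanner):
.Bd -literal -offset indent
%%
[[:alpha:]_][[:alnum:]_]*    printf("identifier: %s\en", yytext);
[[:space:]]+                 /* skip whitespace */
.Ed
.Pp
The underscore is an ordinary character listed alongside the class
expression inside the same brackets.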
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern, and, as well as with
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo      |
bar$     /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
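.Pp
To make the first of these notes concrete, here is a sketch of a rule for
double-quoted strings that must end on the line where they begin.
Listing the newline in the negated class keeps an unterminated quote
from swallowing the following lines of input:
.Bd -literal -offset indent
%%
\e"[^"\en]*\e"    printf("string: %s\en", yytext);
.Ed
.Pp
This sketch makes no attempt to handle escaped quotes inside the string.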
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
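.Pp
A small sketch of how the longest-match and first-rule conventions interact:
.Bd -literal -offset indent
%%
if        printf("keyword: %s\en", yytext);
[a-z]+    printf("identifier: %s\en", yytext);
.Ed
.Pp
On the input
.Qq if
both rules match two characters, so the rule listed first is chosen and the
scanner reports a keyword.
On the input
.Qq iffy
the second rule matches four characters against the first rule's two,
so the longer match is chosen and the scanner reports an identifier.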
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
lex compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer
.Fa yytext
grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
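.Pp
For instance, a scanner that has to keep an existing
.Qq extern char yytext[];
declaration working might begin its definitions section as follows.
The value 16384 is an arbitrary choice made for this illustration:
.Bd -literal -offset indent
%{
#define YYLMAX 16384    /* size of the yytext array */
%}
%array

%%
.Ed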
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+        putchar(' ');
[ \et]+$       /* ignore this token */
.Ed
.Pp
If the action contains a
.Sq { ,
then the action extends until the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
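.Pp
The following sketch shows a caller driving
.Fn yylex
one token at a time.
The token codes
.Dv NUMBER
and
.Dv WORD
are made up for this example; any non-zero values would do, since
.Fn yylex
returns 0 when it reaches the end of the file:
.Bd -literal -offset indent
%{
#include <stdio.h>
enum { NUMBER = 1, WORD };    /* hypothetical token codes */
%}

%%
[0-9]+    return NUMBER;
[a-z]+    return WORD;
\&.|\en      /* ignore anything else */

%%
int
main(void)
{
        int tok;

        /* each call resumes where the previous one left off */
        while ((tok = yylex()) != 0)
                printf("token %d: %s\en", tok, yytext);
        return 0;
}
.Ed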
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
	int word_count = 0;
%%

frob        special(); REJECT;
[^ \et\en]+   ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
\&.|\en     /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
992code immediately following it in the action will not be executed.
993.It yymore()
994Tells the scanner that the next time it matches a rule, the corresponding
995token should be appended onto the current value of
996.Fa yytext
997rather than replacing it.
998For example, given the input
999.Qq mega-kludge
1000the following will write
1001.Qq mega-mega-kludge
1002to the output:
1003.Bd -literal -offset indent
1004%%
1005mega-    ECHO; yymore();
1006kludge   ECHO;
1007.Ed
1008.Pp
1009First
1010.Qq mega-
1011is matched and echoed to the output.
1012Then
1013.Qq kludge
1014is matched, but the previous
1015.Qq mega-
1016is still hanging around at the beginning of
1017.Fa yytext
1018so the
1019.Em ECHO
1020for the
1021.Qq kludge
1022rule will actually write
1023.Qq mega-kludge .
1024.Pp
1025Two notes regarding use of
1026.Fn yymore :
1027First,
1028.Fn yymore
1029depends on the value of
1030.Fa yyleng
1031correctly reflecting the size of the current token, so
1032.Fa yyleng
1033must not be modified when using
1034.Fn yymore .
1035Second, the presence of
1036.Fn yymore
1037in the scanner's action entails a minor performance penalty in the
1038scanner's matching speed.
1039.It yyless(n)
1040Returns all but the first
1041.Ar n
1042characters of the current token back to the input stream, where they
1043will be rescanned when the scanner looks for the next match.
1044.Fa yytext
1045and
1046.Fa yyleng
1047are adjusted appropriately (e.g.,
1048.Fa yyleng
1049will now be equal to
1050.Ar n ) .
1051For example, on the input
1052.Qq foobar
1053the following will write out
1054.Qq foobarbar :
1055.Bd -literal -offset indent
1056%%
1057foobar    ECHO; yyless(3);
1058[a-z]+    ECHO;
1059.Ed
1060.Pp
1061An argument of 0 to
1062.Fa yyless
1063will cause the entire current input string to be scanned again.
1064Unless how the scanner will subsequently process its input has been changed
1065(using
1066.Em BEGIN ,
1067for example),
1068this will result in an endless loop.
1069.Pp
1070Note that
1071.Fa yyless
1072is a macro and can only be used in the
1073.Nm
1074input file, not from other source files.
1075.It unput(c)
1076Puts the character
1077.Ar c
1078back into the input stream.
1079It will be the next character scanned.
1080The following action will take the current token and cause it
1081to be rescanned enclosed in parentheses.
1082.Bd -literal -offset indent
1083{
1084        int i;
1085        char *yycopy;
1086
1087        /* Copy yytext because unput() trashes yytext */
1088        if ((yycopy = strdup(yytext)) == NULL)
1089                err(1, NULL);
1090        unput(')');
1091        for (i = yyleng - 1; i >= 0; --i)
1092                unput(yycopy[i]);
1093        unput('(');
1094        free(yycopy);
1095}
1096.Ed
1097.Pp
1098Note that since each
1099.Fn unput
1100puts the given character back at the beginning of the input stream,
1101pushing back strings must be done back-to-front.
1102.Pp
1103An important potential problem when using
1104.Fn unput
1105is that if using
1106.Dq %pointer
1107.Pq the default ,
1108a call to
1109.Fn unput
1110destroys the contents of
1111.Fa yytext ,
1112starting with its rightmost character and devouring one character to
1113the left with each call.
1114If the value of
1115.Fa yytext
1116should be preserved after a call to
1117.Fn unput
1118.Pq as in the above example ,
1119it must either first be copied elsewhere, or the scanner must be built using
1120.Dq %array
1121instead (see
1122.Sx HOW THE INPUT IS MATCHED ) .
1123.Pp
1124Finally, note that EOF cannot be put back
1125to attempt to mark the input stream with an end-of-file.
1126.It input()
1127Reads the next character from the input stream.
1128For example, the following is one way to eat up C comments:
1129.Bd -literal -offset indent
1130%%
1131"/*" {
1132        int c;
1133
1134        for (;;) {
1135                while ((c = input()) != '*' && c != EOF)
1136                        ; /* eat up text of comment */
1137
1138                if (c == '*') {
1139                        while ((c = input()) == '*')
1140                                ;
1141                        if (c == '/')
1142                                break; /* found the end */
1143                }
1144
1145                if (c == EOF) {
1146                        errx(1, "EOF in comment");
1147                        break;
1148                }
1149        }
1150}
1151.Ed
1152.Pp
1153(Note that if the scanner is compiled using C++, then
1154.Fn input
1155is instead referred to as
1156.Fn yyinput ,
1157in order to avoid a name clash with the C++ stream by the name of input.)
1158.It YY_FLUSH_BUFFER
1159Flushes the scanner's internal buffer
1160so that the next time the scanner attempts to match a token,
1161it will first refill the buffer using
1162.Dv YY_INPUT
1163(see
1164.Sx THE GENERATED SCANNER ,
1165below).
1166This action is a special case of the more general
1167.Fn yy_flush_buffer
1168function, described below in the section
1169.Sx MULTIPLE INPUT BUFFERS .
1170.It yyterminate()
1171Can be used in lieu of a return statement in an action.
1172It terminates the scanner and returns a 0 to the scanner's caller, indicating
1173.Qq all done .
1174By default,
1175.Fn yyterminate
1176is also called when an end-of-file is encountered.
1177It is a macro and may be redefined.
1178.El
1179.Sh THE GENERATED SCANNER
1180The output of
1181.Nm
1182is the file
1183.Pa lex.yy.c ,
1184which contains the scanning routine
1185.Fn yylex ,
1186a number of tables used by it for matching tokens,
1187and a number of auxiliary routines and macros.
1188By default,
1189.Fn yylex
1190is declared as follows:
1191.Bd -unfilled -offset indent
1192int yylex()
1193{
1194    ... various definitions and the actions in here ...
1195}
1196.Ed
1197.Pp
1198(If the environment supports function prototypes, then it will
1199be "int yylex(void)".)
1200This definition may be changed by defining the
1201.Dv YY_DECL
1202macro.
1203For example:
1204.Bd -literal -offset indent
1205#define YY_DECL float lexscan(a, b) float a, b;
1206.Ed
1207.Pp
1208would give the scanning routine the name
1209.Em lexscan ,
1210returning a float, and taking two floats as arguments.
1211Note that if arguments are given to the scanning routine using a
1212K&R-style/non-prototyped function declaration,
the definition must be terminated with a semicolon
1214.Pq Sq ;\& .
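.Pp
With a compiler that supports function prototypes, a prototyped
declaration should work equally well and needs no trailing semicolon.
A minimal sketch, in which
.Fn lexscan
and
.Fa verbose
are hypothetical names:
.Bd -literal -offset indent
%{
#define YY_DECL int lexscan(int verbose)
%}
%%
[0-9]+  { if (verbose) printf("number: %s\en", yytext); }
\&.|\en    /* discard everything else */
.Ed
.Pp
The caller then invokes
.Fn lexscan
instead of
.Fn yylex .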
1215.Pp
1216Whenever
1217.Fn yylex
1218is called, it scans tokens from the global input file
1219.Pa yyin
1220.Pq which defaults to stdin .
1221It continues until it either reaches an end-of-file
1222.Pq at which point it returns the value 0
1223or one of its actions executes a
1224.Em return
1225statement.
1226.Pp
If the scanner reaches an end-of-file, the behavior of subsequent calls is undefined
1228unless either
1229.Em yyin
1230is pointed at a new input file
1231.Pq in which case scanning continues from that file ,
1232or
1233.Fn yyrestart
1234is called.
1235.Fn yyrestart
1236takes one argument, a
1237.Fa FILE *
1238pointer (which can be nil, if
1239.Dv YY_INPUT
1240has been set up to scan from a source other than
1241.Em yyin ) ,
1242and initializes
1243.Em yyin
1244for scanning from that file.
1245Essentially there is no difference between just assigning
1246.Em yyin
to a new input file and using
1248.Fn yyrestart
1249to do so; the latter is available for compatibility with previous versions of
1250.Nm ,
1251and because it can be used to switch input files in the middle of scanning.
1252It can also be used to throw away the current input buffer,
1253by calling it with an argument of
1254.Em yyin ;
but it is better to use
1256.Dv YY_FLUSH_BUFFER
1257.Pq see above .
1258Note that
1259.Fn yyrestart
1260does not reset the start condition to
1261.Em INITIAL
1262(see
1263.Sx START CONDITIONS ,
1264below).
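.Pp
As a minimal sketch of switching files with
.Fn yyrestart ,
the following scans each file named on the command line in turn.
It assumes that
.Fn main
lives in the user code section of the
.Nm
input and that
.Dq %option noyywrap
is acceptable for the scanner at hand:
.Bd -literal -offset indent
%option noyywrap
%%
\&...rules...
%%
int
main(int argc, char *argv[])
{
        int i;

        for (i = 1; i < argc; i++) {
                FILE *fp = fopen(argv[i], "r");

                if (fp == NULL)
                        continue;       /* skip unreadable files */
                yyrestart(fp);          /* scan from fp next */
                BEGIN(INITIAL);         /* not reset by yyrestart() */
                while (yylex() != 0)
                        ;
                fclose(fp);
        }
        return 0;
}
.Ed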
1265.Pp
1266If
1267.Fn yylex
1268stops scanning due to executing a
1269.Em return
1270statement in one of the actions, the scanner may then be called again and it
1271will resume scanning where it left off.
1272.Pp
1273By default
1274.Pq and for purposes of efficiency ,
1275the scanner uses block-reads rather than simple
1276.Xr getc 3
1277calls to read characters from
1278.Em yyin .
1279The nature of how it gets its input can be controlled by defining the
1280.Dv YY_INPUT
1281macro.
1282.Dv YY_INPUT Ns 's
1283calling sequence is
1284.Qq YY_INPUT(buf,result,max_size) .
1285Its action is to place up to
1286.Dv max_size
1287characters in the character array
1288.Em buf
1289and return in the integer variable
1290.Em result
1291either the number of characters read or the constant
1292.Dv YY_NULL
1293(0 on
1294.Ux
1295systems)
1296to indicate
1297.Dv EOF .
1298The default
1299.Dv YY_INPUT
1300reads from the global file-pointer
1301.Qq yyin .
1302.Pp
1303A sample definition of
1304.Dv YY_INPUT
1305.Pq in the definitions section of the input file :
1306.Bd -unfilled -offset indent
1307%{
1308#define YY_INPUT(buf,result,max_size) \e
1309{ \e
1310        int c = getchar(); \e
1311        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1312}
1313%}
1314.Ed
1315.Pp
1316This definition will change the input processing to occur
1317one character at a time.
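.Pp
A block-oriented definition is also possible.
The following sketch hands the scanner text from an in-memory string;
the variable
.Va src
and its contents are hypothetical, and the cast of the count to int
assumes that
.Em result
is an integer variable as described above:
.Bd -literal -offset indent
%{
#include <string.h>

/* hypothetical input; advanced as it is consumed */
static const char *src = "text to scan\en";

#define YY_INPUT(buf,result,max_size) \e
{ \e
        size_t n = strlen(src); \e
        if (n > (size_t)(max_size)) \e
                n = (size_t)(max_size); \e
        memcpy((buf), src, n); \e
        src += n; \e
        (result) = (n == 0) ? YY_NULL : (int)n; \e
}
%}
.Ed
.Pp
Each invocation copies at most
.Dv max_size
characters and reports
.Dv YY_NULL
once the string is exhausted.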
1318.Pp
1319When the scanner receives an end-of-file indication from
1320.Dv YY_INPUT ,
1321it then checks the
1322.Fn yywrap
1323function.
1324If
1325.Fn yywrap
1326returns false
1327.Pq zero ,
1328then it is assumed that the function has gone ahead and set up
1329.Em yyin
1330to point to another input file, and scanning continues.
1331If it returns true
1332.Pq non-zero ,
1333then the scanner terminates, returning 0 to its caller.
1334Note that in either case, the start condition remains unchanged;
1335it does not revert to
1336.Em INITIAL .
1337.Pp
1338If you do not supply your own version of
1339.Fn yywrap ,
1340then you must either use
1341.Dq %option noyywrap
1342(in which case the scanner behaves as though
1343.Fn yywrap
1344returned 1), or you must link with
1345.Fl lfl
1346to obtain the default version of the routine, which always returns 1.
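.Pp
A user-supplied
.Fn yywrap
might look like the following sketch, which moves on to the next
readable file in a list.
The variable
.Va filelist
is a hypothetical NULL-terminated array of file names that the
program is assumed to set up before scanning starts:
.Bd -literal -offset indent
%{
/* hypothetical: remaining file names, NULL-terminated */
static char **filelist;
%}
%%
\&...rules...
%%
int
yywrap(void)
{
        if (yyin != NULL && yyin != stdin)
                fclose(yyin);           /* done with the old file */
        while (*filelist != NULL) {
                FILE *fp = fopen(*filelist++, "r");

                if (fp != NULL) {
                        yyin = fp;      /* keep scanning from fp */
                        return 0;
                }
        }
        return 1;                       /* no more input */
}
.Ed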
1347.Pp
1348Three routines are available for scanning from in-memory buffers rather
1349than files:
1350.Fn yy_scan_string ,
1351.Fn yy_scan_bytes ,
1352and
1353.Fn yy_scan_buffer .
1354See the discussion of them below in the section
1355.Sx MULTIPLE INPUT BUFFERS .
1356.Pp
1357The scanner writes its
1358.Em ECHO
1359output to the
1360.Em yyout
1361global
1362.Pq default, stdout ,
1363which may be redefined by the user simply by assigning it to some other
1364.Va FILE
1365pointer.
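.Pp
For instance, the following sketch sends all
.Em ECHO
output to standard error; it assumes that
.Fn main
is placed in the user code section of the
.Nm
input:
.Bd -literal -offset indent
%option noyywrap
%%
\&...rules...
%%
int
main(void)
{
        yyout = stderr;         /* ECHO now writes to stderr */
        while (yylex() != 0)
                ;
        return 0;
}
.Ed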
1366.Sh START CONDITIONS
1367.Nm
1368provides a mechanism for conditionally activating rules.
1369Any rule whose pattern is prefixed with
1370.Qq Aq sc
1371will only be active when the scanner is in the start condition named
1372.Qq sc .
1373For example,
1374.Bd -literal -offset indent
1375<STRING>[^"]* { /* eat up the string body ... */
1376        ...
1377}
1378.Ed
1379.Pp
1380will be active only when the scanner is in the
1381.Qq STRING
1382start condition, and
1383.Bd -literal -offset indent
1384<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1385        ...
1386}
1387.Ed
1388.Pp
will be active only when the current start condition is one of
1390.Qq INITIAL ,
1391.Qq STRING ,
1392or
1393.Qq QUOTE .
1394.Pp
1395Start conditions are declared in the definitions
1396.Pq first
1397section of the input using unindented lines beginning with either
1398.Sq %s
1399or
1400.Sq %x
1401followed by a list of names.
1402The former declares
1403.Em inclusive
1404start conditions, the latter
1405.Em exclusive
1406start conditions.
1407A start condition is activated using the
1408.Em BEGIN
1409action.
1410Until the next
1411.Em BEGIN
1412action is executed, rules with the given start condition will be active and
1413rules with other start conditions will be inactive.
1414If the start condition is inclusive,
1415then rules with no start conditions at all will also be active.
1416If it is exclusive,
1417then only rules qualified with the start condition will be active.
1418A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
1420.Nm
1421input.
1422Because of this, exclusive start conditions make it easy to specify
1423.Qq mini-scanners
1424which scan portions of the input that are syntactically different
1425from the rest
1426.Pq e.g., comments .
1427.Pp
1428If the distinction between inclusive and exclusive start conditions
1429is still a little vague, here's a simple example illustrating the
1430connection between the two.
1431The set of rules:
1432.Bd -literal -offset indent
1433%s example
1434%%
1435
1436<example>foo   do_something();
1437
1438bar            something_else();
1439.Ed
1440.Pp
1441is equivalent to
1442.Bd -literal -offset indent
1443%x example
1444%%
1445
1446<example>foo   do_something();
1447
1448<INITIAL,example>bar    something_else();
1449.Ed
1450.Pp
1451Without the
1452.Aq INITIAL,example
1453qualifier, the
1454.Dq bar
1455pattern in the second example wouldn't be active
1456.Pq i.e., couldn't match
1457when in start condition
1458.Dq example .
1459If we just used
1460.Aq example
1461to qualify
1462.Dq bar ,
1463though, then it would only be active in
1464.Dq example
1465and not in
1466.Em INITIAL ,
1467while in the first example it's active in both,
1468because in the first example the
1469.Dq example
1470start condition is an inclusive
1471.Pq Sq %s
1472start condition.
1473.Pp
1474Also note that the special start-condition specifier
1475.Sq Aq *
1476matches every start condition.
1477Thus, the above example could also have been written:
1478.Bd -literal -offset indent
1479%x example
1480%%
1481
1482<example>foo   do_something();
1483
1484<*>bar         something_else();
1485.Ed
1486.Pp
1487The default rule (to
1488.Em ECHO
1489any unmatched character) remains active in start conditions.
1490It is equivalent to:
1491.Bd -literal -offset indent
1492<*>.|\en     ECHO;
1493.Ed
1494.Pp
1495.Dq BEGIN(0)
1496returns to the original state where only the rules with
1497no start conditions are active.
1498This state can also be referred to as the start-condition
1499.Em INITIAL ,
1500so
1501.Dq BEGIN(INITIAL)
1502is equivalent to
1503.Dq BEGIN(0) .
1504(The parentheses around the start condition name are not required but
1505are considered good style.)
1506.Pp
1507.Em BEGIN
1508actions can also be given as indented code at the beginning
1509of the rules section.
1510For example, the following will cause the scanner to enter the
1511.Qq SPECIAL
1512start condition whenever
1513.Fn yylex
1514is called and the global variable
1515.Fa enter_special
1516is true:
1517.Bd -literal -offset indent
        int enter_special;
1519
1520%x SPECIAL
1521%%
1522        if (enter_special)
1523                BEGIN(SPECIAL);
1524
1525<SPECIAL>blahblahblah
1526\&...more rules follow...
1527.Ed
1528.Pp
1529To illustrate the uses of start conditions,
1530here is a scanner which provides two different interpretations
1531of a string like
1532.Qq 123.456 .
1533By default it will treat it as three tokens: the integer
1534.Qq 123 ,
1535a dot
1536.Pq Sq .\& ,
1537and the integer
1538.Qq 456 .
1539But if the string is preceded earlier in the line by the string
1540.Qq expect-floats
1541it will treat it as a single token, the floating-point number 123.456:
1542.Bd -literal -offset indent
1543%{
#include <stdlib.h>     /* atof(), atoi() */
1545%}
1546%s expect
1547
1548%%
1549expect-floats        BEGIN(expect);
1550
1551<expect>[0-9]+"."[0-9]+ {
1552        printf("found a float, = %f\en",
1553            atof(yytext));
1554}
1555<expect>\en {
1556        /*
1557         * That's the end of the line, so
         * we need another "expect-floats"
1559         * before we'll recognize any more
1560         * numbers.
1561         */
1562        BEGIN(INITIAL);
1563}
1564
1565[0-9]+ {
1566        printf("found an integer, = %d\en",
1567            atoi(yytext));
1568}
1569
1570"."     printf("found a dot\en");
1571.Ed
1572.Pp
1573Here is a scanner which recognizes
1574.Pq and discards
1575C comments while maintaining a count of the current input line:
1576.Bd -literal -offset indent
1577%x comment
1578%%
        int line_num = 1;
1580
1581"/*"                    BEGIN(comment);
1582
1583<comment>[^*\en]*        /* eat anything that's not a '*' */
1584<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1585<comment>\en             ++line_num;
1586<comment>"*"+"/"        BEGIN(INITIAL);
1587.Ed
1588.Pp
1589This scanner goes to a bit of trouble to match as much
1590text as possible with each rule.
1591In general, when attempting to write a high-speed scanner
1592try to match as much as possible in each rule, as it's a big win.
1593.Pp
1594Note that start-condition names are really integer values and
1595can be stored as such.
1596Thus, the above could be extended in the following fashion:
1597.Bd -literal -offset indent
1598%x comment foo
1599%%
        int line_num = 1;
        int comment_caller;
1602
1603"/*" {
1604        comment_caller = INITIAL;
1605        BEGIN(comment);
1606}
1607
1608\&...
1609
1610<foo>"/*" {
1611        comment_caller = foo;
1612        BEGIN(comment);
1613}
1614
1615<comment>[^*\en]*        /* eat anything that's not a '*' */
1616<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
1617<comment>\en             ++line_num;
1618<comment>"*"+"/"        BEGIN(comment_caller);
1619.Ed
1620.Pp
1621Furthermore, the current start condition can be accessed by using
1622the integer-valued
1623.Dv YY_START
1624macro.
1625For example, the above assignments to
1626.Em comment_caller
1627could instead be written
1628.Pp
1629.Dl comment_caller = YY_START;
1630.Pp
.Nm
provides
1632.Dv YYSTATE
1633as an alias for
1634.Dv YY_START
1635(since that is what's used by AT&T
1636.Nm lex ) .
1637.Pp
1638Note that start conditions do not have their own name-space;
1639%s's and %x's declare names in the same fashion as #define's.
1640.Pp
1641Finally, here's an example of how to match C-style quoted strings using
1642exclusive start conditions, including expanded escape sequences
1643(but not including checking for a string that's too long):
1644.Bd -literal -offset indent
1645%x str
1646
1647%%
        #define MAX_STR_CONST 1024
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
1651
1652\e"      string_buf_ptr = string_buf; BEGIN(str);
1653
1654<str>\e" { /* saw closing quote - all done */
1655        BEGIN(INITIAL);
1656        *string_buf_ptr = '\e0';
1657        /*
1658         * return string constant token type and
1659         * value to parser
1660         */
1661}
1662
1663<str>\en {
1664        /* error - unterminated string constant */
1665        /* generate error message */
1666}
1667
<str>\e\e[0-7]{1,3} {
        /* octal escape sequence */
        unsigned int result;    /* %o expects an unsigned int */

        (void) sscanf(yytext + 1, "%o", &result);

        if (result > 0xff) {
                /* error, constant is out-of-bounds */
        } else
                *string_buf_ptr++ = result;
}
1679
1680<str>\e\e[0-9]+ {
1681        /*
1682         * generate error - bad escape sequence; something
1683         * like '\e48' or '\e0777777'
1684         */
1685}
1686
1687<str>\e\en  *string_buf_ptr++ = '\en';
1688<str>\e\et  *string_buf_ptr++ = '\et';
1689<str>\e\er  *string_buf_ptr++ = '\er';
1690<str>\e\eb  *string_buf_ptr++ = '\eb';
1691<str>\e\ef  *string_buf_ptr++ = '\ef';
1692
1693<str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];
1694
1695<str>[^\e\e\en\e"]+ {
1696        char *yptr = yytext;
1697
1698        while (*yptr)
1699                *string_buf_ptr++ = *yptr++;
1700}
1701.Ed
1702.Pp
1703Often, such as in some of the examples above,
1704a whole bunch of rules are all preceded by the same start condition(s).
1705.Nm
1706makes this a little easier and cleaner by introducing a notion of
1707start condition
1708.Em scope .
1709A start condition scope is begun with:
1710.Pp
1711.Dl <SCs>{
1712.Pp
1713where
1714.Dq SCs
1715is a list of one or more start conditions.
1716Inside the start condition scope, every rule automatically has the prefix
1717.Aq SCs
1718applied to it, until a
1719.Sq }
1720which matches the initial
1721.Sq { .
1722So, for example,
1723.Bd -literal -offset indent
1724<ESC>{
1725    "\e\en"   return '\en';
1726    "\e\er"   return '\er';
1727    "\e\ef"   return '\ef';
1728    "\e\e0"   return '\e0';
1729}
1730.Ed
1731.Pp
1732is equivalent to:
1733.Bd -literal -offset indent
1734<ESC>"\e\en"  return '\en';
1735<ESC>"\e\er"  return '\er';
1736<ESC>"\e\ef"  return '\ef';
1737<ESC>"\e\e0"  return '\e0';
1738.Ed
1739.Pp
1740Start condition scopes may be nested.
1741.Pp
1742Three routines are available for manipulating stacks of start conditions:
1743.Bl -tag -width Ds
1744.It void yy_push_state(int new_state)
1745Pushes the current start condition onto the top of the start condition
1746stack and switches to
1747.Fa new_state
1748as though
1749.Dq BEGIN new_state
1750had been used
1751.Pq recall that start condition names are also integers .
1752.It void yy_pop_state()
1753Pops the top of the stack and switches to it via
1754.Em BEGIN .
1755.It int yy_top_state()
1756Returns the top of the stack without altering the stack's contents.
1757.El
1758.Pp
1759The start condition stack grows dynamically and so has no built-in
1760size limitation.
1761If memory is exhausted, program execution aborts.
1762.Pp
1763To use start condition stacks, scanners must include a
1764.Dq %option stack
1765directive (see
1766.Sx OPTIONS
1767below).
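.Pp
As a sketch of how the stack can replace the
.Em comment_caller
variable used in the earlier comment example,
the start condition in effect when the comment began can be pushed
on entry and restored with
.Fn yy_pop_state
on exit
(the start condition
.Dq foo
is carried over from that example and is otherwise arbitrary):
.Bd -literal -offset indent
%option stack
%x comment foo
%%
<INITIAL,foo>"/*"       yy_push_state(comment);

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             /* newline inside a comment */
<comment>"*"+"/"        yy_pop_state();
.Ed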
1768.Sh MULTIPLE INPUT BUFFERS
1769Some scanners
1770(such as those which support
1771.Qq include
1772files)
1773require reading from several input streams.
1774As
1775.Nm
1776scanners do a large amount of buffering, one cannot control
1777where the next input will be read from by simply writing a
1778.Dv YY_INPUT
1779which is sensitive to the scanning context.
1780.Dv YY_INPUT
1781is only called when the scanner reaches the end of its buffer, which
1782may be a long time after scanning a statement such as an
1783.Qq include
1784which requires switching the input source.
1785.Pp
1786To negotiate these sorts of problems,
1787.Nm
1788provides a mechanism for creating and switching between multiple
1789input buffers.
1790An input buffer is created by using:
1791.Pp
1792.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1793.Pp
1794which takes a
1795.Fa FILE
1796pointer and a
1797.Fa size
1798and creates a buffer associated with the given file and large enough to hold
1799.Fa size
1800characters (when in doubt, use
1801.Dv YY_BUF_SIZE
1802for the size).
1803It returns a
1804.Dv YY_BUFFER_STATE
1805handle, which may then be passed to other routines
1806.Pq see below .
1807The
1808.Dv YY_BUFFER_STATE
1809type is a pointer to an opaque
1810.Dq struct yy_buffer_state
1811structure, so
1812.Dv YY_BUFFER_STATE
1813variables may be safely initialized to
1814.Dq ((YY_BUFFER_STATE) 0)
1815if desired, and the opaque structure can also be referred to in order to
1816correctly declare input buffers in source files other than that of scanners.
1817Note that the
1818.Fa FILE
1819pointer in the call to
1820.Fn yy_create_buffer
1821is only used as the value of
1822.Fa yyin
1823seen by
1824.Dv YY_INPUT ;
1825if
1826.Dv YY_INPUT
1827is redefined so that it no longer uses
1828.Fa yyin ,
1829then a nil
1830.Fa FILE
1831pointer can safely be passed to
1832.Fn yy_create_buffer .
1833To select a particular buffer to scan:
1834.Pp
1835.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1836.Pp
1837It switches the scanner's input buffer so subsequent tokens will
1838come from
1839.Fa new_buffer .
1840Note that
1841.Fn yy_switch_to_buffer
1842may be used by
1843.Fn yywrap
1844to set things up for continued scanning,
1845instead of opening a new file and pointing
1846.Fa yyin
1847at it.
1848Note also that switching input sources via either
1849.Fn yy_switch_to_buffer
1850or
1851.Fn yywrap
1852does not change the start condition.
1853.Pp
1854.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1855.Pp
1856is used to reclaim the storage associated with a buffer.
1857.Pf ( Fa buffer
1858can be nil, in which case the routine does nothing.)
1859To clear the current contents of a buffer:
1860.Pp
1861.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1862.Pp
1863This function discards the buffer's contents,
1864so the next time the scanner attempts to match a token from the buffer,
1865it will first fill the buffer anew using
1866.Dv YY_INPUT .
1867.Pp
1868.Fn yy_new_buffer
1869is an alias for
1870.Fn yy_create_buffer ,
1871provided for compatibility with the C++ use of
1872.Em new
1873and
1874.Em delete
1875for creating and destroying dynamic objects.
1876.Pp
1877Finally, the
1878.Dv YY_CURRENT_BUFFER
1879macro returns a
1880.Dv YY_BUFFER_STATE
1881handle to the current buffer.
1882.Pp
1883Here is an example of using these features for writing a scanner
1884which expands include files (the
1885.Aq Aq EOF
1886feature is discussed below):
1887.Bd -literal -offset indent
1888/*
1889 * the "incl" state is used for picking up the name
1890 * of an include file
1891 */
1892%x incl
1893
%{
#include <err.h>        /* err(), errx() */

#define MAX_INCLUDE_DEPTH 10
1896YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1897int include_stack_ptr = 0;
1898%}
1899
1900%%
1901include             BEGIN(incl);
1902
1903[a-z]+              ECHO;
1904[^a-z\en]*\en?        ECHO;
1905
1906<incl>[ \et]*        /* eat the whitespace */
1907<incl>[^ \et\en]+ {   /* got the include file name */
1908        if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1909                errx(1, "Includes nested too deeply");
1910
1911        include_stack[include_stack_ptr++] =
1912            YY_CURRENT_BUFFER;
1913
1914        yyin = fopen(yytext, "r");
1915
1916        if (yyin == NULL)
1917                err(1, NULL);
1918
1919        yy_switch_to_buffer(
1920            yy_create_buffer(yyin, YY_BUF_SIZE));
1921
1922        BEGIN(INITIAL);
1923}
1924
1925<<EOF>> {
1926        if (--include_stack_ptr < 0)
1927                yyterminate();
1928        else {
1929                yy_delete_buffer(YY_CURRENT_BUFFER);
1930                yy_switch_to_buffer(
1931                    include_stack[include_stack_ptr]);
        }
1933}
1934.Ed
1935.Pp
1936Three routines are available for setting up input buffers for
1937scanning in-memory strings instead of files.
1938All of them create a new input buffer for scanning the string,
1939and return a corresponding
1940.Dv YY_BUFFER_STATE
1941handle (which should be deleted afterwards using
1942.Fn yy_delete_buffer ) .
1943They also switch to the new buffer using
1944.Fn yy_switch_to_buffer ,
1945so the next call to
1946.Fn yylex
1947will start scanning the string.
1948.Bl -tag -width Ds
1949.It yy_scan_string(const char *str)
1950Scans a NUL-terminated string.
1951.It yy_scan_bytes(const char *bytes, int len)
1952Scans
1953.Fa len
1954bytes
.Pq possibly including NULs
1956starting at location
1957.Fa bytes .
1958.El
1959.Pp
1960Note that both of these functions create and scan a copy
1961of the string or bytes.
1962(This may be desirable, since
1963.Fn yylex
1964modifies the contents of the buffer it is scanning.)
1965The copy can be avoided by using:
1966.Bl -tag -width Ds
1967.It yy_scan_buffer(char *base, yy_size_t size)
Scans in place the buffer starting at
.Fa base ,
consisting of
.Fa size
bytes, the last two bytes of which must be
.Dv YY_END_OF_BUFFER_CHAR
.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-3], inclusive.
1977.Pp
1978If
1979.Fa base
1980is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
1984.Fn yy_scan_buffer
1985returns a nil pointer instead of creating a new input buffer.
1986.Pp
1987The type
1988.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
1991.El
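.Pp
The following sketch exercises both approaches.
It assumes that
.Fn main
is placed in the user code section of the
.Nm
input; the rules and the input strings are made up for illustration:
.Bd -literal -offset indent
%option noyywrap
%%
[0-9]+  printf("number: %s\en", yytext);
\&.|\en    /* discard everything else */
%%
int
main(void)
{
        /* "\e0" plus the implicit terminator: two trailing NULs */
        char buf[] = "67 89\e0";
        YY_BUFFER_STATE b;

        b = yy_scan_string("12 abc 345");       /* scans a copy */
        while (yylex() != 0)
                ;
        yy_delete_buffer(b);

        /* scans buf itself; no copy is made */
        b = yy_scan_buffer(buf, (yy_size_t)sizeof(buf));
        if (b == NULL)
                return 1;       /* not terminated as required */
        while (yylex() != 0)
                ;
        yy_delete_buffer(b);
        return 0;
}
.Ed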
1992.Sh END-OF-FILE RULES
1993The special rule
1994.Qq Aq Aq EOF
1995indicates actions which are to be taken when an end-of-file is encountered and
1996.Fn yywrap
1997returns non-zero
1998.Pq i.e., indicates no further files to process .
1999The action must finish by doing one of four things:
2000.Bl -dash
2001.It
2002Assigning
2003.Em yyin
2004to a new input file
2005(in previous versions of
2006.Nm ,
2007after doing the assignment, it was necessary to call the special action
2008.Dv YY_NEW_FILE ;
2009this is no longer necessary).
2010.It
2011Executing a
2012.Em return
2013statement.
2014.It
2015Executing the special
2016.Fn yyterminate
2017action.
2018.It
2019Switching to a new buffer using
2020.Fn yy_switch_to_buffer
2021as shown in the example above.
2022.El
2023.Pp
2024.Aq Aq EOF
2025rules may not be used with other patterns;
2026they may only be qualified with a list of start conditions.
2027If an unqualified
2028.Aq Aq EOF
2029rule is given, it applies to all start conditions which do not already have
2030.Aq Aq EOF
2031actions.
2032To specify an
2033.Aq Aq EOF
2034rule for only the initial start condition, use
2035.Pp
2036.Dl <INITIAL><<EOF>>
2037.Pp
2038These rules are useful for catching things like unclosed comments.
2039An example:
2040.Bd -literal -offset indent
2041%x quote
2042%%
2043
2044\&...other rules for dealing with quotes...
2045
2046<quote><<EOF>> {
2047         error("unterminated quote");
2048         yyterminate();
2049}
2050<<EOF>> {
2051         if (*++filelist)
2052                 yyin = fopen(*filelist, "r");
2053         else
2054                 yyterminate();
2055}
2056.Ed
2057.Sh MISCELLANEOUS MACROS
2058The macro
2059.Dv YY_USER_ACTION
2060can be defined to provide an action
2061which is always executed prior to the matched rule's action.
2062For example,
2063it could be #define'd to call a routine to convert yytext to lower-case.
2064When
2065.Dv YY_USER_ACTION
2066is invoked, the variable
2067.Fa yy_act
2068gives the number of the matched rule
2069.Pq rules are numbered starting with 1 .
2070For example, to profile how often each rule is matched,
2071the following would do the trick:
2072.Pp
.Dl #define YY_USER_ACTION ++ctr[yy_act];
2074.Pp
2075where
2076.Fa ctr
2077is an array to hold the counts for the different rules.
2078Note that the macro
2079.Dv YY_NUM_RULES
2080gives the total number of rules
2081(including the default rule, even if
2082.Fl s
2083is used),
2084so a correct declaration for
2085.Fa ctr
2086is:
2087.Pp
2088.Dl int ctr[YY_NUM_RULES];
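.Pp
Put together, a profiling scanner might be sketched as follows.
The three rules are placeholders, the macro ends in a semicolon so
that it expands to a complete statement, and the array is given one
extra element purely as a defensive assumption, since rule numbers
start at 1:
.Bd -literal -offset indent
%option noyywrap
%{
/* one spare slot: a defensive choice, see above */
static int ctr[YY_NUM_RULES + 1];
#define YY_USER_ACTION ++ctr[yy_act];
%}
%%
[a-z]+  /* rule 1 */
[0-9]+  /* rule 2 */
\&.|\en    /* rule 3 */
%%
int
main(void)
{
        int i;

        while (yylex() != 0)
                ;
        for (i = 1; i <= YY_NUM_RULES; i++)
                printf("rule %d: %d matches\en", i, ctr[i]);
        return 0;
}
.Ed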
2089.Pp
2090The macro
2091.Dv YY_USER_INIT
2092may be defined to provide an action which is always executed before
2093the first scan
2094.Pq and before the scanner's internal initializations are done .
2095For example, it could be used to call a routine to read
2096in a data table or open a logging file.
2097.Pp
2098The macro
2099.Dv yy_set_interactive(is_interactive)
2100can be used to control whether the current buffer is considered
2101.Em interactive .
2102An interactive buffer is processed more slowly,
2103but must be used when the scanner's input source is indeed
2104interactive to avoid problems due to waiting to fill buffers
2105(see the discussion of the
2106.Fl I
2107flag below).
2108A non-zero value in the macro invocation marks the buffer as interactive,
2109a zero value as non-interactive.
2110Note that use of this macro overrides
2111.Dq %option always-interactive
2112or
2113.Dq %option never-interactive
2114(see
2115.Sx OPTIONS
2116below).
2117.Fn yy_set_interactive
2118must be invoked prior to beginning to scan the buffer that is
2119.Pq or is not
2120to be considered interactive.
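.Pp
A minimal sketch, assuming a scanner that reads from a pipe but
should nevertheless respond after each line, and a
.Fn main
placed in the user code section of the
.Nm
input:
.Bd -literal -offset indent
%option noyywrap
%%
\&...rules...
%%
int
main(void)
{
        /* set yyin first: the macro may create the buffer from it */
        yyin = stdin;
        yy_set_interactive(1);  /* before any input is scanned */
        while (yylex() != 0)
                ;
        return 0;
}
.Ed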
2121.Pp
2122The macro
2123.Dv yy_set_bol(at_bol)
2124can be used to control whether the current buffer's scanning
2125context for the next token match is done as though at the
2126beginning of a line.
2127A non-zero macro argument makes rules anchored with
2128.Sq ^
2129active, while a zero argument makes
2130.Sq ^
2131rules inactive.
2132.Pp
2133The macro
2134.Dv YY_AT_BOL
2135returns true if the next token scanned from the current buffer will have
2136.Sq ^
2137rules active, false otherwise.
2138.Pp
2139In the generated scanner, the actions are all gathered in one large
2140switch statement and separated using
2141.Dv YY_BREAK ,
2142which may be redefined.
2143By default, it is simply a
2144.Qq break ,
2145to separate each rule's action from the following rules.
2146Redefining
2147.Dv YY_BREAK
2148allows, for example, C++ users to
2149.Dq #define YY_BREAK
2150to do nothing
2151(while being very careful that every rule ends with a
2152.Qq break
2153or a
2154.Qq return ! )
to avoid unreachable-statement warnings where, because a rule's
2156action ends with
2157.Dq return ,
2158the
2159.Dv YY_BREAK
2160is inaccessible.
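.Pp
A sketch of a scanner written in that style follows.
The returned token value 1 is a placeholder, and the rules are
chosen to cover every input character so that the default rule,
whose action would no longer be followed by a
.Qq break ,
is never reached:
.Bd -literal -offset indent
%{
#define YY_BREAK        /* empty: each action ends itself */
%}
%%
[0-9]+          return 1;
[ \et\en]+        break;
\&.               break;
.Ed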
2161.Sh VALUES AVAILABLE TO THE USER
2162This section summarizes the various values available to the user
2163in the rule actions.
2164.Bl -tag -width Ds
2165.It char *yytext
2166Holds the text of the current token.
2167It may be modified but not lengthened
2168.Pq characters cannot be appended to the end .
2169.Pp
2170If the special directive
2171.Dq %array
2172appears in the first section of the scanner description, then
2173.Fa yytext
2174is instead declared
2175.Dq char yytext[YYLMAX] ,
2176where
2177.Dv YYLMAX
2178is a macro definition that can be redefined in the first section
2179to change the default value
2180.Pq generally 8KB .
2181Using
2182.Dq %array
2183results in somewhat slower scanners, but the value of
2184.Fa yytext
2185becomes immune to calls to
2186.Fn input
2187and
2188.Fn unput ,
2189which potentially destroy its value when
2190.Fa yytext
2191is a character pointer.
2192The opposite of
2193.Dq %array
2194is
2195.Dq %pointer ,
2196which is the default.
2197.Pp
2198.Dq %array
2199cannot be used when generating C++ scanner classes
2200(the
2201.Fl +
2202flag).
2203.It int yyleng
2204Holds the length of the current token.
2205.It FILE *yyin
2206Is the file which by default
2207.Nm
2208reads from.
2209It may be redefined, but doing so only makes sense before
2210scanning begins or after an
2211.Dv EOF
2212has been encountered.
2213Changing it in the midst of scanning will have unexpected results since
2214.Nm
2215buffers its input; use
2216.Fn yyrestart
2217instead.
2218Once scanning terminates because an end-of-file
2219has been seen,
2220.Fa yyin
2221can be assigned as the new input file
2222and the scanner can be called again to continue scanning.
2223.It void yyrestart(FILE *new_file)
2224May be called to point
2225.Fa yyin
2226at the new input file.
2227The switch-over to the new file is immediate
2228.Pq any previously buffered-up input is lost .
2229Note that calling
2230.Fn yyrestart
2231with
2232.Fa yyin
2233as an argument thus throws away the current input buffer and continues
2234scanning the same input file.
2235.It FILE *yyout
2236Is the file to which
2237.Em ECHO
2238actions are done.
2239It can be reassigned by the user.
2240.It YY_CURRENT_BUFFER
2241Returns a
2242.Dv YY_BUFFER_STATE
2243handle to the current buffer.
2244.It YY_START
2245Returns an integer value corresponding to the current start condition.
2246This value can subsequently be used with
2247.Em BEGIN
2248to return to that start condition.
2249.El
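.Pp
As a small illustration of
.Fa yytext
and
.Fa yyleng
in an action, the following sketch prints each word together with
its length (the cast is a precaution in case
.Fa yyleng
is not a plain int):
.Bd -literal -offset indent
%%
[a-zA-Z]+       printf("%s: %d characters\en", yytext, (int)yyleng);
\&.|\en            /* ignore everything else */
.Ed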
2250.Sh INTERFACING WITH YACC
2251One of the main uses of
2252.Nm
2253is as a companion to the
2254.Xr yacc 1
2255parser-generator.
2256yacc parsers expect to call a routine named
2257.Fn yylex
2258to find the next input token.
2259The routine is supposed to return the type of the next token
2260as well as putting any associated value in the global
2261.Fa yylval ,
2262which is defined externally,
2263and can be a union or any other complex data structure.
2264To use
2265.Nm
2266with yacc, one specifies the
2267.Fl d
2268option to yacc to instruct it to generate the file
2269.Pa y.tab.h
2270containing definitions of all the
2271.Dq %tokens
2272appearing in the yacc input.
2273This file is then included in the
2274.Nm
2275scanner.
2276For example, if one of the tokens is
2277.Qq TOK_NUMBER ,
2278part of the scanner might look like:
2279.Bd -literal -offset indent
2280%{
2281#include "y.tab.h"
2282%}
2283
2284%%
2285
2286[0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
2287.Ed
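.Pp
When
.Fa yylval
is a union, the action stores into the appropriate member.
The following sketch assumes a yacc grammar that declares
.Dq %union { int ival; char *sval; }
and the tokens
.Dv TOK_NUMBER
and
.Dv TOK_NAME ;
the member names and
.Dv TOK_NAME
are hypothetical:
.Bd -literal -offset indent
%{
#include <stdlib.h>
#include <string.h>
#include "y.tab.h"
%}

%%

[0-9]+     { yylval.ival = atoi(yytext); return TOK_NUMBER; }
[a-z]+     { yylval.sval = strdup(yytext); return TOK_NAME; }
[ \et\en]+   /* skip whitespace */
\&.          return yytext[0];
.Ed
.Pp
The name is copied with
.Xr strdup 3
because the contents of
.Fa yytext
do not survive the next match.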
2288.Sh OPTIONS
2289.Nm
2290has the following options:
2291.Bl -tag -width Ds
2292.It Fl 7
2293Instructs
2294.Nm
2295to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2296characters in its input.
2297The advantage of using
2298.Fl 7
2299is that the scanner's tables can be up to half the size of those generated
2300using the
2301.Fl 8
2302option
2303.Pq see below .
2304The disadvantage is that such scanners often hang
2305or crash if their input contains an 8-bit character.
2306.Pp
2307Note, however, that unless generating a scanner using the
2308.Fl Cf
2309or
2310.Fl CF
2311table compression options, use of
2312.Fl 7
2313will save only a small amount of table space,
2314and make the scanner considerably less portable.
2315.Nm flex Ns 's
2316default behavior is to generate an 8-bit scanner unless
2317.Fl Cf
2318or
2319.Fl CF
2320is specified, in which case
2321.Nm
2322defaults to generating 7-bit scanners unless it was
2323configured to generate 8-bit scanners
2324(as will often be the case with non-USA sites).
It is possible to tell whether
2326.Nm
2327generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2328.Fl v
2329output as described below.
2330.Pp
2331Note that if
2332.Fl Cfe
2333or
2334.Fl CFe
is used
2336(the table compression options, but also using equivalence classes as
2337discussed below),
2338.Nm
2339still defaults to generating an 8-bit scanner,
2340since usually with these compression options full 8-bit tables
2341are not much more expensive than 7-bit tables.
2342.It Fl 8
2343Instructs
2344.Nm
2345to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2346characters.
2347This flag is only needed for scanners generated using
2348.Fl Cf
2349or
2350.Fl CF ,
2351as otherwise
2352.Nm
2353defaults to generating an 8-bit scanner anyway.
2354.Pp
2355See the discussion of
2356.Fl 7
2357above for
2358.Nm flex Ns 's
2359default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2360.It Fl B
2361Instructs
2362.Nm
2363to generate a
2364.Em batch
2365scanner, the opposite of
2366.Em interactive
2367scanners generated by
2368.Fl I
2369.Pq see below .
2370In general,
2371.Fl B
2372is used when the scanner will never be used interactively,
2373and you want to squeeze a little more performance out of it.
2374If the aim is instead to squeeze out a lot more performance,
2375use the
2376.Fl Cf
2377or
2378.Fl CF
2379options
2380.Pq discussed below ,
2381which turn on
2382.Fl B
2383automatically anyway.
2384.It Fl b
2385Generate backing-up information to
2386.Pa lex.backup .
2387This is a list of scanner states which require backing up
2388and the input characters on which they do so.
2389By adding rules one can remove backing-up states.
2390If all backing-up states are eliminated and
2391.Fl Cf
2392or
2393.Fl CF
2394is used, the generated scanner will run faster (see the
2395.Fl p
2396flag).
2397Only users who wish to squeeze every last cycle out of their
2398scanners need worry about this option.
2399(See the section on
2400.Sx PERFORMANCE CONSIDERATIONS
2401below.)
2402.It Fl C Ns Op Cm aeFfmr
2403Controls the degree of table compression and, more generally, trade-offs
2404between small scanners and fast scanners.
2405.Bl -tag -width Ds
2406.It Fl Ca
2407Instructs
2408.Nm
2409to trade off larger tables in the generated scanner for faster performance
2410because the elements of the tables are better aligned for memory access
2411and computation.
2412On some
2413.Tn RISC
2414architectures, fetching and manipulating longwords is more efficient
2415than with smaller-sized units such as shortwords.
2416This option can double the size of the tables used by the scanner.
2417.It Fl Ce
2418Directs
2419.Nm
2420to construct
2421.Em equivalence classes ,
2422i.e., sets of characters which have identical lexical properties
2423(for example, if the only appearance of digits in the
2424.Nm
2425input is in the character class
2426.Qq [0-9]
2427then the digits
2428.Sq 0 ,
2429.Sq 1 ,
2430.Sq ... ,
2431.Sq 9
2432will all be put in the same equivalence class).
2433Equivalence classes usually give dramatic reductions in the final
2434table/object file sizes
2435.Pq typically a factor of 2\-5
2436and are pretty cheap performance-wise
2437.Pq one array look-up per character scanned .
2438.It Fl CF
2439Specifies that the alternate fast scanner representation
2440(described below under the
2441.Fl F
2442option)
2443should be used.
2444This option cannot be used with
2445.Fl + .
2446.It Fl Cf
2447Specifies that the
2448.Em full
2449scanner tables should be generated \-
2450.Nm
2451should not compress the tables by taking advantage of
2452similar transition functions for different states.
2453.It Fl \&Cm
2454Directs
2455.Nm
2456to construct
2457.Em meta-equivalence classes ,
2458which are sets of equivalence classes
2459(or characters, if equivalence classes are not being used)
2460that are commonly used together.
2461Meta-equivalence classes are often a big win when using compressed tables,
2462but they have a moderate performance impact
2463(one or two
2464.Qq if
2465tests and one array look-up per character scanned).
2466.It Fl Cr
2467Causes the generated scanner to
2468.Em bypass
2469use of the standard I/O library
2470.Pq stdio
2471for input.
2472Instead of calling
2473.Xr fread 3
2474or
2475.Xr getc 3 ,
2476the scanner will use the
2477.Xr read 2
2478system call,
2479resulting in a performance gain which varies from system to system,
2480but in general is probably negligible unless
2481.Fl Cf
2482or
2483.Fl CF
2484are being used.
2485Using
2486.Fl Cr
can cause strange behavior if, for example, the program reads from
.Fa yyin
using stdio before calling the scanner
2490(because the scanner will miss whatever text previous reads left
2491in the stdio input buffer).
2492.Pp
2493.Fl Cr
2494has no effect if
2495.Dv YY_INPUT
2496is defined
2497(see
2498.Sx THE GENERATED SCANNER
2499above).
2500.El
2501.Pp
2502A lone
2503.Fl C
2504specifies that the scanner tables should be compressed but neither
2505equivalence classes nor meta-equivalence classes should be used.
2506.Pp
2507The options
2508.Fl Cf
2509or
2510.Fl CF
2511and
2512.Fl \&Cm
2513do not make sense together \- there is no opportunity for meta-equivalence
2514classes if the table is not being compressed.
2515Otherwise the options may be freely mixed, and are cumulative.
2516.Pp
2517The default setting is
2518.Fl Cem
2519which specifies that
2520.Nm
2521should generate equivalence classes and meta-equivalence classes.
2522This setting provides the highest degree of table compression.
Faster-executing scanners can be obtained at the cost of
larger tables, with the following generally being true:
2525.Bd -unfilled -offset indent
2526slowest & smallest
2527      -Cem
2528      -Cm
2529      -Ce
2530      -C
2531      -C{f,F}e
2532      -C{f,F}
2533      -C{f,F}a
2534fastest & largest
2535.Ed
2536.Pp
Note that scanners with the smallest tables are usually generated and
compiled the quickest,
so during development the default, maximal compression,
is usually best.
2541.Pp
2542.Fl Cfe
2543is often a good compromise between speed and size for production scanners.
2544.It Fl d
2545Makes the generated scanner run in debug mode.
2546Whenever a pattern is recognized and the global
2547.Fa yy_flex_debug
2548is non-zero
2549.Pq which is the default ,
2550the scanner will write to stderr a line of the form:
2551.Pp
2552.D1 --accepting rule at line 53 ("the matched text")
2553.Pp
2554The line number refers to the location of the rule in the file
2555defining the scanner
2556(i.e., the file that was fed to
2557.Nm ) .
2558Messages are also generated when the scanner backs up,
2559accepts the default rule,
2560reaches the end of its input buffer
2561(or encounters a NUL;
2562at this point, the two look the same as far as the scanner's concerned),
2563or reaches an end-of-file.
2564.It Fl F
2565Specifies that the fast scanner table representation should be used
2566.Pq and stdio bypassed .
2567This representation is about as fast as the full table representation
2568.Pq Fl f ,
2569and for some sets of patterns will be considerably smaller
2570.Pq and for others, larger .
2571In general, if the pattern set contains both
2572.Qq keywords
2573and a catch-all,
2574.Qq identifier
2575rule, such as in the set:
2576.Bd -unfilled -offset indent
2577"case"    return TOK_CASE;
2578"switch"  return TOK_SWITCH;
2579\&...
2580"default" return TOK_DEFAULT;
2581[a-z]+    return TOK_ID;
2582.Ed
2583.Pp
2584then it's better to use the full table representation.
2585If only the
2586.Qq identifier
2587rule is present and a hash table or some such is used to detect the keywords,
2588it's better to use
2589.Fl F .
2590.Pp
2591This option is equivalent to
2592.Fl CFr
2593.Pq see above .
2594It cannot be used with
2595.Fl + .
2596.It Fl f
2597Specifies
2598.Em fast scanner .
2599No table compression is done and stdio is bypassed.
2600The result is large but fast.
2601This option is equivalent to
2602.Fl Cfr
2603.Pq see above .
2604.It Fl h
2605Generates a help summary of
2606.Nm flex Ns 's
2607options to stdout and then exits.
2608.Fl ?\&
2609and
2610.Fl Fl help
2611are synonyms for
2612.Fl h .
2613.It Fl I
2614Instructs
2615.Nm
2616to generate an
2617.Em interactive
2618scanner.
2619An interactive scanner is one that only looks ahead to decide
2620what token has been matched if it absolutely must.
2621It turns out that always looking one extra character ahead,
2622even if the scanner has already seen enough text
2623to disambiguate the current token, is a bit faster than
2624only looking ahead when necessary.
2625But scanners that always look ahead give dreadful interactive performance;
2626for example, when a user types a newline,
2627it is not recognized as a newline token until they enter
2628.Em another
2629token, which often means typing in another whole line.
2630.Pp
2631.Nm
2632scanners default to
2633.Em interactive
2634unless
2635.Fl Cf
2636or
2637.Fl CF
2638table-compression options are specified
2639.Pq see above .
That's because if high performance is most important,
2641one of these options should be used,
2642so if they weren't,
2643.Nm
2644assumes it is preferable to trade off a bit of run-time performance for
2645intuitive interactive behavior.
2646Note also that
2647.Fl I
2648cannot be used in conjunction with
2649.Fl Cf
2650or
2651.Fl CF .
2652Thus, this option is not really needed; it is on by default for all those
2653cases in which it is allowed.
2654.Pp
2655A scanner can be forced to not be interactive by using
2656.Fl B
2657.Pq see above .
2658.It Fl i
2659Instructs
2660.Nm
2661to generate a case-insensitive scanner.
2662The case of letters given in the
2663.Nm
2664input patterns will be ignored,
2665and tokens in the input will be matched regardless of case.
2666The matched text given in
2667.Fa yytext
will retain its original case
2669.Pq i.e., it will not be folded .
2670.It Fl L
2671Instructs
2672.Nm
2673not to generate
2674.Dq #line
2675directives.
2676Without this option,
2677.Nm
2678peppers the generated scanner with #line directives so error messages
2679in the actions will be correctly located with respect to either the original
2680.Nm
2681input file
2682(if the errors are due to code in the input file),
2683or
2684.Pa lex.yy.c
2685(if the errors are
2686.Nm flex Ns 's
2687fault \- these sorts of errors should be reported to the email address
2688given below).
2689.It Fl l
2690Turns on maximum compatibility with the original AT&T
2691.Nm lex
2692implementation.
2693Note that this does not mean full compatibility.
2694Use of this option costs a considerable amount of performance,
2695and it cannot be used with the
2696.Fl + , f , F , Cf ,
2697or
2698.Fl CF
2699options.
2700For details on the compatibilities it provides, see the section
2701.Sx INCOMPATIBILITIES WITH LEX AND POSIX
2702below.
2703This option also results in the name
2704.Dv YY_FLEX_LEX_COMPAT
2705being #define'd in the generated scanner.
2706.It Fl n
A do-nothing, deprecated option included only for
2708.Tn POSIX
2709compliance.
2710.It Fl o Ns Ar output
2711Directs
2712.Nm
2713to write the scanner to the file
2714.Ar output
2715instead of
2716.Pa lex.yy.c .
2717If
2718.Fl o
2719is combined with the
2720.Fl t
2721option, then the scanner is written to stdout but its
2722.Dq #line
2723directives
2724(see the
2725.Fl L
2726option above)
2727refer to the file
2728.Ar output .
2729.It Fl P Ns Ar prefix
2730Changes the default
2731.Qq yy
2732prefix used by
2733.Nm
2734for all globally visible variable and function names to instead be
2735.Ar prefix .
2736For example,
2737.Fl P Ns Ar foo
2738changes the name of
2739.Fa yytext
2740to
2741.Fa footext .
2742It also changes the name of the default output file from
2743.Pa lex.yy.c
2744to
2745.Pa lex.foo.c .
2746Here are all of the names affected:
2747.Bd -unfilled -offset indent
2748yy_create_buffer
2749yy_delete_buffer
2750yy_flex_debug
2751yy_init_buffer
2752yy_flush_buffer
2753yy_load_buffer_state
2754yy_switch_to_buffer
2755yyin
2756yyleng
2757yylex
2758yylineno
2759yyout
2760yyrestart
2761yytext
2762yywrap
2763.Ed
2764.Pp
2765(If using a C++ scanner, then only
2766.Fa yywrap
2767and
2768.Fa yyFlexLexer
2769are affected.)
2770Within the scanner itself, it is still possible to refer to the global variables
2771and functions using either version of their name; but externally, they
2772have the modified name.
2773.Pp
2774This option allows multiple
2775.Nm
2776programs to be easily linked together into the same executable.
2777Note, though, that using this option also renames
2778.Fn yywrap ,
2779so now either an
2780.Pq appropriately named
2781version of the routine for the scanner must be supplied, or
2782.Dq %option noyywrap
2783must be used, as linking with
2784.Fl lfl
2785no longer provides one by default.
2786.It Fl p
2787Generates a performance report to stderr.
2788The report consists of comments regarding features of the
2789.Nm
2790input file which will cause a serious loss of performance in the resulting
2791scanner.
2792If the flag is specified twice,
2793comments regarding features that lead to minor performance losses
will also be reported.
2795.Pp
2796Note that the use of
2797.Em REJECT ,
2798.Dq %option yylineno ,
2799and variable trailing context
2800(see the
2801.Sx BUGS
2802section below)
2803entails a substantial performance penalty; use of
2804.Fn yymore ,
2805the
2806.Sq ^
2807operator, and the
2808.Fl I
2809flag entail minor performance penalties.
2810.It Fl S Ns Ar skeleton
2811Overrides the default skeleton file from which
2812.Nm
2813constructs its scanners.
2814This option is needed only for
2815.Nm
2816maintenance or development.
2817.It Fl s
2818Causes the default rule
2819.Pq that unmatched scanner input is echoed to stdout
2820to be suppressed.
2821If the scanner encounters input that does not
2822match any of its rules, it aborts with an error.
2823This option is useful for finding holes in a scanner's rule set.
2824.It Fl T
2825Makes
2826.Nm
2827run in
2828.Em trace
2829mode.
2830It will generate a lot of messages to stderr concerning
2831the form of the input and the resultant non-deterministic and deterministic
2832finite automata.
2833This option is mostly for use in maintaining
2834.Nm .
2835.It Fl t
2836Instructs
2837.Nm
2838to write the scanner it generates to standard output instead of
2839.Pa lex.yy.c .
2840.It Fl V
2841Prints the version number to stdout and exits.
2842.Fl Fl version
2843is a synonym for
2844.Fl V .
2845.It Fl v
2846Specifies that
2847.Nm
2848should write to stderr
2849a summary of statistics regarding the scanner it generates.
2850Most of the statistics are meaningless to the casual
2851.Nm
2852user, but the first line identifies the version of
2853.Nm
2854(same as reported by
2855.Fl V ) ,
2856and the next line the flags used when generating the scanner,
2857including those that are on by default.
2858.It Fl w
2859Suppresses warning messages.
2860.It Fl +
2861Specifies that
2862.Nm
2863should generate a C++ scanner class.
2864See the section on
2865.Sx GENERATING C++ SCANNERS
2866below for details.
2867.El
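.Pp
As a minimal sketch of how the
.Fl d
option might be used, the scanner below leaves tracing compiled in
but lets the program decide at run time whether it is active.
The rules and the command-line convention
.Pq any argument enables tracing
are hypothetical; the example assumes the scanner is generated with
.Fl d
and linked with
.Fl lfl
to obtain
.Fn yywrap .
.Bd -literal -offset indent
%%
[0-9]+  printf("number: %s\en", yytext);
\&.|\en    /* discard everything else */
%%
int
main(int argc, char *argv[])
{
        /* illustrative convention: any argument turns tracing on */
        yy_flex_debug = (argc > 1);
        yylex();
        return 0;
}
.Ed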
2868.Pp
2869.Nm
2870also provides a mechanism for controlling options within the
2871scanner specification itself, rather than from the
2872.Nm
2873command line.
2874This is done by including
2875.Dq %option
2876directives in the first section of the scanner specification.
2877Multiple options can be specified with a single
2878.Dq %option
directive, and multiple directives may appear in the first section of the
2880.Nm
2881input file.
2882.Pp
2883Most options are given simply as names, optionally preceded by the word
2884.Qq no
2885.Pq with no intervening whitespace
2886to negate their meaning.
2887A number are equivalent to
2888.Nm
2889flags or their negation:
2890.Bd -unfilled -offset indent
28917bit            -7 option
28928bit            -8 option
2893align           -Ca option
2894backup          -b option
2895batch           -B option
2896c++             -+ option
2897
2898caseful or
2899case-sensitive  opposite of -i (default)
2900
2901case-insensitive or
2902caseless        -i option
2903
2904debug           -d option
2905default         opposite of -s option
2906ecs             -Ce option
2907fast            -F option
2908full            -f option
2909interactive     -I option
2910lex-compat      -l option
2911meta-ecs        -Cm option
2912perf-report     -p option
2913read            -Cr option
2914stdout          -t option
2915verbose         -v option
2916warn            opposite of -w option
2917                (use "%option nowarn" for -w)
2918
2919array           equivalent to "%array"
2920pointer         equivalent to "%pointer" (default)
2921.Ed
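.Pp
For illustration, the following small specification requests a
case-insensitive scanner with the default rule suppressed,
which would otherwise require the
.Fl i
and
.Fl s
flags on the command line.
The rules themselves are hypothetical, and the sketch assumes that
.Fn main
and
.Fn yywrap
are taken from
.Fl lfl .
.Bd -literal -offset indent
%option case-insensitive
%option nodefault

%%
begin   |
end     printf("keyword: %s\en", yytext);
[a-z]+  printf("word: %s\en", yytext);
\&.|\en    /* every other character is matched here */
.Ed
.Pp
Because the last rule matches any single character,
the suppressed default rule is never needed.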
2922.Pp
2923Some %option's provide features otherwise not available:
2924.Bl -tag -width Ds
2925.It always-interactive
2926Instructs
2927.Nm
2928to generate a scanner which always considers its input
2929.Qq interactive .
2930Normally, on each new input file the scanner calls
2931.Fn isatty
2932in an attempt to determine whether the scanner's input source is interactive
2933and thus should be read a character at a time.
2934When this option is used, however, no such call is made.
2935.It main
2936Directs
2937.Nm
2938to provide a default
2939.Fn main
2940program for the scanner, which simply calls
2941.Fn yylex .
2942This option implies
2943.Dq noyywrap
2944.Pq see below .
2945.It never-interactive
2946Instructs
2947.Nm
2948to generate a scanner which never considers its input
2949.Qq interactive
2950(again, no call made to
2951.Fn isatty ) .
2952This is the opposite of
2953.Dq always-interactive .
2954.It stack
2955Enables the use of start condition stacks
2956(see
2957.Sx START CONDITIONS
2958above).
2959.It stdinit
2960If set (i.e.,
2961.Dq %option stdinit ) ,
2962initializes
2963.Fa yyin
2964and
2965.Fa yyout
2966to stdin and stdout, instead of the default of
2967.Dq nil .
2968Some existing
2969.Nm lex
2970programs depend on this behavior, even though it is not compliant with ANSI C,
2971which does not require stdin and stdout to be compile-time constant.
2972.It yylineno
2973Directs
2974.Nm
2975to generate a scanner that maintains the number of the current line
2976read from its input in the global variable
2977.Fa yylineno .
2978This option is implied by
2979.Dq %option lex-compat .
2980.It yywrap
2981If unset (i.e.,
2982.Dq %option noyywrap ) ,
2983makes the scanner not call
2984.Fn yywrap
2985upon an end-of-file, but simply assume that there are no more files to scan
2986(until the user points
2987.Fa yyin
2988at a new file and calls
2989.Fn yylex
2990again).
2991.El
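.Pp
As a sketch of how two of these options combine,
the specification below is a complete program:
.Dq main
supplies
.Fn main
(and implies
.Dq noyywrap ) ,
while
.Dq yylineno
keeps the line counter current.
The
.Qq TODO
pattern and the message format are hypothetical.
.Bd -literal -offset indent
%option main yylineno

%%
TODO    printf("line %d: TODO\en", yylineno);
\&.|\en    /* skip everything else */
.Ed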
2992.Pp
2993.Nm
2994scans rule actions to determine whether the
2995.Em REJECT
2996or
2997.Fn yymore
2998features are being used.
2999The
3000.Dq reject
3001and
3002.Dq yymore
options are available to override its decision as to whether the
features are used, either by setting them (e.g.,
3005.Dq %option reject )
3006to indicate the feature is indeed used,
3007or unsetting them to indicate it actually is not used
3008(e.g.,
3009.Dq %option noyymore ) .
3010.Pp
3011Three options take string-delimited values, offset with
3012.Sq = :
3013.Pp
3014.D1 %option outfile="ABC"
3015.Pp
3016is equivalent to
3017.Fl o Ns Ar ABC ,
3018and
3019.Pp
3020.D1 %option prefix="XYZ"
3021.Pp
3022is equivalent to
3023.Fl P Ns Ar XYZ .
3024Finally,
3025.Pp
3026.D1 %option yyclass="foo"
3027.Pp
3028only applies when generating a C++ scanner
3029.Pf ( Fl +
3030option).
3031It informs
3032.Nm
3033that
3034.Dq foo
3035has been derived as a subclass of yyFlexLexer, so
3036.Nm
3037will place actions in the member function
3038.Dq foo::yylex()
3039instead of
3040.Dq yyFlexLexer::yylex() .
3041It also generates a
3042.Dq yyFlexLexer::yylex()
3043member function that emits a run-time error (by invoking
3044.Dq yyFlexLexer::LexerError() )
3045if called.
3046See
3047.Sx GENERATING C++ SCANNERS ,
3048below, for additional information.
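.Pp
A brief sketch of the first two of these options in use;
the prefix
.Qq cfg
and the file name
.Pa cfg_scan.c
are hypothetical:
.Bd -literal -offset indent
%option prefix="cfg" outfile="cfg_scan.c"
%option noyywrap

%%
[a-z]+  printf("word: %s\en", yytext);
\&.|\en    /* skip everything else */
.Ed
.Pp
Following the description of
.Fl P
above, other source files would call
.Fn cfglex
and set
.Fa cfgin ,
while the actions in the specification can keep using the
.Qq yy
names.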
3049.Pp
3050A number of options are available for
3051lint
3052purists who want to suppress the appearance of unneeded routines
3053in the generated scanner.
3054Each of the following, if unset
3055(e.g.,
3056.Dq %option nounput ) ,
3057results in the corresponding routine not appearing in the generated scanner:
3058.Bd -unfilled -offset indent
3059input, unput
3060yy_push_state, yy_pop_state, yy_top_state
3061yy_scan_buffer, yy_scan_bytes, yy_scan_string
3062.Ed
3063.Pp
3064(though
3065.Fn yy_push_state
3066and friends won't appear anyway unless
3067.Dq %option stack
3068is being used).
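.Pp
To illustrate, the sketch below enables the start condition stack
but omits two routines that it does not call.
The
.Qq comment
start condition and the rules are hypothetical;
.Dq %option main
is used only to make the example complete.
.Bd -literal -offset indent
%option main stack
%option nounput noyy_top_state
%x comment

%%
"/*"            yy_push_state(comment);
<comment>"*/"   yy_pop_state();
<comment>.|\en   /* discard comment text */
\&.|\en            ECHO;
.Ed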
3069.Sh PERFORMANCE CONSIDERATIONS
3070The main design goal of
3071.Nm
3072is that it generate high-performance scanners.
3073It has been optimized for dealing well with large sets of rules.
3074Aside from the effects on scanner speed of the table compression
3075.Fl C
3076options outlined above,
3077there are a number of options/actions which degrade performance.
3078These are, from most expensive to least:
3079.Bd -unfilled -offset indent
3080REJECT
3081%option yylineno
3082arbitrary trailing context
3083
3084pattern sets that require backing up
3085%array
3086%option interactive
3087%option always-interactive
3088
3089\&'^' beginning-of-line operator
3090yymore()
3091.Ed
3092.Pp
3093with the first three all being quite expensive
3094and the last two being quite cheap.
3095Note also that
3096.Fn unput
3097is implemented as a routine call that potentially does quite a bit of work,
3098while
3099.Fn yyless
3100is a quite-cheap macro; so if just putting back some excess text,
3101use
3102.Fn yyless .
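.Pp
For example, the following sketch uses
.Fn yyless
to return the two characters of a
.Qq \&..
range operator to the input after they were matched
together with the preceding number.
The token syntax is hypothetical.
.Bd -literal -offset indent
%option main

%%
[0-9]+".."   {
                /* keep the digits; rescan the ".." */
                yyless(yyleng - 2);
                printf("number: %s\en", yytext);
             }
".."         printf("range operator\en");
[0-9]+       printf("number: %s\en", yytext);
\&.|\en         /* skip everything else */
.Ed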
3103.Pp
3104.Em REJECT
3105should be avoided at all costs when performance is important.
3106It is a particularly expensive option.
3107.Pp
3108Getting rid of backing up is messy and often may be an enormous
3109amount of work for a complicated scanner.
In principle, one begins by using the
3111.Fl b
3112flag to generate a
3113.Pa lex.backup
3114file.
3115For example, on the input
3116.Bd -literal -offset indent
3117%%
3118foo        return TOK_KEYWORD;
3119foobar     return TOK_KEYWORD;
3120.Ed
3121.Pp
3122the file looks like:
3123.Bd -literal -offset indent
3124State #6 is non-accepting -
3125 associated rule line numbers:
3126       2       3
3127 out-transitions: [ o ]
3128 jam-transitions: EOF [ \e001-n  p-\e177 ]
3129
3130State #8 is non-accepting -
3131 associated rule line numbers:
3132       3
3133 out-transitions: [ a ]
3134 jam-transitions: EOF [ \e001-`  b-\e177 ]
3135
3136State #9 is non-accepting -
3137 associated rule line numbers:
3138       3
3139 out-transitions: [ r ]
3140 jam-transitions: EOF [ \e001-q  s-\e177 ]
3141
3142Compressed tables always back up.
3143.Ed
3144.Pp
3145The first few lines tell us that there's a scanner state in
3146which it can make a transition on an
3147.Sq o
3148but not on any other character,
3149and that in that state the currently scanned text does not match any rule.
3150The state occurs when trying to match the rules found
3151at lines 2 and 3 in the input file.
3152If the scanner is in that state and then reads something other than an
3153.Sq o ,
3154it will have to back up to find a rule which is matched.
3155With a bit of headscratching one can see that this must be the
3156state it's in when it has seen
3157.Sq fo .
3158When this has happened, if anything other than another
3159.Sq o
3160is seen, the scanner will have to back up to simply match the
3161.Sq f
3162.Pq by the default rule .
3163.Pp
3164The comment regarding State #8 indicates there's a problem when
3165.Qq foob
3166has been scanned.
3167Indeed, on any character other than an
3168.Sq a ,
3169the scanner will have to back up to accept
3170.Qq foo .
3171Similarly, the comment for State #9 concerns when
3172.Qq fooba
3173has been scanned and an
3174.Sq r
3175does not follow.
3176.Pp
3177The final comment reminds us that there's no point going to
3178all the trouble of removing backing up from the rules unless we're using
3179.Fl Cf
3180or
3181.Fl CF ,
3182since there's no performance gain doing so with compressed scanners.
3183.Pp
3184The way to remove the backing up is to add
3185.Qq error
3186rules:
3187.Bd -literal -offset indent
3188%%
3189foo    return TOK_KEYWORD;
3190foobar return TOK_KEYWORD;
3191
3192fooba  |
3193foob   |
3194fo {
3195        /* false alarm, not really a keyword */
3196        return TOK_ID;
3197}
3198.Ed
3199.Pp
3200Eliminating backing up among a list of keywords can also be done using a
3201.Qq catch-all
3202rule:
3203.Bd -literal -offset indent
3204%%
3205foo    return TOK_KEYWORD;
3206foobar return TOK_KEYWORD;
3207
3208[a-z]+ return TOK_ID;
3209.Ed
3210.Pp
3211This is usually the best solution when appropriate.
3212.Pp
3213Backing up messages tend to cascade.
3214With a complicated set of rules it's not uncommon to get hundreds of messages.
3215If one can decipher them, though,
3216it often only takes a dozen or so rules to eliminate the backing up
3217(though it's easy to make a mistake and have an error rule accidentally match
3218a valid token; a possible future
3219.Nm
3220feature will be to automatically add rules to eliminate backing up).
3221.Pp
3222It's important to keep in mind that the benefits of eliminating
3223backing up are gained only if
3224.Em every
3225instance of backing up is eliminated.
3226Leaving just one gains nothing.
3227.Pp
3228.Em Variable
3229trailing context
3230(where both the leading and trailing parts do not have a fixed length)
3231entails almost the same performance loss as
3232.Em REJECT
3233.Pq i.e., substantial .
3234So when possible a rule like:
3235.Bd -literal -offset indent
3236%%
3237mouse|rat/(cat|dog)   run();
3238.Ed
3239.Pp
3240is better written:
3241.Bd -literal -offset indent
3242%%
3243mouse/cat|dog         run();
3244rat/cat|dog           run();
3245.Ed
3246.Pp
3247or as
3248.Bd -literal -offset indent
3249%%
3250mouse|rat/cat         run();
3251mouse|rat/dog         run();
3252.Ed
3253.Pp
3254Note that here the special
3255.Sq |\&
3256action does not provide any savings, and can even make things worse (see
3257.Sx BUGS
3258below).
3259.Pp
3260Another area where the user can increase a scanner's performance
3261.Pq and one that's easier to implement
3262arises from the fact that the longer the tokens matched,
3263the faster the scanner will run.
3264This is because with long tokens the processing of most input
3265characters takes place in the
3266.Pq short
3267inner scanning loop, and does not often have to go through the additional work
3268of setting up the scanning environment (e.g.,
3269.Fa yytext )
3270for the action.
3271Recall the scanner for C comments:
3272.Bd -literal -offset indent
3273%x comment
3274%%
        int line_num = 1;
3276
3277"/*"                    BEGIN(comment);
3278
3279<comment>[^*\en]*
3280<comment>"*"+[^*/\en]*
3281<comment>\en             ++line_num;
3282<comment>"*"+"/"        BEGIN(INITIAL);
3283.Ed
3284.Pp
3285This could be sped up by writing it as:
3286.Bd -literal -offset indent
3287%x comment
3288%%
        int line_num = 1;
3290
3291"/*"                    BEGIN(comment);
3292
3293<comment>[^*\en]*
3294<comment>[^*\en]*\en      ++line_num;
3295<comment>"*"+[^*/\en]*
3296<comment>"*"+[^*/\en]*\en ++line_num;
3297<comment>"*"+"/"        BEGIN(INITIAL);
3298.Ed
3299.Pp
3300Now instead of each newline requiring the processing of another action,
3301recognizing the newlines is
3302.Qq distributed
3303over the other rules to keep the matched text as long as possible.
3304Note that adding rules does
3305.Em not
3306slow down the scanner!
3307The speed of the scanner is independent of the number of rules or
3308(modulo the considerations given at the beginning of this section)
3309how complicated the rules are with regard to operators such as
3310.Sq *
3311and
3312.Sq |\& .
3313.Pp
3314A final example in speeding up a scanner:
3315scan through a file containing identifiers and keywords, one per line
3316and with no other extraneous characters, and recognize all the keywords.
3317A natural first approach is:
3318.Bd -literal -offset indent
3319%%
3320asm      |
3321auto     |
3322break    |
3323\&... etc ...
3324volatile |
3325while    /* it's a keyword */
3326
3327\&.|\en     /* it's not a keyword */
3328.Ed
3329.Pp
To eliminate the backing up, introduce a catch-all rule:
3331.Bd -literal -offset indent
3332%%
3333asm      |
3334auto     |
3335break    |
3336\&... etc ...
3337volatile |
3338while    /* it's a keyword */
3339
3340[a-z]+   |
3341\&.|\en     /* it's not a keyword */
3342.Ed
3343.Pp
3344Now, if it's guaranteed that there's exactly one word per line,
3345then we can reduce the total number of matches by a half by
3346merging in the recognition of newlines with that of the other tokens:
3347.Bd -literal -offset indent
3348%%
3349asm\en      |
3350auto\en     |
3351break\en    |
3352\&... etc ...
3353volatile\en |
3354while\en    /* it's a keyword */
3355
3356[a-z]+\en   |
3357\&.|\en       /* it's not a keyword */
3358.Ed
3359.Pp
3360One has to be careful here,
3361as we have now reintroduced backing up into the scanner.
3362In particular, while we know that there will never be any characters
3363in the input stream other than letters or newlines,
3364.Nm
3365can't figure this out, and it will plan for possibly needing to back up
3366when it has scanned a token like
3367.Qq auto
3368and then the next character is something other than a newline or a letter.
3369Previously it would then just match the
3370.Qq auto
3371rule and be done, but now it has no
3372.Qq auto
3373rule, only an
3374.Qq auto\en
3375rule.
3376To eliminate the possibility of backing up,
3377we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't care
3379how it's classified, we can introduce one more catch-all rule,
3380this one which doesn't include a newline:
3381.Bd -literal -offset indent
3382%%
3383asm\en      |
3384auto\en     |
3385break\en    |
3386\&... etc ...
3387volatile\en |
3388while\en    /* it's a keyword */
3389
3390[a-z]+\en   |
3391[a-z]+     |
3392\&.|\en       /* it's not a keyword */
3393.Ed
3394.Pp
3395Compiled with
3396.Fl Cf ,
3397this is about as fast as one can get a
3398.Nm
3399scanner to go for this particular problem.
3400.Pp
3401A final note:
3402.Nm
3403is slow when matching NUL's,
3404particularly when a token contains multiple NUL's.
3405It's best to write rules which match short
3406amounts of text if it's anticipated that the text will often include NUL's.
3407.Pp
3408Another final note regarding performance: as mentioned above in the section
3409.Sx HOW THE INPUT IS MATCHED ,
3410dynamically resizing
3411.Fa yytext
3412to accommodate huge tokens is a slow process because it presently requires that
3413the
3414.Pq huge
3415token be rescanned from the beginning.
3416Thus if performance is vital, it is better to attempt to match
3417.Qq large
3418quantities of text but not
3419.Qq huge
3420quantities, where the cutoff between the two is at about 8K characters/token.
3421.Sh GENERATING C++ SCANNERS
3422.Nm
3423provides two different ways to generate scanners for use with C++.
3424The first way is to simply compile a scanner generated by
3425.Nm
3426using a C++ compiler instead of a C compiler.
3427This should not generate any compilation errors
3428(please report any found to the email address given in the
3429.Sx AUTHORS
3430section below).
3431C++ code can then be used in rule actions instead of C code.
3432Note that the default input source for scanners remains
3433.Fa yyin ,
3434and default echoing is still done to
3435.Fa yyout .
3436Both of these remain
3437.Fa FILE *
3438variables and not C++ streams.
3439.Pp
3440.Nm
3441can also be used to generate a C++ scanner class, using the
3442.Fl +
3443option (or, equivalently,
3444.Dq %option c++ ) ,
3445which is automatically specified if the name of the flex executable ends in a
3446.Sq + ,
3447such as
3448.Nm flex++ .
3449When using this option,
3450.Nm
3451defaults to generating the scanner to the file
3452.Pa lex.yy.cc
3453instead of
3454.Pa lex.yy.c .
3455The generated scanner includes the header file
3456.Aq Pa g++/FlexLexer.h ,
3457which defines the interface to two C++ classes.
3458.Pp
3459The first class,
3460.Em FlexLexer ,
3461provides an abstract base class defining the general scanner class interface.
3462It provides the following member functions:
3463.Bl -tag -width Ds
3464.It const char* YYText()
3465Returns the text of the most recently matched token, the equivalent of
3466.Fa yytext .
3467.It int YYLeng()
3468Returns the length of the most recently matched token, the equivalent of
3469.Fa yyleng .
3470.It int lineno() const
3471Returns the current input line number
3472(see
3473.Dq %option yylineno ) ,
3474or 1 if
3475.Dq %option yylineno
3476was not used.
3477.It void set_debug(int flag)
3478Sets the debugging flag for the scanner, equivalent to assigning to
3479.Fa yy_flex_debug
3480(see the
3481.Sx OPTIONS
3482section above).
3483Note that the scanner must be built using
3484.Dq %option debug
3485to include debugging information in it.
3486.It int debug() const
3487Returns the current setting of the debugging flag.
3488.El
3489.Pp
3490Also provided are member functions equivalent to
3491.Fn yy_switch_to_buffer ,
3492.Fn yy_create_buffer
3493(though the first argument is an
3494.Fa std::istream*
3495object pointer and not a
3496.Fa FILE* ) ,
3497.Fn yy_flush_buffer ,
3498.Fn yy_delete_buffer ,
3499and
3500.Fn yyrestart
3501(again, the first argument is an
3502.Fa std::istream*
3503object pointer).
3504.Pp
3505The second class defined in
3506.Aq Pa g++/FlexLexer.h
3507is
3508.Fa yyFlexLexer ,
3509which is derived from
3510.Fa FlexLexer .
3511It defines the following additional member functions:
3512.Bl -tag -width Ds
3513.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
3514Constructs a
3515.Fa yyFlexLexer
3516object using the given streams for input and output.
3517If not specified, the streams default to
3518.Fa cin
3519and
3520.Fa cout ,
3521respectively.
3522.It virtual int yylex()
3523Performs the same role as
3524.Fn yylex
3525does for ordinary flex scanners: it scans the input stream, consuming
3526tokens, until a rule's action returns a value.
3527If subclass
3528.Sq S
3529is derived from
3530.Fa yyFlexLexer ,
3531in order to access the member functions and variables of
3532.Sq S
3533inside
3534.Fn yylex ,
3535use
3536.Dq %option yyclass="S"
3537to inform
3538.Nm
3539that the
3540.Sq S
3541subclass will be used instead of
3542.Fa yyFlexLexer .
3543In this case, rather than generating
3544.Dq yyFlexLexer::yylex() ,
3545.Nm
3546generates
3547.Dq S::yylex()
3548(and also generates a dummy
3549.Dq yyFlexLexer::yylex()
3550that calls
3551.Dq yyFlexLexer::LexerError()
3552if called).
3553.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
3554Reassigns
3555.Fa yyin
3556to
3557.Fa new_in
3558.Pq if non-nil
3559and
3560.Fa yyout
3561to
3562.Fa new_out
3563.Pq ditto ,
3564deleting the previous input buffer if
3565.Fa yyin
3566is reassigned.
3567.It int yylex(std::istream* new_in, std::ostream* new_out = 0)
3568First switches the input streams via
3569.Dq switch_streams(new_in, new_out)
3570and then returns the value of
3571.Fn yylex .
3572.El
3573.Pp
3574In addition,
3575.Fa yyFlexLexer
3576defines the following protected virtual functions which can be redefined
3577in derived classes to tailor the scanner:
3578.Bl -tag -width Ds
3579.It virtual int LexerInput(char* buf, int max_size)
3580Reads up to
3581.Fa max_size
3582characters into
3583.Fa buf
3584and returns the number of characters read.
3585To indicate end-of-input, return 0 characters.
3586Note that
3587.Qq interactive
3588scanners (see the
3589.Fl B
3590and
3591.Fl I
3592flags) define the macro
3593.Dv YY_INTERACTIVE .
3594If
3595.Fn LexerInput
3596has been redefined, and it's necessary to take different actions depending on
3597whether or not the scanner might be scanning an interactive input source,
3598it's possible to test for the presence of this name via
3599.Dq #ifdef .
3600.It virtual void LexerOutput(const char* buf, int size)
3601Writes out
3602.Fa size
3603characters from the buffer
3604.Fa buf ,
3605which, while NUL-terminated, may also contain
3606.Qq internal
3607NUL's if the scanner's rules can match text with NUL's in them.
3608.It virtual void LexerError(const char* msg)
3609Reports a fatal error message.
3610The default version of this function writes the message to the stream
3611.Fa cerr
3612and exits.
3613.El
3614.Pp
3615Note that a
3616.Fa yyFlexLexer
3617object contains its entire scanning state.
3618Thus such objects can be used to create reentrant scanners.
3619Multiple instances of the same
3620.Fa yyFlexLexer
3621class can be instantiated, and multiple C++ scanner classes can be combined
3622in the same program using the
3623.Fl P
3624option discussed above.
3625.Pp
3626Finally, note that the
3627.Dq %array
3628feature is not available to C++ scanner classes;
3629.Dq %pointer
3630must be used
3631.Pq the default .
3632.Pp
3633Here is an example of a simple C++ scanner:
.Bd -literal -offset indent
// An example of using the flex C++ scanner class.

%{
#include <errno.h>
int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        /* stop at end-of-file so an unterminated comment cannot loop */
        while ((c = yyinput()) != 0 && c != EOF) {
                if(c == '\en')
                    ++mylineno;
                else if(c == '*') {
                    if ((c = yyinput()) == '/')
                        break;
                    else
                        unput(c);
                }
        }
}

{number}  cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    cout << "name " << YYText() << '\en';

{string}  cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
	FlexLexer* lexer = new yyFlexLexer;
	while(lexer->yylex() != 0)
	    ;
	return 0;
}
.Ed
3690.Pp
3691To create multiple
3692.Pq different
3693lexer classes, use the
3694.Fl P
3695flag
3696(or the
3697.Dq prefix=
3698option)
3699to rename each
3700.Fa yyFlexLexer
3701to some other
3702.Fa xxFlexLexer .
3703.Aq Pa g++/FlexLexer.h
3704can then be included in other sources once per lexer class, first renaming
3705.Fa yyFlexLexer
3706as follows:
.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
.Ed
3716.Pp
This example assumes that
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
3722.Pp
3723.Sy IMPORTANT :
3724the present form of the scanning class is experimental
3725and may change considerably between major releases.
3726.Sh INCOMPATIBILITIES WITH LEX AND POSIX
3727.Nm
3728is a rewrite of the
3729.At
3730.Nm lex
3731tool
3732(the two implementations do not share any code, though),
3733with some extensions and incompatibilities, both of which are of concern
3734to those who wish to write scanners acceptable to either implementation.
3735.Nm
3736is fully compliant with the
3737.Tn POSIX
3738.Nm lex
3739specification, except that when using
3740.Dq %pointer
3741.Pq the default ,
3742a call to
3743.Fn unput
3744destroys the contents of
3745.Fa yytext ,
3746which is counter to the
3747.Tn POSIX
3748specification.
3749.Pp
3750In this section we discuss all of the known areas of incompatibility between
3751.Nm ,
3752AT&T
3753.Nm lex ,
3754and the
3755.Tn POSIX
3756specification.
3757.Pp
3758.Nm flex Ns 's
3759.Fl l
3760option turns on maximum compatibility with the original AT&T
3761.Nm lex
3762implementation, at the cost of a major loss in the generated scanner's
3763performance.
3764We note below which incompatibilities can be overcome using the
3765.Fl l
3766option.
3767.Pp
3768.Nm
3769is fully compatible with
3770.Nm lex
3771with the following exceptions:
3772.Bl -dash
3773.It
3774The undocumented
3775.Nm lex
3776scanner internal variable
3777.Fa yylineno
3778is not supported unless
3779.Fl l
3780or
3781.Dq %option yylineno
3782is used.
3783.Pp
3784.Fa yylineno
3785should be maintained on a per-buffer basis, rather than a per-scanner
3786.Pq single global variable
3787basis.
3788.Pp
3789.Fa yylineno
3790is not part of the
3791.Tn POSIX
3792specification.
3793.It
3794The
3795.Fn input
3796routine is not redefinable, though it may be called to read characters
3797following whatever has been matched by a rule.
3798If
3799.Fn input
3800encounters an end-of-file, the normal
3801.Fn yywrap
3802processing is done.
3803A
3804.Dq real
3805end-of-file is returned by
3806.Fn input
3807as
3808.Dv EOF .
3809.Pp
3810Input is instead controlled by defining the
3811.Dv YY_INPUT
3812macro.
3813.Pp
3814The
3815.Nm
3816restriction that
3817.Fn input
3818cannot be redefined is in accordance with the
3819.Tn POSIX
3820specification, which simply does not specify any way of controlling the
3821scanner's input other than by making an initial assignment to
3822.Fa yyin .
3823.It
3824The
3825.Fn unput
3826routine is not redefinable.
3827This restriction is in accordance with
3828.Tn POSIX .
3829.It
3830.Nm
3831scanners are not as reentrant as
3832.Nm lex
3833scanners.
3834In particular, if a scanner is interactive and
3835an interrupt handler long-jumps out of the scanner,
3836and the scanner is subsequently called again,
3837the following error message may be displayed:
3838.Pp
3839.D1 fatal flex scanner internal error--end of buffer missed
3840.Pp
3841To reenter the scanner, first use
3842.Pp
3843.Dl yyrestart(yyin);
3844.Pp
3845Note that this call will throw away any buffered input;
3846usually this isn't a problem with an interactive scanner.
3847.Pp
Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
3850See
3851.Sx GENERATING C++ SCANNERS
3852above for details.
3853.It
3854.Fn output
3855is not supported.
3856Output from the
3857.Em ECHO
3858macro is done to the file-pointer
3859.Fa yyout
3860.Pq default stdout .
3861.Pp
3862.Fn output
3863is not part of the
3864.Tn POSIX
3865specification.
3866.It
3867.Nm lex
3868does not support exclusive start conditions
3869.Pq %x ,
3870though they are in the
3871.Tn POSIX
3872specification.
3873.It
3874When definitions are expanded,
3875.Nm
3876encloses them in parentheses.
3877With
3878.Nm lex ,
3879the following:
.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
.Ed
3886.Pp
3887will not match the string
3888.Qq foo
3889because when the macro is expanded the rule is equivalent to
3890.Qq foo[A-Z][A-Z0-9]*?
3891and the precedence is such that the
3892.Sq ?\&
3893is associated with
3894.Qq [A-Z0-9]* .
3895With
3896.Nm ,
3897the rule will be expanded to
3898.Qq foo([A-Z][A-Z0-9]*)?
3899and so the string
3900.Qq foo
3901will match.
3902.Pp
3903Note that if the definition begins with
3904.Sq ^
3905or ends with
3906.Sq $
3907then it is not expanded with parentheses, to allow these operators to appear in
3908definitions without losing their special meanings.
3909But the
3910.Sq Aq s ,
3911.Sq / ,
3912and
3913.Aq Aq EOF
3914operators cannot be used in a
3915.Nm
3916definition.
3917.Pp
3918Using
3919.Fl l
3920results in the
3921.Nm lex
3922behavior of no parentheses around the definition.
3923.Pp
3924The
3925.Tn POSIX
3926specification is that the definition be enclosed in parentheses.
3927.It
3928Some implementations of
3929.Nm lex
3930allow a rule's action to begin on a separate line,
3931if the rule's pattern has trailing whitespace:
.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
.Ed
3937.Pp
3938.Nm
3939does not support this feature.
3940.It
3941The
3942.Nm lex
3943.Sq %r
3944.Pq generate a Ratfor scanner
3945option is not supported.
3946It is not part of the
3947.Tn POSIX
3948specification.
3949.It
3950After a call to
3951.Fn unput ,
3952.Fa yytext
3953is undefined until the next token is matched,
3954unless the scanner was built using
3955.Dq %array .
3956This is not the case with
3957.Nm lex
3958or the
3959.Tn POSIX
3960specification.
3961The
3962.Fl l
3963option does away with this incompatibility.
3964.It
3965The precedence of the
3966.Sq {}
3967.Pq numeric range
3968operator is different.
3969.Nm lex
3970interprets
3971.Qq abc{1,3}
3972as match one, two, or three occurrences of
3973.Sq abc ,
3974whereas
3975.Nm
3976interprets it as match
3977.Sq ab
3978followed by one, two, or three occurrences of
3979.Sq c .
3980The latter is in agreement with the
3981.Tn POSIX
3982specification.
3983.It
3984The precedence of the
3985.Sq ^
3986operator is different.
3987.Nm lex
3988interprets
3989.Qq ^foo|bar
3990as match either
3991.Sq foo
3992at the beginning of a line, or
3993.Sq bar
3994anywhere, whereas
3995.Nm
3996interprets it as match either
3997.Sq foo
3998or
3999.Sq bar
4000if they come at the beginning of a line.
4001The latter is in agreement with the
4002.Tn POSIX
4003specification.
4004.It
4005The special table-size declarations such as
4006.Sq %a
4007supported by
4008.Nm lex
4009are not required by
4010.Nm
4011scanners;
4012.Nm
4013ignores them.
4014.It
4015The name
4016.Dv FLEX_SCANNER
4017is #define'd so scanners may be written for use with either
4018.Nm
4019or
4020.Nm lex .
4021Scanners also include
4022.Dv YY_FLEX_MAJOR_VERSION
4023and
4024.Dv YY_FLEX_MINOR_VERSION
4025indicating which version of
4026.Nm
4027generated the scanner
4028(for example, for the 2.5 release, these defines would be 2 and 5,
4029respectively).
4030.El
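.Pp
The definition-expansion and operator-precedence differences listed above
can be explored with a small sketch.
It uses the C++ standard library's
.Fa std::regex
(default ECMAScript grammar) purely as a stand-in matcher;
it is neither the
.Nm
nor the
.Nm lex
engine, and the helper names
.Fn expand_definition ,
.Fn matches_all ,
and
.Fn matches_somewhere
are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: textually expands every {name} in a pattern,
// optionally wrapping the definition in parentheses as flex does.
std::string expand_definition(std::string pattern, const std::string& name,
                              const std::string& definition, bool parenthesize)
{
    const std::string ref = "{" + name + "}";
    const std::string body = parenthesize ? "(" + definition + ")" : definition;
    std::string::size_type pos = 0;
    while ((pos = pattern.find(ref, pos)) != std::string::npos) {
        pattern.replace(pos, ref.size(), body);
        pos += body.size();   // skip past the text just inserted
    }
    return pattern;
}

// True if the whole of text is matched by pattern (std::regex syntax).
bool matches_all(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

// True if pattern matches somewhere in text.
bool matches_somewhere(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}
```

Expanding
.Qq foo{NAME}?
with parentheses yields
.Qq foo([A-Z][A-Z0-9]*)? ,
which matches
.Qq foo ;
without them it yields
.Qq foo[A-Z][A-Z0-9]*? .
For the numeric range operator,
.Fa std::regex
happens to read
.Qq abc{1,3}
the way
.Nm
does, so the
.Nm lex
reading has to be spelled
.Qq (abc){1,3} .
For the
.Sq ^
operator the situation is reversed:
.Fa std::regex
reads
.Qq ^foo|bar
the way
.Nm lex
does, and the
.Nm
reading corresponds to
.Qq ^(foo|bar) .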
4031.Pp
4032The following
4033.Nm
4034features are not included in
4035.Nm lex
4036or the
4037.Tn POSIX
4038specification:
4039.Bd -unfilled -offset indent
4040C++ scanners
4041%option
4042start condition scopes
4043start condition stacks
4044interactive/non-interactive scanners
4045yy_scan_string() and friends
4046yyterminate()
4047yy_set_interactive()
4048yy_set_bol()
4049YY_AT_BOL()
4050<<EOF>>
4051<*>
4052YY_DECL
4053YY_START
4054YY_USER_ACTION
4055YY_USER_INIT
4056#line directives
4057%{}'s around actions
4058multiple actions on a line
4059.Ed
4060.Pp
4061plus almost all of the
4062.Nm
4063flags.
The last feature in the list refers to the fact that with
.Nm
multiple actions can be placed on the same line,
4067separated with semi-colons, while with
4068.Nm lex ,
4069the following
4070.Pp
4071.Dl foo    handle_foo(); ++num_foos_seen;
4072.Pp
4073is
4074.Pq rather surprisingly
4075truncated to
4076.Pp
4077.Dl foo    handle_foo();
4078.Pp
4079.Nm
4080does not truncate the action.
4081Actions that are not enclosed in braces
4082are simply terminated at the end of the line.
4083.Sh FILES
4084.Bl -tag -width "<g++/FlexLexer.h>"
4085.It flex.skl
4086Skeleton scanner.
4087This file is only used when building flex, not when
4088.Nm
4089executes.
4090.It lex.backup
4091Backing-up information for the
4092.Fl b
4093flag (called
4094.Pa lex.bck
4095on some systems).
4096.It lex.yy.c
4097Generated scanner
4098(called
4099.Pa lexyy.c
4100on some systems).
4101.It lex.yy.cc
4102Generated C++ scanner class, when using
4103.Fl + .
4104.It Aq g++/FlexLexer.h
4105Header file defining the C++ scanner base class,
4106.Fa FlexLexer ,
4107and its derived class,
4108.Fa yyFlexLexer .
4109.It /usr/lib/libl.*
4110.Nm
4111libraries.
4112The
4113.Pa /usr/lib/libfl.*\&
4114libraries are links to these.
4115Scanners must be linked using either
4116.Fl \&ll
4117or
4118.Fl lfl .
4119.El
4120.Sh EXIT STATUS
4121.Ex -std flex
4122.Sh DIAGNOSTICS
4123.Bl -diag
4124.It warning, rule cannot be matched
4125Indicates that the given rule cannot be matched because it follows other rules
4126that will always match the same text as it.
4127For example, in the following
4128.Dq foo
4129cannot be matched because it comes after an identifier
4130.Qq catch-all
4131rule:
.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
.Ed
4136.Pp
4137Using
4138.Em REJECT
4139in a scanner suppresses this warning.
4140.It "warning, \-s option given but default rule can be matched"
4141Means that it is possible
4142.Pq perhaps only in a particular start condition
4143that the default rule
4144.Pq match any single character
4145is the only one that will match a particular input.
4146Since
4147.Fl s
4148was given, presumably this is not intended.
4149.It reject_used_but_not_detected undefined
4150.It yymore_used_but_not_detected undefined
4151These errors can occur at compile time.
4152They indicate that the scanner uses
4153.Em REJECT
4154or
4155.Fn yymore
4156but that
4157.Nm
4158failed to notice the fact, meaning that
4159.Nm
4160scanned the first two sections looking for occurrences of these actions
4161and failed to find any, but somehow they snuck in
4162.Pq via an #include file, for example .
4163Use
4164.Dq %option reject
4165or
4166.Dq %option yymore
4167to indicate to
4168.Nm
4169that these features are really needed.
4170.It flex scanner jammed
4171A scanner compiled with
4172.Fl s
4173has encountered an input string which wasn't matched by any of its rules.
4174This error can also occur due to internal problems.
4175.It token too large, exceeds YYLMAX
4176The scanner uses
4177.Dq %array
4178and one of its rules matched a string longer than the
4179.Dv YYLMAX
4180constant
4181.Pq 8K bytes by default .
4182The value can be increased by #define'ing
4183.Dv YYLMAX
4184in the definitions section of
4185.Nm
4186input.
4187.It "scanner requires \-8 flag to use the character 'x'"
4188The scanner specification includes recognizing the 8-bit character
4189.Sq x
4190and the
4191.Fl 8
4192flag was not specified, and defaulted to 7-bit because the
4193.Fl Cf
4194or
4195.Fl CF
4196table compression options were used.
4197See the discussion of the
4198.Fl 7
4199flag for details.
4200.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
4202could not hold both the pushed-back text and the current token in
4203.Fa yytext .
4204Ideally the scanner should dynamically resize the buffer in this case,
4205but at present it does not.
4206.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4207The scanner was working on matching an extremely large token and needed
4208to expand the input buffer.
4209This doesn't work with scanners that use
4210.Em REJECT .
4211.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner which is reentered after a long-jump
4213has jumped out
4214.Pq or over
4215the scanner's activation frame.
4216Before reentering the scanner, use:
4217.Pp
4218.Dl yyrestart(yyin);
4219.Pp
4220or, as noted above, switch to using the C++ scanner class.
4221.It "too many start conditions in <> construct!"
4222More start conditions than exist were listed in a <> construct
4223(so at least one of them must have been listed twice).
4224.El
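.Pp
The
.Dq rule cannot be matched
warning above follows from how a scanner chooses between rules:
the longest match wins, and when two rules match the same amount of text
the one listed first wins.
The sketch below imitates that selection with
.Fa std::regex ;
the function name
.Fn pick_rule
is hypothetical, and
.Fa std::regex
is only an approximation of the patterns
.Nm
accepts.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns the index of the rule that would fire at the start of input:
// the longest match wins, and on a tie the rule listed first wins.
// Returns -1 if no rule matches any text.
int pick_rule(const std::vector<std::string>& rules, const std::string& input)
{
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        std::smatch m;
        // match_continuous anchors the match at the start of input.
        if (std::regex_search(input, m, std::regex(rules[i]),
                              std::regex_constants::match_continuous)) {
            const std::size_t len = static_cast<std::size_t>(m.length(0));
            // Strict '>' keeps the earlier rule when lengths tie;
            // empty matches are ignored.
            if (len > 0 && (best == -1 || len > best_len)) {
                best = static_cast<int>(i);
                best_len = len;
            }
        }
    }
    return best;
}
```

With the rules in the order
.Qq [a-z]+
then
.Qq foo ,
the input
.Qq foo
always selects the first rule, so the second can never fire.
Listing
.Qq foo
first lets it win the tie on
.Qq foo
while
.Qq [a-z]+
still wins on longer input such as
.Qq foobar .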
4225.Sh SEE ALSO
4226.Xr awk 1 ,
4227.Xr sed 1 ,
4228.Xr yacc 1
4229.Rs
4230.%A John Levine
4231.%A Tony Mason
4232.%A Doug Brown
4233.%B Lex & Yacc
4234.%I O'Reilly and Associates
4235.%N 2nd edition
4236.Re
4237.Rs
4238.%A Alfred Aho
4239.%A Ravi Sethi
4240.%A Jeffrey Ullman
4241.%B Compilers: Principles, Techniques and Tools
4242.%I Addison-Wesley
4243.%D 1986
4244.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4245.Re
4246.Sh STANDARDS
4247The
4248.Nm lex
4249utility is compliant with the
4250.St -p1003.1-2008
4251specification,
4252though its presence is optional.
4253.Pp
4254The flags
4255.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
4256.Op Fl -help ,
4257and
4258.Op Fl -version
4259are extensions to that specification.
4260.Sh AUTHORS
4261Vern Paxson, with the help of many ideas and much inspiration from
4262Van Jacobson.
4263Original version by Jef Poskanzer.
4264The fast table representation is a partial implementation of a design done by
4265Van Jacobson.
4266The implementation was done by Kevin Gong and Vern Paxson.
4267.Pp
4268Thanks to the many
4269.Nm
4270beta-testers, feedbackers, and contributors, especially Francois Pinard,
4271Casey Leedom,
4272Robert Abramovitz,
4273Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4274Neal Becker, Nelson H.F. Beebe, benson@odi.com,
4275Karl Berry, Peter A. Bigot, Simon Blanchard,
4276Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4277Brian Clapper, J.T. Conklin,
4278Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4279Daniels, Chris G. Demetriou, Theo de Raadt,
4280Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4281Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4282Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4283Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4284Jan Hajic, Charles Hemphill, NORO Hideo,
4285Jarkko Hietaniemi, Scott Hofmann,
4286Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4287Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4288Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4289Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
4290Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4291Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4292David Loffredo, Mike Long,
4293Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4294Bengt Martensson, Chris Metcalf,
4295Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4296G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4297Richard Ohnemus, Karsten Pahnke,
4298Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4299Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4300Frederic Raimbault, Pat Rankin, Rick Richardson,
4301Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4302Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4303Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
4304Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
4305Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4306Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4307Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4308Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4309and those whose names have slipped my marginal mail-archiving skills
4310but whose contributions are appreciated all the
4311same.
4312.Pp
4313Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4314John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4315Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4316distribution headaches.
4317.Pp
4318Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4319to Benson Margulies and Fred Burke for C++ support;
4320to Kent Williams and Tom Epperly for C++ class support;
4321to Ove Ewerlid for support of NUL's;
4322and to Eric Hughes for support of multiple buffers.
4323.Pp
4324This work was primarily done when I was with the Real Time Systems Group
4325at the Lawrence Berkeley Laboratory in Berkeley, CA.
4326Many thanks to all there for the support I received.
4327.Pp
4328Send comments to
4329.Aq vern@ee.lbl.gov .
4330.Sh BUGS
4331Some trailing context patterns cannot be properly matched and generate
4332warning messages
4333.Pq "dangerous trailing context" .
4334These are patterns where the ending of the first part of the rule
4335matches the beginning of the second part, such as
4336.Qq zx*/xy* ,
4337where the
4338.Sq x*
4339matches the
4340.Sq x
4341at the beginning of the trailing context.
4342(Note that the POSIX draft states that the text matched by such patterns
4343is undefined.)
4344.Pp
4345For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above-mentioned performance loss.
4347In particular, parts using
4348.Sq |\&
4349or
4350.Sq {n}
4351(such as
4352.Qq foo{3} )
4353are always considered variable-length.
4354.Pp
4355Combining trailing context with the special
4356.Sq |\&
4357action can result in fixed trailing context being turned into
4358the more expensive variable trailing context.
For example, in the following, the fixed trailing context
.Sq def
is treated as variable:
.Bd -literal -offset indent
%%
abc      |
xyz/def
.Ed
4365.Pp
Use of
.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
.Dq %array
4370directive
4371or the
4372.Fl l
4373option has been used.
4374.Pp
4375Pattern-matching of NUL's is substantially slower than matching other
4376characters.
4377.Pp
4378Dynamic resizing of the input buffer is slow, as it entails rescanning
4379all the text matched so far by the current
4380.Pq generally huge
4381token.
4382.Pp
4383Due to both buffering of input and read-ahead,
4384it is not possible to intermix calls to
4385.Aq Pa stdio.h
4386routines, such as, for example,
4387.Fn getchar ,
4388with
4389.Nm
4390rules and expect it to work.
4391Call
4392.Fn input
4393instead.
4394.Pp
The total number of table entries listed by the
.Fl v
flag excludes the table entries needed to determine
which rule has been matched.
The number of excluded entries is equal to the number of DFA states
4400if the scanner does not use
4401.Em REJECT ,
4402and somewhat greater than the number of states if it does.
4403.Pp
4404.Em REJECT
4405cannot be used with the
4406.Fl f
4407or
4408.Fl F
4409options.
4410.Pp
4411The
4412.Nm
4413internal algorithms need documentation.
4414