.\"	$OpenBSD: flex.1,v 1.43 2015/09/21 10:03:46 jmc Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: September 21 2015 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex ,
.Nm flex++ ,
.Nm lex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
.Pp
.Nm lex
is a synonym for
.Nm flex .
.Nm flex++
is a synonym for
.Nm
.Fl + .
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from
.At
.Nm lex
and the
.Tn POSIX
.Nm lex
standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.Nm .
The following
.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en      ++num_lines; ++num_chars;
\&.       ++num_chars;

%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
            num_lines, num_chars);
	return 0;
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
        printf("An integer: %s (%d)\en", yytext,
            atoi(yytext));
}

{DIGIT}+"."{DIGIT}* {
        printf("A float: %s (%g)\en", yytext,
            atof(yytext));
}

if|then|begin|end|procedure|function {
        printf("A keyword: %s\en", yytext);
}

{ID}    printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);

"{"[^}\en]*"}"     /* eat up one-line comments */

[ \et\en]+          /* eat up whitespace */

\&.       printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
        ++argv; --argc;  /* skip over program name */
        if (argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
        return 0;
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and to continue to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.Dl pattern	action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern and, as well as with
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo      |
bar$     /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
.Nm lex
compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer ,
.Fa yytext
grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+        putchar(' ');
[ \et]+$       /* ignore this token */
.Ed
.Pp
If the action contains a
.Sq { ,
then the action extends until the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
%{
int word_count = 0;
%}
%%

frob        special(); REJECT;
[^ \et\en]+   ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
\&.|\en     /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
.It yymore()
Tells the scanner that the next time it matches a rule, the corresponding
token should be appended onto the current value of
.Fa yytext
rather than replacing it.
For example, given the input
.Qq mega-kludge
the following will write
.Qq mega-mega-kludge
to the output:
.Bd -literal -offset indent
%%
mega-    ECHO; yymore();
kludge   ECHO;
.Ed
.Pp
First
.Qq mega-
is matched and echoed to the output.
Then
.Qq kludge
is matched, but the previous
.Qq mega-
is still hanging around at the beginning of
.Fa yytext
so the
.Em ECHO
for the
.Qq kludge
rule will actually write
.Qq mega-kludge .
.Pp
Two notes regarding use of
.Fn yymore :
First,
.Fn yymore
depends on the value of
.Fa yyleng
correctly reflecting the size of the current token, so
.Fa yyleng
must not be modified when using
.Fn yymore .
Second, the presence of
.Fn yymore
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.It yyless(n)
Returns all but the first
.Ar n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.Fa yytext
and
.Fa yyleng
are adjusted appropriately (e.g.,
.Fa yyleng
will now be equal to
.Ar n ) .
For example, on the input
.Qq foobar
the following will write out
.Qq foobarbar :
.Bd -literal -offset indent
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
.Ed
.Pp
An argument of 0 to
.Fa yyless
will cause the entire current input string to be scanned again.
Unless the way the scanner will subsequently process its input has been changed
(using
.Em BEGIN ,
for example),
this will result in an endless loop.
.Pp
Note that
.Fa yyless
is a macro and can only be used in the
.Nm
input file, not from other source files.
.It unput(c)
Puts the character
.Ar c
back into the input stream.
It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.Bd -literal -offset indent
{
        int i;
        char *yycopy;

        /* Copy yytext because unput() trashes yytext */
        if ((yycopy = strdup(yytext)) == NULL)
                err(1, NULL);
        unput(')');
        for (i = yyleng - 1; i >= 0; --i)
                unput(yycopy[i]);
        unput('(');
        free(yycopy);
}
.Ed
.Pp
Note that since each
.Fn unput
puts the given character back at the beginning of the input stream,
pushing back strings must be done back-to-front.
.Pp
An important potential problem when using
.Fn unput
is that if using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
starting with its rightmost character and devouring one character to
the left with each call.
If the value of
.Fa yytext
should be preserved after a call to
.Fn unput
.Pq as in the above example ,
it must either first be copied elsewhere, or the scanner must be built using
.Dq %array
instead (see
.Sx HOW THE INPUT IS MATCHED ) .
.Pp
Finally, note that EOF cannot be put back
in an attempt to mark the input stream with an end-of-file.
.It input()
Reads the next character from the input stream.
For example, the following is one way to eat up C comments:
.Bd -literal -offset indent
%%
"/*" {
        int c;

        for (;;) {
                while ((c = input()) != '*' && c != EOF)
                        ; /* eat up text of comment */

                if (c == '*') {
                        while ((c = input()) == '*')
                                ;
                        if (c == '/')
                                break; /* found the end */
                }

                if (c == EOF) {
                        errx(1, "EOF in comment");
                        break;
                }
        }
}
.Ed
.Pp
(Note that if the scanner is compiled using C++, then
.Fn input
is instead referred to as
.Fn yyinput ,
in order to avoid a name clash with the C++ stream by the name of input.)
.It YY_FLUSH_BUFFER
Flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token,
it will first refill the buffer using
.Dv YY_INPUT
(see
.Sx THE GENERATED SCANNER ,
below).
This action is a special case of the more general
.Fn yy_flush_buffer
function, described below in the section
.Sx MULTIPLE INPUT BUFFERS .
.It yyterminate()
Can be used in lieu of a return statement in an action.
It terminates the scanner and returns a 0 to the scanner's caller, indicating
.Qq all done .
By default,
.Fn yyterminate
is also called when an end-of-file is encountered.
It is a macro and may be redefined.
.El
1193.El
1194.Sh THE GENERATED SCANNER
1195The output of
1196.Nm
1197is the file
1198.Pa lex.yy.c ,
1199which contains the scanning routine
1200.Fn yylex ,
1201a number of tables used by it for matching tokens,
1202and a number of auxiliary routines and macros.
1203By default,
1204.Fn yylex
1205is declared as follows:
1206.Bd -unfilled -offset indent
1207int yylex()
1208{
1209    ... various definitions and the actions in here ...
1210}
1211.Ed
1212.Pp
1213(If the environment supports function prototypes, then it will
1214be "int yylex(void)".)
1215This definition may be changed by defining the
1216.Dv YY_DECL
1217macro.
1218For example:
1219.Bd -literal -offset indent
1220#define YY_DECL float lexscan(a, b) float a, b;
1221.Ed
1222.Pp
1223would give the scanning routine the name
1224.Em lexscan ,
1225returning a float, and taking two floats as arguments.
1226Note that if arguments are given to the scanning routine using a
1227K&R-style/non-prototyped function declaration,
1228the definition must be terminated with a semi-colon
1229.Pq Sq ;\& .
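.Pp
With a compiler that supports function prototypes,
the scanning routine can instead be declared in prototyped form,
in which case no trailing semi-colon is needed.
The following is a minimal sketch of that variant;
it reuses the illustrative name
.Em lexscan
from the example above:
.Bd -literal -offset indent
#define YY_DECL float lexscan(float a, float b)
.Ed
.Pp
Code that calls the routine from another source file would then need
a matching declaration, such as
.Dq extern float lexscan(float, float); .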
1230.Pp
1231Whenever
1232.Fn yylex
1233is called, it scans tokens from the global input file
1234.Pa yyin
1235.Pq which defaults to stdin .
1236It continues until it either reaches an end-of-file
1237.Pq at which point it returns the value 0
1238or one of its actions executes a
1239.Em return
1240statement.
1241.Pp
1242If the scanner reaches an end-of-file, subsequent calls are undefined
1243unless either
1244.Em yyin
1245is pointed at a new input file
1246.Pq in which case scanning continues from that file ,
1247or
1248.Fn yyrestart
1249is called.
1250.Fn yyrestart
1251takes one argument, a
1252.Fa FILE *
1253pointer (which can be nil, if
1254.Dv YY_INPUT
1255has been set up to scan from a source other than
1256.Em yyin ) ,
1257and initializes
1258.Em yyin
1259for scanning from that file.
1260Essentially there is no difference between just assigning
1261.Em yyin
1262to a new input file or using
1263.Fn yyrestart
1264to do so; the latter is available for compatibility with previous versions of
1265.Nm ,
1266and because it can be used to switch input files in the middle of scanning.
1267It can also be used to throw away the current input buffer,
1268by calling it with an argument of
1269.Em yyin ;
but it is better to use
1271.Dv YY_FLUSH_BUFFER
1272.Pq see above .
1273Note that
1274.Fn yyrestart
1275does not reset the start condition to
1276.Em INITIAL
1277(see
1278.Sx START CONDITIONS ,
1279below).
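.Pp
As an illustration of the above, the following sketch of a
.Fn main
routine, placed in the user code section of the scanner,
scans each file named on the command line in turn.
The loop structure and error handling are assumptions made for the example
(it presumes that
.In err.h
has been included and that
.Fn yywrap
returns 1), not something
.Nm
prescribes:
.Bd -literal -offset indent
int
main(int argc, char *argv[])
{
        FILE *fp;
        int i;

        for (i = 1; i < argc; i++) {
                if ((fp = fopen(argv[i], "r")) == NULL)
                        err(1, "%s", argv[i]);
                yyrestart(fp);  /* point yyin at the new file */
                while (yylex() != 0)
                        ;       /* tokens are discarded here */
                fclose(fp);
        }
        return 0;
}
.Ed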
1280.Pp
1281If
1282.Fn yylex
1283stops scanning due to executing a
1284.Em return
1285statement in one of the actions, the scanner may then be called again and it
1286will resume scanning where it left off.
1287.Pp
1288By default
1289.Pq and for purposes of efficiency ,
1290the scanner uses block-reads rather than simple
1291.Xr getc 3
1292calls to read characters from
1293.Em yyin .
1294The nature of how it gets its input can be controlled by defining the
1295.Dv YY_INPUT
1296macro.
1297.Dv YY_INPUT Ns 's
1298calling sequence is
1299.Qq YY_INPUT(buf,result,max_size) .
1300Its action is to place up to
1301.Dv max_size
1302characters in the character array
1303.Em buf
1304and return in the integer variable
1305.Em result
1306either the number of characters read or the constant
1307.Dv YY_NULL
1308(0 on
1309.Ux
1310systems)
1311to indicate
1312.Dv EOF .
1313The default
1314.Dv YY_INPUT
1315reads from the global file-pointer
1316.Qq yyin .
1317.Pp
1318A sample definition of
1319.Dv YY_INPUT
1320.Pq in the definitions section of the input file :
1321.Bd -unfilled -offset indent
1322%{
1323#define YY_INPUT(buf,result,max_size) \e
1324{ \e
1325        int c = getchar(); \e
1326        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1327}
1328%}
1329.Ed
1330.Pp
1331This definition will change the input processing to occur
1332one character at a time.
1333.Pp
1334When the scanner receives an end-of-file indication from
1335.Dv YY_INPUT ,
1336it then checks the
1337.Fn yywrap
1338function.
1339If
1340.Fn yywrap
1341returns false
1342.Pq zero ,
1343then it is assumed that the function has gone ahead and set up
1344.Em yyin
1345to point to another input file, and scanning continues.
1346If it returns true
1347.Pq non-zero ,
1348then the scanner terminates, returning 0 to its caller.
1349Note that in either case, the start condition remains unchanged;
1350it does not revert to
1351.Em INITIAL .
1352.Pp
1353If you do not supply your own version of
1354.Fn yywrap ,
1355then you must either use
1356.Dq %option noyywrap
1357(in which case the scanner behaves as though
1358.Fn yywrap
1359returned 1), or you must link with
1360.Fl lfl
1361to obtain the default version of the routine, which always returns 1.
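.Pp
A user-supplied
.Fn yywrap
might look like the following sketch, which advances through a
NULL-terminated array of file names.
The variable
.Va file_list
is hypothetical and is assumed to be set up by the caller
before scanning starts; for brevity the sketch does not close
the previous input file:
.Bd -literal -offset indent
char **file_list;       /* hypothetical: remaining input files */

int
yywrap(void)
{
        if (file_list == NULL || *file_list == NULL)
                return 1;       /* no more input */
        if ((yyin = fopen(*file_list++, "r")) == NULL)
                return 1;       /* treat an open failure as end of input */
        return 0;               /* yyin now points at the next file */
}
.Ed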
1362.Pp
1363Three routines are available for scanning from in-memory buffers rather
1364than files:
1365.Fn yy_scan_string ,
1366.Fn yy_scan_bytes ,
1367and
1368.Fn yy_scan_buffer .
1369See the discussion of them below in the section
1370.Sx MULTIPLE INPUT BUFFERS .
1371.Pp
1372The scanner writes its
1373.Em ECHO
1374output to the
1375.Em yyout
1376global
1377.Pq default, stdout ,
1378which may be redefined by the user simply by assigning it to some other
1379.Va FILE
1380pointer.
1381.Sh START CONDITIONS
1382.Nm
1383provides a mechanism for conditionally activating rules.
1384Any rule whose pattern is prefixed with
1385.Qq Aq sc
1386will only be active when the scanner is in the start condition named
1387.Qq sc .
1388For example,
1389.Bd -literal -offset indent
1390<STRING>[^"]* { /* eat up the string body ... */
1391        ...
1392}
1393.Ed
1394.Pp
1395will be active only when the scanner is in the
1396.Qq STRING
1397start condition, and
1398.Bd -literal -offset indent
1399<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1400        ...
1401}
1402.Ed
1403.Pp
1404will be active only when the current start condition is either
1405.Qq INITIAL ,
1406.Qq STRING ,
1407or
1408.Qq QUOTE .
1409.Pp
1410Start conditions are declared in the definitions
1411.Pq first
1412section of the input using unindented lines beginning with either
1413.Sq %s
1414or
1415.Sq %x
1416followed by a list of names.
1417The former declares
1418.Em inclusive
1419start conditions, the latter
1420.Em exclusive
1421start conditions.
1422A start condition is activated using the
1423.Em BEGIN
1424action.
1425Until the next
1426.Em BEGIN
1427action is executed, rules with the given start condition will be active and
1428rules with other start conditions will be inactive.
1429If the start condition is inclusive,
1430then rules with no start conditions at all will also be active.
1431If it is exclusive,
1432then only rules qualified with the start condition will be active.
1433A set of rules contingent on the same exclusive start condition
1434describe a scanner which is independent of any of the other rules in the
1435.Nm
1436input.
1437Because of this, exclusive start conditions make it easy to specify
1438.Qq mini-scanners
1439which scan portions of the input that are syntactically different
1440from the rest
1441.Pq e.g., comments .
1442.Pp
1443If the distinction between inclusive and exclusive start conditions
1444is still a little vague, here's a simple example illustrating the
1445connection between the two.
1446The set of rules:
1447.Bd -literal -offset indent
1448%s example
1449%%
1450
1451<example>foo   do_something();
1452
1453bar            something_else();
1454.Ed
1455.Pp
1456is equivalent to
1457.Bd -literal -offset indent
1458%x example
1459%%
1460
1461<example>foo   do_something();
1462
1463<INITIAL,example>bar    something_else();
1464.Ed
1465.Pp
1466Without the
1467.Aq INITIAL,example
1468qualifier, the
1469.Dq bar
1470pattern in the second example wouldn't be active
1471.Pq i.e., couldn't match
1472when in start condition
1473.Dq example .
1474If we just used
1475.Aq example
1476to qualify
1477.Dq bar ,
1478though, then it would only be active in
1479.Dq example
1480and not in
1481.Em INITIAL ,
1482while in the first example it's active in both,
1483because in the first example the
1484.Dq example
1485start condition is an inclusive
1486.Pq Sq %s
1487start condition.
1488.Pp
1489Also note that the special start-condition specifier
1490.Sq Aq *
1491matches every start condition.
1492Thus, the above example could also have been written:
1493.Bd -literal -offset indent
1494%x example
1495%%
1496
1497<example>foo   do_something();
1498
1499<*>bar         something_else();
1500.Ed
1501.Pp
1502The default rule (to
1503.Em ECHO
1504any unmatched character) remains active in start conditions.
1505It is equivalent to:
1506.Bd -literal -offset indent
1507<*>.|\en     ECHO;
1508.Ed
1509.Pp
1510.Dq BEGIN(0)
1511returns to the original state where only the rules with
1512no start conditions are active.
1513This state can also be referred to as the start-condition
1514.Em INITIAL ,
1515so
1516.Dq BEGIN(INITIAL)
1517is equivalent to
1518.Dq BEGIN(0) .
1519(The parentheses around the start condition name are not required but
1520are considered good style.)
1521.Pp
1522.Em BEGIN
1523actions can also be given as indented code at the beginning
1524of the rules section.
1525For example, the following will cause the scanner to enter the
1526.Qq SPECIAL
1527start condition whenever
1528.Fn yylex
1529is called and the global variable
1530.Fa enter_special
1531is true:
.Bd -literal -offset indent
        int enter_special;

%x SPECIAL
%%
        if (enter_special)
                BEGIN(SPECIAL);

<SPECIAL>blahblahblah
\&...more rules follow...
.Ed
1543.Pp
1544To illustrate the uses of start conditions,
1545here is a scanner which provides two different interpretations
1546of a string like
1547.Qq 123.456 .
1548By default it will treat it as three tokens: the integer
1549.Qq 123 ,
1550a dot
1551.Pq Sq .\& ,
1552and the integer
1553.Qq 456 .
1554But if the string is preceded earlier in the line by the string
1555.Qq expect-floats
1556it will treat it as a single token, the floating-point number 123.456:
.Bd -literal -offset indent
%{
#include <math.h>
#include <stdlib.h>     /* atof(), atoi() */
%}
%s expect

%%
expect-floats        BEGIN(expect);

<expect>[0-9]+"."[0-9]+ {
        printf("found a float, = %f\en",
            atof(yytext));
}
<expect>\en {
        /*
         * That's the end of the line, so
         * we need another "expect-floats"
         * before we'll recognize any more
         * floats.
         */
        BEGIN(INITIAL);
}

[0-9]+ {
        printf("found an integer, = %d\en",
            atoi(yytext));
}

"."     printf("found a dot\en");
.Ed
1587.Pp
1588Here is a scanner which recognizes
1589.Pq and discards
1590C comments while maintaining a count of the current input line:
.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
.Ed
1603.Pp
1604This scanner goes to a bit of trouble to match as much
1605text as possible with each rule.
1606In general, when attempting to write a high-speed scanner
1607try to match as much as possible in each rule, as it's a big win.
1608.Pp
1609Note that start-condition names are really integer values and
1610can be stored as such.
1611Thus, the above could be extended in the following fashion:
.Bd -literal -offset indent
%x comment foo
%%
        int line_num = 1;
        int comment_caller;

"/*" {
        comment_caller = INITIAL;
        BEGIN(comment);
}

\&...

<foo>"/*" {
        comment_caller = foo;
        BEGIN(comment);
}

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
.Ed
1635.Pp
1636Furthermore, the current start condition can be accessed by using
1637the integer-valued
1638.Dv YY_START
1639macro.
1640For example, the above assignments to
1641.Em comment_caller
1642could instead be written
1643.Pp
1644.Dl comment_caller = YY_START;
1645.Pp
1646Flex provides
1647.Dv YYSTATE
1648as an alias for
1649.Dv YY_START
1650(since that is what's used by
1651.At
1652.Nm lex ) .
1653.Pp
1654Note that start conditions do not have their own name-space;
1655%s's and %x's declare names in the same fashion as #define's.
1656.Pp
1657Finally, here's an example of how to match C-style quoted strings using
1658exclusive start conditions, including expanded escape sequences
1659(but not including checking for a string that's too long):
.Bd -literal -offset indent
%x str

%%
        #define MAX_STR_CONST 1024
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;

\e"      string_buf_ptr = string_buf; BEGIN(str);

<str>\e" { /* saw closing quote - all done */
        BEGIN(INITIAL);
        *string_buf_ptr = '\e0';
        /*
         * return string constant token type and
         * value to parser
         */
}

<str>\en {
        /* error - unterminated string constant */
        /* generate error message */
}

<str>\e\e[0-7]{1,3} {
        /* octal escape sequence */
        unsigned int result;    /* %o expects an unsigned int */

        (void) sscanf(yytext + 1, "%o", &result);

        if (result > 0xff) {
                /* error, constant is out-of-bounds */
        } else
                *string_buf_ptr++ = result;
}

<str>\e\e[0-9]+ {
        /*
         * generate error - bad escape sequence; something
         * like '\e48' or '\e0777777'
         */
}

<str>\e\en  *string_buf_ptr++ = '\en';
<str>\e\et  *string_buf_ptr++ = '\et';
<str>\e\er  *string_buf_ptr++ = '\er';
<str>\e\eb  *string_buf_ptr++ = '\eb';
<str>\e\ef  *string_buf_ptr++ = '\ef';

<str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];

<str>[^\e\e\en\e"]+ {
        char *yptr = yytext;

        while (*yptr)
                *string_buf_ptr++ = *yptr++;
}
.Ed
1718.Pp
1719Often, such as in some of the examples above,
1720a whole bunch of rules are all preceded by the same start condition(s).
1721.Nm
1722makes this a little easier and cleaner by introducing a notion of
1723start condition
1724.Em scope .
1725A start condition scope is begun with:
1726.Pp
1727.Dl <SCs>{
1728.Pp
1729where
1730.Dq SCs
1731is a list of one or more start conditions.
1732Inside the start condition scope, every rule automatically has the prefix
1733.Aq SCs
1734applied to it, until a
1735.Sq }
1736which matches the initial
1737.Sq { .
1738So, for example,
1739.Bd -literal -offset indent
1740<ESC>{
1741    "\e\en"   return '\en';
1742    "\e\er"   return '\er';
1743    "\e\ef"   return '\ef';
1744    "\e\e0"   return '\e0';
1745}
1746.Ed
1747.Pp
1748is equivalent to:
1749.Bd -literal -offset indent
1750<ESC>"\e\en"  return '\en';
1751<ESC>"\e\er"  return '\er';
1752<ESC>"\e\ef"  return '\ef';
1753<ESC>"\e\e0"  return '\e0';
1754.Ed
1755.Pp
1756Start condition scopes may be nested.
1757.Pp
1758Three routines are available for manipulating stacks of start conditions:
1759.Bl -tag -width Ds
1760.It void yy_push_state(int new_state)
1761Pushes the current start condition onto the top of the start condition
1762stack and switches to
1763.Fa new_state
1764as though
1765.Dq BEGIN new_state
1766had been used
1767.Pq recall that start condition names are also integers .
1768.It void yy_pop_state()
1769Pops the top of the stack and switches to it via
1770.Em BEGIN .
1771.It int yy_top_state()
1772Returns the top of the stack without altering the stack's contents.
1773.El
1774.Pp
1775The start condition stack grows dynamically and so has no built-in
1776size limitation.
1777If memory is exhausted, program execution aborts.
1778.Pp
1779To use start condition stacks, scanners must include a
1780.Dq %option stack
1781directive (see
1782.Sx OPTIONS
1783below).
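.Pp
As a sketch of how the stack routines fit together,
the following fragment discards comments that are allowed to nest,
pushing the current start condition each time a comment opens
and popping it when the comment closes.
The nesting syntax and the start condition name
.Dq comment
are assumptions chosen for the example:
.Bd -literal -offset indent
%option stack
%x comment
%%
"/*"            yy_push_state(comment);

<comment>{
    "/*"        yy_push_state(comment);
    "*/"        yy_pop_state();
    [^*/\en]+    /* eat the comment body */
    [*/]        /* eat a lone '*' or '/' */
    \en          /* eat newlines */
}
.Ed
.Pp
Because the longest match wins, the two-character patterns take
precedence over the single-character rule whenever both could match.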
1784.Sh MULTIPLE INPUT BUFFERS
1785Some scanners
1786(such as those which support
1787.Qq include
1788files)
1789require reading from several input streams.
1790As
1791.Nm
1792scanners do a large amount of buffering, one cannot control
1793where the next input will be read from by simply writing a
1794.Dv YY_INPUT
1795which is sensitive to the scanning context.
1796.Dv YY_INPUT
1797is only called when the scanner reaches the end of its buffer, which
1798may be a long time after scanning a statement such as an
1799.Qq include
1800which requires switching the input source.
1801.Pp
1802To negotiate these sorts of problems,
1803.Nm
1804provides a mechanism for creating and switching between multiple
1805input buffers.
1806An input buffer is created by using:
1807.Pp
1808.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1809.Pp
1810which takes a
1811.Fa FILE
1812pointer and a
1813.Fa size
1814and creates a buffer associated with the given file and large enough to hold
1815.Fa size
1816characters (when in doubt, use
1817.Dv YY_BUF_SIZE
1818for the size).
1819It returns a
1820.Dv YY_BUFFER_STATE
1821handle, which may then be passed to other routines
1822.Pq see below .
1823The
1824.Dv YY_BUFFER_STATE
1825type is a pointer to an opaque
1826.Dq struct yy_buffer_state
1827structure, so
1828.Dv YY_BUFFER_STATE
1829variables may be safely initialized to
1830.Dq ((YY_BUFFER_STATE) 0)
1831if desired, and the opaque structure can also be referred to in order to
1832correctly declare input buffers in source files other than that of scanners.
1833Note that the
1834.Fa FILE
1835pointer in the call to
1836.Fn yy_create_buffer
1837is only used as the value of
1838.Fa yyin
1839seen by
1840.Dv YY_INPUT ;
1841if
1842.Dv YY_INPUT
1843is redefined so that it no longer uses
1844.Fa yyin ,
1845then a nil
1846.Fa FILE
1847pointer can safely be passed to
1848.Fn yy_create_buffer .
1849To select a particular buffer to scan:
1850.Pp
1851.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1852.Pp
1853It switches the scanner's input buffer so subsequent tokens will
1854come from
1855.Fa new_buffer .
1856Note that
1857.Fn yy_switch_to_buffer
1858may be used by
1859.Fn yywrap
1860to set things up for continued scanning,
1861instead of opening a new file and pointing
1862.Fa yyin
1863at it.
1864Note also that switching input sources via either
1865.Fn yy_switch_to_buffer
1866or
1867.Fn yywrap
1868does not change the start condition.
1869.Pp
1870.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1871.Pp
1872is used to reclaim the storage associated with a buffer.
1873.Pf ( Fa buffer
1874can be nil, in which case the routine does nothing.)
1875To clear the current contents of a buffer:
1876.Pp
1877.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1878.Pp
1879This function discards the buffer's contents,
1880so the next time the scanner attempts to match a token from the buffer,
1881it will first fill the buffer anew using
1882.Dv YY_INPUT .
1883.Pp
1884.Fn yy_new_buffer
1885is an alias for
1886.Fn yy_create_buffer ,
1887provided for compatibility with the C++ use of
1888.Em new
1889and
1890.Em delete
1891for creating and destroying dynamic objects.
1892.Pp
1893Finally, the
1894.Dv YY_CURRENT_BUFFER
1895macro returns a
1896.Dv YY_BUFFER_STATE
1897handle to the current buffer.
1898.Pp
1899Here is an example of using these features for writing a scanner
1900which expands include files (the
1901.Aq Aq EOF
1902feature is discussed below):
1903.Bd -literal -offset indent
1904/*
1905 * the "incl" state is used for picking up the name
1906 * of an include file
1907 */
1908%x incl
1909
1910%{
1911#define MAX_INCLUDE_DEPTH 10
1912YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1913int include_stack_ptr = 0;
1914%}
1915
1916%%
1917include             BEGIN(incl);
1918
1919[a-z]+              ECHO;
1920[^a-z\en]*\en?        ECHO;
1921
1922<incl>[ \et]*        /* eat the whitespace */
1923<incl>[^ \et\en]+ {   /* got the include file name */
1924        if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1925                errx(1, "Includes nested too deeply");
1926
1927        include_stack[include_stack_ptr++] =
1928            YY_CURRENT_BUFFER;
1929
1930        yyin = fopen(yytext, "r");
1931
1932        if (yyin == NULL)
1933                err(1, NULL);
1934
1935        yy_switch_to_buffer(
1936            yy_create_buffer(yyin, YY_BUF_SIZE));
1937
1938        BEGIN(INITIAL);
1939}
1940
1941<<EOF>> {
1942        if (--include_stack_ptr < 0)
1943                yyterminate();
1944        else {
1945                yy_delete_buffer(YY_CURRENT_BUFFER);
1946                yy_switch_to_buffer(
1947                    include_stack[include_stack_ptr]);
1948       }
1949}
1950.Ed
1951.Pp
1952Three routines are available for setting up input buffers for
1953scanning in-memory strings instead of files.
1954All of them create a new input buffer for scanning the string,
1955and return a corresponding
1956.Dv YY_BUFFER_STATE
1957handle (which should be deleted afterwards using
1958.Fn yy_delete_buffer ) .
1959They also switch to the new buffer using
1960.Fn yy_switch_to_buffer ,
1961so the next call to
1962.Fn yylex
1963will start scanning the string.
1964.Bl -tag -width Ds
1965.It yy_scan_string(const char *str)
1966Scans a NUL-terminated string.
1967.It yy_scan_bytes(const char *bytes, int len)
1968Scans
1969.Fa len
1970bytes
1971.Pq including possibly NUL's
1972starting at location
1973.Fa bytes .
1974.El
1975.Pp
1976Note that both of these functions create and scan a copy
1977of the string or bytes.
1978(This may be desirable, since
1979.Fn yylex
1980modifies the contents of the buffer it is scanning.)
1981The copy can be avoided by using:
1982.Bl -tag -width Ds
1983.It yy_scan_buffer(char *base, yy_size_t size)
1984Which scans the buffer starting at
1985.Fa base ,
1986consisting of
1987.Fa size
1988bytes, the last two bytes of which must be
1989.Dv YY_END_OF_BUFFER_CHAR
1990.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-3], inclusive.
1993.Pp
1994If
1995.Fa base
1996is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
2000.Fn yy_scan_buffer
2001returns a nil pointer instead of creating a new input buffer.
2002.Pp
The type
.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
2007.El
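.Pp
For illustration, a routine in the user code section might scan
an in-memory string as follows.
This is a minimal sketch; the function name
.Fn scan_text
is hypothetical:
.Bd -literal -offset indent
void
scan_text(const char *text)
{
        YY_BUFFER_STATE buf;

        buf = yy_scan_string(text);     /* copies text, switches to it */
        while (yylex() != 0)
                ;                       /* process tokens here */
        yy_delete_buffer(buf);          /* release the copy */
}
.Ed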
2008.Sh END-OF-FILE RULES
2009The special rule
2010.Qq Aq Aq EOF
2011indicates actions which are to be taken when an end-of-file is encountered and
2012.Fn yywrap
2013returns non-zero
2014.Pq i.e., indicates no further files to process .
2015The action must finish by doing one of four things:
2016.Bl -dash
2017.It
2018Assigning
2019.Em yyin
2020to a new input file
2021(in previous versions of
2022.Nm ,
2023after doing the assignment, it was necessary to call the special action
2024.Dv YY_NEW_FILE ;
2025this is no longer necessary).
2026.It
2027Executing a
2028.Em return
2029statement.
2030.It
2031Executing the special
2032.Fn yyterminate
2033action.
2034.It
2035Switching to a new buffer using
2036.Fn yy_switch_to_buffer
2037as shown in the example above.
2038.El
2039.Pp
2040.Aq Aq EOF
2041rules may not be used with other patterns;
2042they may only be qualified with a list of start conditions.
2043If an unqualified
2044.Aq Aq EOF
2045rule is given, it applies to all start conditions which do not already have
2046.Aq Aq EOF
2047actions.
2048To specify an
2049.Aq Aq EOF
2050rule for only the initial start condition, use
2051.Pp
2052.Dl <INITIAL><<EOF>>
2053.Pp
2054These rules are useful for catching things like unclosed comments.
2055An example:
2056.Bd -literal -offset indent
2057%x quote
2058%%
2059
2060\&...other rules for dealing with quotes...
2061
2062<quote><<EOF>> {
2063         error("unterminated quote");
2064         yyterminate();
2065}
2066<<EOF>> {
2067         if (*++filelist)
2068                 yyin = fopen(*filelist, "r");
2069         else
2070                 yyterminate();
2071}
2072.Ed
2073.Sh MISCELLANEOUS MACROS
2074The macro
2075.Dv YY_USER_ACTION
2076can be defined to provide an action
2077which is always executed prior to the matched rule's action.
2078For example,
2079it could be #define'd to call a routine to convert yytext to lower-case.
2080When
2081.Dv YY_USER_ACTION
2082is invoked, the variable
2083.Fa yy_act
2084gives the number of the matched rule
2085.Pq rules are numbered starting with 1 .
2086For example, to profile how often each rule is matched,
2087the following would do the trick:
2088.Pp
2089.Dl #define YY_USER_ACTION ++ctr[yy_act]
2090.Pp
2091where
2092.Fa ctr
2093is an array to hold the counts for the different rules.
2094Note that the macro
2095.Dv YY_NUM_RULES
2096gives the total number of rules
2097(including the default rule, even if
2098.Fl s
2099is used),
2100so a correct declaration for
2101.Fa ctr
2102is:
2103.Pp
2104.Dl int ctr[YY_NUM_RULES];
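.Pp
Another possible use, shown here only as a sketch, is tracking the
column at which the next token starts.
The variable
.Va cur_column
is hypothetical and must be declared by the user:
.Bd -literal -offset indent
%{
static int cur_column = 1;      /* hypothetical column counter */
#define YY_USER_ACTION cur_column += yyleng;
%}
.Ed
.Pp
A rule matching a newline would then reset
.Va cur_column
to 1 in its action, which runs after
.Dv YY_USER_ACTION .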
2105.Pp
2106The macro
2107.Dv YY_USER_INIT
2108may be defined to provide an action which is always executed before
2109the first scan
2110.Pq and before the scanner's internal initializations are done .
2111For example, it could be used to call a routine to read
2112in a data table or open a logging file.
2113.Pp
2114The macro
2115.Dv yy_set_interactive(is_interactive)
2116can be used to control whether the current buffer is considered
2117.Em interactive .
2118An interactive buffer is processed more slowly,
2119but must be used when the scanner's input source is indeed
2120interactive to avoid problems due to waiting to fill buffers
2121(see the discussion of the
2122.Fl I
2123flag below).
2124A non-zero value in the macro invocation marks the buffer as interactive,
2125a zero value as non-interactive.
2126Note that use of this macro overrides
2127.Dq %option always-interactive
2128or
2129.Dq %option never-interactive
2130(see
2131.Sx OPTIONS
2132below).
2133.Fn yy_set_interactive
2134must be invoked prior to beginning to scan the buffer that is
2135.Pq or is not
2136to be considered interactive.
2137.Pp
2138The macro
2139.Dv yy_set_bol(at_bol)
2140can be used to control whether the current buffer's scanning
2141context for the next token match is done as though at the
2142beginning of a line.
2143A non-zero macro argument makes rules anchored with
2144.Sq ^
2145active, while a zero argument makes
2146.Sq ^
2147rules inactive.
2148.Pp
2149The macro
2150.Dv YY_AT_BOL
2151returns true if the next token scanned from the current buffer will have
2152.Sq ^
2153rules active, false otherwise.
2154.Pp
2155In the generated scanner, the actions are all gathered in one large
2156switch statement and separated using
2157.Dv YY_BREAK ,
2158which may be redefined.
2159By default, it is simply a
2160.Qq break ,
2161to separate each rule's action from the following rules.
2162Redefining
2163.Dv YY_BREAK
2164allows, for example, C++ users to
2165.Dq #define YY_BREAK
2166to do nothing
2167(while being very careful that every rule ends with a
2168.Qq break
2169or a
2170.Qq return ! )
to avoid unreachable statement warnings, which arise when a rule's
action ends with
.Dq return
and the following
.Dv YY_BREAK
is therefore inaccessible.
2177.Sh VALUES AVAILABLE TO THE USER
2178This section summarizes the various values available to the user
2179in the rule actions.
2180.Bl -tag -width Ds
2181.It char *yytext
2182Holds the text of the current token.
2183It may be modified but not lengthened
2184.Pq characters cannot be appended to the end .
2185.Pp
2186If the special directive
2187.Dq %array
2188appears in the first section of the scanner description, then
2189.Fa yytext
2190is instead declared
2191.Dq char yytext[YYLMAX] ,
2192where
2193.Dv YYLMAX
2194is a macro definition that can be redefined in the first section
2195to change the default value
2196.Pq generally 8KB .
2197Using
2198.Dq %array
2199results in somewhat slower scanners, but the value of
2200.Fa yytext
2201becomes immune to calls to
2202.Fn input
2203and
2204.Fn unput ,
2205which potentially destroy its value when
2206.Fa yytext
2207is a character pointer.
2208The opposite of
2209.Dq %array
2210is
2211.Dq %pointer ,
2212which is the default.
2213.Pp
2214.Dq %array
2215cannot be used when generating C++ scanner classes
2216(the
2217.Fl +
2218flag).
2219.It int yyleng
2220Holds the length of the current token.
2221.It FILE *yyin
2222Is the file which by default
2223.Nm
2224reads from.
2225It may be redefined, but doing so only makes sense before
2226scanning begins or after an
2227.Dv EOF
2228has been encountered.
2229Changing it in the midst of scanning will have unexpected results since
2230.Nm
2231buffers its input; use
2232.Fn yyrestart
2233instead.
2234Once scanning terminates because an end-of-file
2235has been seen,
2236.Fa yyin
2237can be assigned as the new input file
2238and the scanner can be called again to continue scanning.
2239.It void yyrestart(FILE *new_file)
2240May be called to point
2241.Fa yyin
2242at the new input file.
2243The switch-over to the new file is immediate
2244.Pq any previously buffered-up input is lost .
2245Note that calling
2246.Fn yyrestart
2247with
2248.Fa yyin
2249as an argument thus throws away the current input buffer and continues
2250scanning the same input file.
2251.It FILE *yyout
2252Is the file to which
2253.Em ECHO
2254actions are done.
2255It can be reassigned by the user.
2256.It YY_CURRENT_BUFFER
2257Returns a
2258.Dv YY_BUFFER_STATE
2259handle to the current buffer.
2260.It YY_START
2261Returns an integer value corresponding to the current start condition.
2262This value can subsequently be used with
2263.Em BEGIN
2264to return to that start condition.
2265.El
2266.Sh INTERFACING WITH YACC
2267One of the main uses of
2268.Nm
2269is as a companion to the
2270.Xr yacc 1
2271parser-generator.
2272yacc parsers expect to call a routine named
2273.Fn yylex
2274to find the next input token.
2275The routine is supposed to return the type of the next token
2276as well as putting any associated value in the global
2277.Fa yylval ,
2278which is defined externally,
2279and can be a union or any other complex data structure.
2280To use
2281.Nm
2282with yacc, one specifies the
2283.Fl d
2284option to yacc to instruct it to generate the file
2285.Pa y.tab.h
2286containing definitions of all the
2287.Dq %tokens
2288appearing in the yacc input.
2289This file is then included in the
2290.Nm
2291scanner.
2292For example, if one of the tokens is
2293.Qq TOK_NUMBER ,
2294part of the scanner might look like:
2295.Bd -literal -offset indent
2296%{
2297#include "y.tab.h"
2298%}
2299
2300%%
2301
2302[0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
2303.Ed
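.Pp
Assuming the grammar is kept in
.Pa parse.y
and the scanner in
.Pa scan.l
(both file names are hypothetical),
the two might be built together along these lines:
.Bd -literal -offset indent
$ yacc -d parse.y     # writes y.tab.c and y.tab.h
$ flex scan.l         # writes lex.yy.c
$ cc -o parser y.tab.c lex.yy.c -ly -lfl
.Ed
.Pp
Here
.Fl ly
and
.Fl lfl
are assumed to supply default versions of
.Fn main ,
.Fn yyerror ,
and
.Fn yywrap ;
they can be omitted if the program defines its own.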
2304.Sh OPTIONS
2305.Nm
2306has the following options:
2307.Bl -tag -width Ds
2308.It Fl 7
2309Instructs
2310.Nm
2311to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2312characters in its input.
2313The advantage of using
2314.Fl 7
2315is that the scanner's tables can be up to half the size of those generated
2316using the
2317.Fl 8
2318option
2319.Pq see below .
2320The disadvantage is that such scanners often hang
2321or crash if their input contains an 8-bit character.
2322.Pp
2323Note, however, that unless generating a scanner using the
2324.Fl Cf
2325or
2326.Fl CF
2327table compression options, use of
2328.Fl 7
2329will save only a small amount of table space,
2330and make the scanner considerably less portable.
2331.Nm flex Ns 's
2332default behavior is to generate an 8-bit scanner unless
2333.Fl Cf
2334or
2335.Fl CF
2336is specified, in which case
2337.Nm
2338defaults to generating 7-bit scanners unless it was
2339configured to generate 8-bit scanners
2340(as will often be the case with non-USA sites).
It is possible to tell whether
2342.Nm
2343generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2344.Fl v
2345output as described below.
2346.Pp
2347Note that if
2348.Fl Cfe
2349or
2350.Fl CFe
2351are used
2352(the table compression options, but also using equivalence classes as
2353discussed below),
2354.Nm
2355still defaults to generating an 8-bit scanner,
2356since usually with these compression options full 8-bit tables
2357are not much more expensive than 7-bit tables.
2358.It Fl 8
2359Instructs
2360.Nm
2361to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2362characters.
2363This flag is only needed for scanners generated using
2364.Fl Cf
2365or
2366.Fl CF ,
2367as otherwise
2368.Nm
2369defaults to generating an 8-bit scanner anyway.
2370.Pp
2371See the discussion of
2372.Fl 7
2373above for
2374.Nm flex Ns 's
2375default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2376.It Fl B
2377Instructs
2378.Nm
2379to generate a
2380.Em batch
2381scanner, the opposite of
2382.Em interactive
2383scanners generated by
2384.Fl I
2385.Pq see below .
2386In general,
2387.Fl B
2388is used when the scanner will never be used interactively,
2389and you want to squeeze a little more performance out of it.
2390If the aim is instead to squeeze out a lot more performance,
2391use the
2392.Fl Cf
2393or
2394.Fl CF
2395options
2396.Pq discussed below ,
2397which turn on
2398.Fl B
2399automatically anyway.
2400.It Fl b
2401Generate backing-up information to
2402.Pa lex.backup .
2403This is a list of scanner states which require backing up
2404and the input characters on which they do so.
2405By adding rules one can remove backing-up states.
2406If all backing-up states are eliminated and
2407.Fl Cf
2408or
2409.Fl CF
2410is used, the generated scanner will run faster (see the
2411.Fl p
2412flag).
2413Only users who wish to squeeze every last cycle out of their
2414scanners need worry about this option.
2415(See the section on
2416.Sx PERFORMANCE CONSIDERATIONS
2417below.)
2418.It Fl C Ns Op Cm aeFfmr
2419Controls the degree of table compression and, more generally, trade-offs
2420between small scanners and fast scanners.
2421.Bl -tag -width Ds
2422.It Fl Ca
2423Instructs
2424.Nm
2425to trade off larger tables in the generated scanner for faster performance
2426because the elements of the tables are better aligned for memory access
2427and computation.
2428On some
2429.Tn RISC
2430architectures, fetching and manipulating longwords is more efficient
2431than with smaller-sized units such as shortwords.
2432This option can double the size of the tables used by the scanner.
2433.It Fl Ce
2434Directs
2435.Nm
2436to construct
2437.Em equivalence classes ,
2438i.e., sets of characters which have identical lexical properties
2439(for example, if the only appearance of digits in the
2440.Nm
2441input is in the character class
2442.Qq [0-9]
2443then the digits
2444.Sq 0 ,
2445.Sq 1 ,
2446.Sq ... ,
2447.Sq 9
2448will all be put in the same equivalence class).
2449Equivalence classes usually give dramatic reductions in the final
2450table/object file sizes
2451.Pq typically a factor of 2\-5
2452and are pretty cheap performance-wise
2453.Pq one array look-up per character scanned .
2454.It Fl CF
2455Specifies that the alternate fast scanner representation
2456(described below under the
2457.Fl F
2458option)
2459should be used.
2460This option cannot be used with
2461.Fl + .
2462.It Fl Cf
2463Specifies that the
2464.Em full
2465scanner tables should be generated \-
2466.Nm
2467should not compress the tables by taking advantage of
2468similar transition functions for different states.
2469.It Fl \&Cm
2470Directs
2471.Nm
2472to construct
2473.Em meta-equivalence classes ,
2474which are sets of equivalence classes
2475(or characters, if equivalence classes are not being used)
2476that are commonly used together.
2477Meta-equivalence classes are often a big win when using compressed tables,
2478but they have a moderate performance impact
2479(one or two
2480.Qq if
2481tests and one array look-up per character scanned).
2482.It Fl Cr
2483Causes the generated scanner to
2484.Em bypass
2485use of the standard I/O library
2486.Pq stdio
2487for input.
2488Instead of calling
2489.Xr fread 3
2490or
2491.Xr getc 3 ,
2492the scanner will use the
2493.Xr read 2
2494system call,
2495resulting in a performance gain which varies from system to system,
2496but in general is probably negligible unless
2497.Fl Cf
2498or
2499.Fl CF
2500are being used.
2501Using
2502.Fl Cr
can cause strange behavior if, for example, the program reads from
2504.Fa yyin
2505using stdio prior to calling the scanner
2506(because the scanner will miss whatever text previous reads left
2507in the stdio input buffer).
2508.Pp
2509.Fl Cr
2510has no effect if
2511.Dv YY_INPUT
2512is defined
2513(see
2514.Sx THE GENERATED SCANNER
2515above).
2516.El
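.Pp
The stdio hazard described under
.Fl Cr
can be reproduced outside of
.Nm
with a short C sketch.
The helper name
.Fn bytes_left_for_read
is hypothetical, and the exact count depends on the C library's buffer
size, so no particular result is guaranteed:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Writes text to a temporary file, consumes ONE character through stdio,
 * then reads from the underlying descriptor with read(2), the way a
 * scanner built with -Cr would.
 * Returns the number of bytes read(2) still sees, or -1 on error.
 */
long
bytes_left_for_read(const char *text)
{
	char buf[256];
	FILE *fp = tmpfile();
	long n;

	if (fp == NULL)
		return -1;
	if (fputs(text, fp) == EOF) {
		fclose(fp);
		return -1;
	}
	rewind(fp);
	(void)fgetc(fp);	/* stdio fills its buffer, usually well past one byte */
	n = (long)read(fileno(fp), buf, sizeof(buf));
	fclose(fp);
	return n;
}
```

For a short string the function typically returns 0: the single
.Xr fgetc 3
call pulled the whole file into the stdio buffer, leaving nothing for
.Xr read 2
to see.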
2517.Pp
2518A lone
2519.Fl C
2520specifies that the scanner tables should be compressed but neither
2521equivalence classes nor meta-equivalence classes should be used.
2522.Pp
2523The options
2524.Fl Cf
2525or
2526.Fl CF
2527and
2528.Fl \&Cm
2529do not make sense together \- there is no opportunity for meta-equivalence
2530classes if the table is not being compressed.
2531Otherwise the options may be freely mixed, and are cumulative.
2532.Pp
2533The default setting is
2534.Fl Cem
2535which specifies that
2536.Nm
2537should generate equivalence classes and meta-equivalence classes.
2538This setting provides the highest degree of table compression.
2539It is possible to trade off faster-executing scanners at the cost of
2540larger tables with the following generally being true:
2541.Bd -unfilled -offset indent
2542slowest & smallest
2543      -Cem
2544      -Cm
2545      -Ce
2546      -C
2547      -C{f,F}e
2548      -C{f,F}
2549      -C{f,F}a
2550fastest & largest
2551.Ed
2552.Pp
2553Note that scanners with the smallest tables are usually generated and
2554compiled the quickest,
so during development the default, maximal compression,
is usually best.
2557.Pp
2558.Fl Cfe
2559is often a good compromise between speed and size for production scanners.
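.Pp
As a rough picture of what the equivalence classes described under
.Fl Ce
buy, the C sketch below groups characters that belong to exactly the same
character sets.
It is a simplified model, not the algorithm
.Nm
itself uses; the function name and the limit of 32 sets are assumptions
made for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Two characters fall into the same class when they are members of
 * exactly the same character sets.  Fills ec[c] with a zero-based class
 * number for every character code and returns the number of classes.
 * Only the first 32 sets are considered.
 */
int
build_equivalence_classes(const char *const sets[], size_t nsets,
    unsigned char ec[256])
{
	unsigned long sig[256];		/* membership bitmask per character */
	unsigned long seen[256];	/* distinct bitmasks found so far */
	int nclasses = 0;

	for (int c = 0; c < 256; c++) {
		sig[c] = 0;
		for (size_t i = 0; i < nsets && i < 32; i++)
			if (c != 0 && strchr(sets[i], c) != NULL)
				sig[c] |= 1UL << i;
	}
	for (int c = 0; c < 256; c++) {
		int k;

		for (k = 0; k < nclasses; k++)
			if (seen[k] == sig[c])
				break;
		if (k == nclasses)
			seen[nclasses++] = sig[c];
		ec[c] = (unsigned char)k;
	}
	return nclasses;
}
```

Given the two sets of digits and lower-case letters, this model yields
three classes (digits, letters, everything else), so a transition table
indexed by class needs 3 columns per state rather than 256.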
2560.It Fl d
2561Makes the generated scanner run in debug mode.
2562Whenever a pattern is recognized and the global
2563.Fa yy_flex_debug
2564is non-zero
2565.Pq which is the default ,
2566the scanner will write to stderr a line of the form:
2567.Pp
2568.D1 --accepting rule at line 53 ("the matched text")
2569.Pp
2570The line number refers to the location of the rule in the file
2571defining the scanner
2572(i.e., the file that was fed to
2573.Nm ) .
2574Messages are also generated when the scanner backs up,
2575accepts the default rule,
2576reaches the end of its input buffer
2577(or encounters a NUL;
2578at this point, the two look the same as far as the scanner's concerned),
2579or reaches an end-of-file.
2580.It Fl F
2581Specifies that the fast scanner table representation should be used
2582.Pq and stdio bypassed .
2583This representation is about as fast as the full table representation
2584.Pq Fl f ,
2585and for some sets of patterns will be considerably smaller
2586.Pq and for others, larger .
2587In general, if the pattern set contains both
2588.Qq keywords
2589and a catch-all,
2590.Qq identifier
2591rule, such as in the set:
2592.Bd -unfilled -offset indent
"case"    return TOK_CASE;
"switch"  return TOK_SWITCH;
\&...
"default" return TOK_DEFAULT;
[a-z]+    return TOK_ID;
2598.Ed
2599.Pp
2600then it's better to use the full table representation.
2601If only the
2602.Qq identifier
2603rule is present and a hash table or some such is used to detect the keywords,
2604it's better to use
2605.Fl F .
2606.Pp
2607This option is equivalent to
2608.Fl CFr
2609.Pq see above .
2610It cannot be used with
2611.Fl + .
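.Pp
When only the
.Qq identifier
rule is kept, its action has to tell keywords from ordinary identifiers
itself.
The C sketch below does so with a sorted table and
.Xr bsearch 3
rather than a hash table; the token values and the name
.Fn classify_word
are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would take these from y.tab.h. */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
	const char *name;
	int token;
};

/* Must stay sorted by name, because bsearch(3) requires sorted input. */
static const struct keyword keywords[] = {
	{ "case", TOK_CASE },
	{ "default", TOK_DEFAULT },
	{ "switch", TOK_SWITCH },
};

static int
keyword_cmp(const void *key, const void *elem)
{
	return strcmp((const char *)key,
	    ((const struct keyword *)elem)->name);
}

/* Intended to be called from the action of the single [a-z]+ rule. */
int
classify_word(const char *text)
{
	const struct keyword *kw = bsearch(text, keywords,
	    sizeof(keywords) / sizeof(keywords[0]), sizeof(keywords[0]),
	    keyword_cmp);

	return kw != NULL ? kw->token : TOK_ID;
}
```

The rule's action would then be something like
.Qq return classify_word(yytext); .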
2612.It Fl f
2613Specifies
2614.Em fast scanner .
2615No table compression is done and stdio is bypassed.
2616The result is large but fast.
2617This option is equivalent to
2618.Fl Cfr
2619.Pq see above .
2620.It Fl h
2621Generates a help summary of
2622.Nm flex Ns 's
2623options to stdout and then exits.
2624.Fl ?\&
2625and
2626.Fl Fl help
2627are synonyms for
2628.Fl h .
2629.It Fl I
2630Instructs
2631.Nm
2632to generate an
2633.Em interactive
2634scanner.
2635An interactive scanner is one that only looks ahead to decide
2636what token has been matched if it absolutely must.
2637It turns out that always looking one extra character ahead,
2638even if the scanner has already seen enough text
2639to disambiguate the current token, is a bit faster than
2640only looking ahead when necessary.
2641But scanners that always look ahead give dreadful interactive performance;
2642for example, when a user types a newline,
2643it is not recognized as a newline token until they enter
2644.Em another
2645token, which often means typing in another whole line.
2646.Pp
2647.Nm
2648scanners default to
2649.Em interactive
2650unless
2651.Fl Cf
2652or
2653.Fl CF
2654table-compression options are specified
2655.Pq see above .
2656That's because if high-performance is most important,
2657one of these options should be used,
2658so if they weren't,
2659.Nm
2660assumes it is preferable to trade off a bit of run-time performance for
2661intuitive interactive behavior.
2662Note also that
2663.Fl I
2664cannot be used in conjunction with
2665.Fl Cf
2666or
2667.Fl CF .
2668Thus, this option is not really needed; it is on by default for all those
2669cases in which it is allowed.
2670.Pp
2671A scanner can be forced to not be interactive by using
2672.Fl B
2673.Pq see above .
2674.It Fl i
2675Instructs
2676.Nm
2677to generate a case-insensitive scanner.
2678The case of letters given in the
2679.Nm
2680input patterns will be ignored,
2681and tokens in the input will be matched regardless of case.
2682The matched text given in
2683.Fa yytext
will have its case preserved
2685.Pq i.e., it will not be folded .
2686.It Fl L
2687Instructs
2688.Nm
2689not to generate
2690.Dq #line
2691directives.
2692Without this option,
2693.Nm
2694peppers the generated scanner with #line directives so error messages
2695in the actions will be correctly located with respect to either the original
2696.Nm
2697input file
2698(if the errors are due to code in the input file),
2699or
2700.Pa lex.yy.c
2701(if the errors are
2702.Nm flex Ns 's
2703fault \- these sorts of errors should be reported to the email address
2704given below).
2705.It Fl l
2706Turns on maximum compatibility with the original
2707.At
2708.Nm lex
2709implementation.
2710Note that this does not mean full compatibility.
2711Use of this option costs a considerable amount of performance,
2712and it cannot be used with the
2713.Fl + , f , F , Cf ,
2714or
2715.Fl CF
2716options.
2717For details on the compatibilities it provides, see the section
2718.Sx INCOMPATIBILITIES WITH LEX AND POSIX
2719below.
2720This option also results in the name
2721.Dv YY_FLEX_LEX_COMPAT
2722being #define'd in the generated scanner.
2723.It Fl n
A do-nothing, deprecated option included only for
2725.Tn POSIX
2726compliance.
2727.It Fl o Ns Ar output
2728Directs
2729.Nm
2730to write the scanner to the file
2731.Ar output
2732instead of
2733.Pa lex.yy.c .
2734If
2735.Fl o
2736is combined with the
2737.Fl t
2738option, then the scanner is written to stdout but its
2739.Dq #line
2740directives
2741(see the
2742.Fl L
2743option above)
2744refer to the file
2745.Ar output .
2746.It Fl P Ns Ar prefix
2747Changes the default
2748.Qq yy
2749prefix used by
2750.Nm
2751for all globally visible variable and function names to instead be
2752.Ar prefix .
2753For example,
2754.Fl P Ns Ar foo
2755changes the name of
2756.Fa yytext
2757to
2758.Fa footext .
2759It also changes the name of the default output file from
2760.Pa lex.yy.c
2761to
2762.Pa lex.foo.c .
2763Here are all of the names affected:
2764.Bd -unfilled -offset indent
2765yy_create_buffer
2766yy_delete_buffer
2767yy_flex_debug
2768yy_init_buffer
2769yy_flush_buffer
2770yy_load_buffer_state
2771yy_switch_to_buffer
2772yyin
2773yyleng
2774yylex
2775yylineno
2776yyout
2777yyrestart
2778yytext
2779yywrap
2780.Ed
2781.Pp
2782(If using a C++ scanner, then only
2783.Fa yywrap
2784and
2785.Fa yyFlexLexer
2786are affected.)
2787Within the scanner itself, it is still possible to refer to the global variables
2788and functions using either version of their name; but externally, they
2789have the modified name.
2790.Pp
2791This option allows multiple
2792.Nm
2793programs to be easily linked together into the same executable.
2794Note, though, that using this option also renames
2795.Fn yywrap ,
2796so now either an
2797.Pq appropriately named
2798version of the routine for the scanner must be supplied, or
2799.Dq %option noyywrap
2800must be used, as linking with
2801.Fl lfl
2802no longer provides one by default.
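.Pp
For instance, a program linking two scanners generated with the
hypothetical prefixes
.Qq alpha
and
.Qq beta
could supply the renamed wrap routines itself.
This is a minimal sketch that simply reports
.Qq no more input
for both scanners:

```c
/*
 * Hypothetical wrap routines for two scanners generated with
 * "flex -Palpha" and "flex -Pbeta" and linked into one program.
 * Returning 1 tells the scanner that there is no further input.
 */
int
alphawrap(void)
{
	return 1;
}

int
betawrap(void)
{
	return 1;
}
```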
2803.It Fl p
2804Generates a performance report to stderr.
2805The report consists of comments regarding features of the
2806.Nm
2807input file which will cause a serious loss of performance in the resulting
2808scanner.
2809If the flag is specified twice,
2810comments regarding features that lead to minor performance losses
will also be reported.
2812.Pp
2813Note that the use of
2814.Em REJECT ,
2815.Dq %option yylineno ,
2816and variable trailing context
2817(see the
2818.Sx BUGS
2819section below)
2820entails a substantial performance penalty; use of
2821.Fn yymore ,
2822the
2823.Sq ^
2824operator, and the
2825.Fl I
2826flag entail minor performance penalties.
2827.It Fl S Ns Ar skeleton
2828Overrides the default skeleton file from which
2829.Nm
2830constructs its scanners.
2831This option is needed only for
2832.Nm
2833maintenance or development.
2834.It Fl s
2835Causes the default rule
2836.Pq that unmatched scanner input is echoed to stdout
2837to be suppressed.
2838If the scanner encounters input that does not
2839match any of its rules, it aborts with an error.
2840This option is useful for finding holes in a scanner's rule set.
2841.It Fl T
2842Makes
2843.Nm
2844run in
2845.Em trace
2846mode.
2847It will generate a lot of messages to stderr concerning
2848the form of the input and the resultant non-deterministic and deterministic
2849finite automata.
2850This option is mostly for use in maintaining
2851.Nm .
2852.It Fl t
2853Instructs
2854.Nm
2855to write the scanner it generates to standard output instead of
2856.Pa lex.yy.c .
2857.It Fl V
2858Prints the version number to stdout and exits.
2859.Fl Fl version
2860is a synonym for
2861.Fl V .
2862.It Fl v
2863Specifies that
2864.Nm
2865should write to stderr
2866a summary of statistics regarding the scanner it generates.
2867Most of the statistics are meaningless to the casual
2868.Nm
2869user, but the first line identifies the version of
2870.Nm
2871(same as reported by
2872.Fl V ) ,
2873and the next line the flags used when generating the scanner,
2874including those that are on by default.
2875.It Fl w
2876Suppresses warning messages.
2877.It Fl +
2878Specifies that
2879.Nm
2880should generate a C++ scanner class.
2881See the section on
2882.Sx GENERATING C++ SCANNERS
2883below for details.
2884.El
2885.Pp
2886.Nm
2887also provides a mechanism for controlling options within the
2888scanner specification itself, rather than from the
2889.Nm
2890command line.
2891This is done by including
2892.Dq %option
2893directives in the first section of the scanner specification.
2894Multiple options can be specified with a single
2895.Dq %option
directive, and multiple directives may appear in the first section of the
2897.Nm
2898input file.
2899.Pp
2900Most options are given simply as names, optionally preceded by the word
2901.Qq no
2902.Pq with no intervening whitespace
2903to negate their meaning.
2904A number are equivalent to
2905.Nm
2906flags or their negation:
2907.Bd -unfilled -offset indent
29087bit            -7 option
29098bit            -8 option
2910align           -Ca option
2911backup          -b option
2912batch           -B option
2913c++             -+ option
2914
2915caseful or
2916case-sensitive  opposite of -i (default)
2917
2918case-insensitive or
2919caseless        -i option
2920
2921debug           -d option
2922default         opposite of -s option
2923ecs             -Ce option
2924fast            -F option
2925full            -f option
2926interactive     -I option
2927lex-compat      -l option
2928meta-ecs        -Cm option
2929perf-report     -p option
2930read            -Cr option
2931stdout          -t option
2932verbose         -v option
2933warn            opposite of -w option
2934                (use "%option nowarn" for -w)
2935
2936array           equivalent to "%array"
2937pointer         equivalent to "%pointer" (default)
2938.Ed
2939.Pp
2940Some %option's provide features otherwise not available:
2941.Bl -tag -width Ds
2942.It always-interactive
2943Instructs
2944.Nm
2945to generate a scanner which always considers its input
2946.Qq interactive .
2947Normally, on each new input file the scanner calls
2948.Fn isatty
2949in an attempt to determine whether the scanner's input source is interactive
2950and thus should be read a character at a time.
2951When this option is used, however, no such call is made.
2952.It main
2953Directs
2954.Nm
2955to provide a default
2956.Fn main
2957program for the scanner, which simply calls
2958.Fn yylex .
2959This option implies
2960.Dq noyywrap
2961.Pq see below .
2962.It never-interactive
2963Instructs
2964.Nm
2965to generate a scanner which never considers its input
2966.Qq interactive
2967(again, no call made to
2968.Fn isatty ) .
2969This is the opposite of
2970.Dq always-interactive .
2971.It stack
2972Enables the use of start condition stacks
2973(see
2974.Sx START CONDITIONS
2975above).
2976.It stdinit
2977If set (i.e.,
2978.Dq %option stdinit ) ,
2979initializes
2980.Fa yyin
2981and
2982.Fa yyout
2983to stdin and stdout, instead of the default of
2984.Dq nil .
2985Some existing
2986.Nm lex
2987programs depend on this behavior, even though it is not compliant with ANSI C,
2988which does not require stdin and stdout to be compile-time constant.
2989.It yylineno
2990Directs
2991.Nm
2992to generate a scanner that maintains the number of the current line
2993read from its input in the global variable
2994.Fa yylineno .
2995This option is implied by
2996.Dq %option lex-compat .
2997.It yywrap
2998If unset (i.e.,
2999.Dq %option noyywrap ) ,
3000makes the scanner not call
3001.Fn yywrap
3002upon an end-of-file, but simply assume that there are no more files to scan
3003(until the user points
3004.Fa yyin
3005at a new file and calls
3006.Fn yylex
3007again).
3008.El
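.Pp
The per-file check that
.Dq always-interactive
and
.Dq never-interactive
suppress amounts to something like the following C sketch.
It illustrates the idea only and is not the code
.Nm
emits; the function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/*
 * Returns 1 if the stream is attached to a terminal, and should
 * therefore be read a character at a time, and 0 otherwise.
 */
int
input_is_interactive(FILE *fp)
{
	return fp != NULL && isatty(fileno(fp)) > 0;
}
```

A regular file or a pipe yields 0, so such input can be read in large
blocks.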
3009.Pp
3010.Nm
3011scans rule actions to determine whether the
3012.Em REJECT
3013or
3014.Fn yymore
3015features are being used.
3016The
3017.Dq reject
3018and
3019.Dq yymore
3020options are available to override its decision as to whether to use the
3021options, either by setting them (e.g.,
3022.Dq %option reject )
3023to indicate the feature is indeed used,
3024or unsetting them to indicate it actually is not used
3025(e.g.,
3026.Dq %option noyymore ) .
3027.Pp
3028Three options take string-delimited values, offset with
3029.Sq = :
3030.Pp
3031.D1 %option outfile="ABC"
3032.Pp
3033is equivalent to
3034.Fl o Ns Ar ABC ,
3035and
3036.Pp
3037.D1 %option prefix="XYZ"
3038.Pp
3039is equivalent to
3040.Fl P Ns Ar XYZ .
3041Finally,
3042.Pp
3043.D1 %option yyclass="foo"
3044.Pp
3045only applies when generating a C++ scanner
3046.Pf ( Fl +
3047option).
3048It informs
3049.Nm
3050that
3051.Dq foo
3052has been derived as a subclass of yyFlexLexer, so
3053.Nm
3054will place actions in the member function
3055.Dq foo::yylex()
3056instead of
3057.Dq yyFlexLexer::yylex() .
3058It also generates a
3059.Dq yyFlexLexer::yylex()
3060member function that emits a run-time error (by invoking
3061.Dq yyFlexLexer::LexerError() )
3062if called.
3063See
3064.Sx GENERATING C++ SCANNERS ,
3065below, for additional information.
3066.Pp
3067A number of options are available for
3068lint
3069purists who want to suppress the appearance of unneeded routines
3070in the generated scanner.
3071Each of the following, if unset
3072(e.g.,
3073.Dq %option nounput ) ,
3074results in the corresponding routine not appearing in the generated scanner:
3075.Bd -unfilled -offset indent
3076input, unput
3077yy_push_state, yy_pop_state, yy_top_state
3078yy_scan_buffer, yy_scan_bytes, yy_scan_string
3079.Ed
3080.Pp
3081(though
3082.Fn yy_push_state
3083and friends won't appear anyway unless
3084.Dq %option stack
3085is being used).
3086.Sh PERFORMANCE CONSIDERATIONS
3087The main design goal of
3088.Nm
3089is that it generate high-performance scanners.
3090It has been optimized for dealing well with large sets of rules.
3091Aside from the effects on scanner speed of the table compression
3092.Fl C
3093options outlined above,
3094there are a number of options/actions which degrade performance.
3095These are, from most expensive to least:
3096.Bd -unfilled -offset indent
3097REJECT
3098%option yylineno
3099arbitrary trailing context
3100
3101pattern sets that require backing up
3102%array
3103%option interactive
3104%option always-interactive
3105
3106\&'^' beginning-of-line operator
3107yymore()
3108.Ed
3109.Pp
3110with the first three all being quite expensive
3111and the last two being quite cheap.
3112Note also that
3113.Fn unput
3114is implemented as a routine call that potentially does quite a bit of work,
3115while
3116.Fn yyless
3117is a quite-cheap macro; so if just putting back some excess text,
3118use
3119.Fn yyless .
3120.Pp
3121.Em REJECT
3122should be avoided at all costs when performance is important.
3123It is a particularly expensive option.
3124.Pp
3125Getting rid of backing up is messy and often may be an enormous
3126amount of work for a complicated scanner.
In principle, one begins by using the
3128.Fl b
3129flag to generate a
3130.Pa lex.backup
3131file.
3132For example, on the input
3133.Bd -literal -offset indent
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;
3137.Ed
3138.Pp
3139the file looks like:
3140.Bd -literal -offset indent
State #6 is non-accepting -
 associated rule line numbers:
       2       3
 out-transitions: [ o ]
 jam-transitions: EOF [ \e001-n  p-\e177 ]

State #8 is non-accepting -
 associated rule line numbers:
       3
 out-transitions: [ a ]
 jam-transitions: EOF [ \e001-`  b-\e177 ]

State #9 is non-accepting -
 associated rule line numbers:
       3
 out-transitions: [ r ]
 jam-transitions: EOF [ \e001-q  s-\e177 ]

Compressed tables always back up.
3160.Ed
3161.Pp
3162The first few lines tell us that there's a scanner state in
3163which it can make a transition on an
3164.Sq o
3165but not on any other character,
3166and that in that state the currently scanned text does not match any rule.
3167The state occurs when trying to match the rules found
3168at lines 2 and 3 in the input file.
3169If the scanner is in that state and then reads something other than an
3170.Sq o ,
3171it will have to back up to find a rule which is matched.
3172With a bit of headscratching one can see that this must be the
3173state it's in when it has seen
3174.Sq fo .
3175When this has happened, if anything other than another
3176.Sq o
3177is seen, the scanner will have to back up to simply match the
3178.Sq f
3179.Pq by the default rule .
3180.Pp
3181The comment regarding State #8 indicates there's a problem when
3182.Qq foob
3183has been scanned.
3184Indeed, on any character other than an
3185.Sq a ,
3186the scanner will have to back up to accept
3187.Qq foo .
3188Similarly, the comment for State #9 concerns when
3189.Qq fooba
3190has been scanned and an
3191.Sq r
3192does not follow.
3193.Pp
3194The final comment reminds us that there's no point going to
3195all the trouble of removing backing up from the rules unless we're using
3196.Fl Cf
3197or
3198.Fl CF ,
3199since there's no performance gain doing so with compressed scanners.
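.Pp
The backing up described above can be modelled by hand for these two
rules.
The C sketch below is a simplified stand-in for the generated automaton,
not code produced by
.Nm ;
the function name and its interface are assumptions made for
illustration.

```c
#include <stddef.h>

/*
 * Hand-written model of a scanner whose only rules are "foo" and
 * "foobar", plus the default rule that matches any single character.
 * Returns the length of the token finally matched and stores in
 * *given_back how many characters of the tentative match had to be
 * handed back to the input.
 */
size_t
match_with_backup(const char *input, size_t *given_back)
{
	static const char longest[] = "foobar";
	size_t pos = 0;		/* characters matched along "foobar" */
	size_t last_accept = 0;	/* length at the last accepting state */
	size_t matched;

	while (pos < 6 && input[pos] == longest[pos]) {
		pos++;
		/* accepting after "f" (default rule), "foo" and "foobar" */
		if (pos == 1 || pos == 3 || pos == 6)
			last_accept = pos;
	}
	if (last_accept != 0)
		matched = last_accept;
	else
		matched = input[0] != '\0';	/* default rule, if any input */
	*given_back = pos > matched ? pos - matched : 0;
	return matched;
}
```

In this model the input
.Qq foobax
is matched as
.Qq foo
with two characters handed back, while
.Qq foobar
needs no backing up at all.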
3200.Pp
3201The way to remove the backing up is to add
3202.Qq error
3203rules:
3204.Bd -literal -offset indent
%%
foo    return TOK_KEYWORD;
foobar return TOK_KEYWORD;

fooba  |
foob   |
fo {
        /* false alarm, not really a keyword */
        return TOK_ID;
}
3215.Ed
3216.Pp
3217Eliminating backing up among a list of keywords can also be done using a
3218.Qq catch-all
3219rule:
3220.Bd -literal -offset indent
%%
foo    return TOK_KEYWORD;
foobar return TOK_KEYWORD;

[a-z]+ return TOK_ID;
3226.Ed
3227.Pp
3228This is usually the best solution when appropriate.
3229.Pp
3230Backing up messages tend to cascade.
3231With a complicated set of rules it's not uncommon to get hundreds of messages.
3232If one can decipher them, though,
3233it often only takes a dozen or so rules to eliminate the backing up
3234(though it's easy to make a mistake and have an error rule accidentally match
3235a valid token; a possible future
3236.Nm
3237feature will be to automatically add rules to eliminate backing up).
3238.Pp
3239It's important to keep in mind that the benefits of eliminating
3240backing up are gained only if
3241.Em every
3242instance of backing up is eliminated.
3243Leaving just one gains nothing.
3244.Pp
3245.Em Variable
3246trailing context
3247(where both the leading and trailing parts do not have a fixed length)
3248entails almost the same performance loss as
3249.Em REJECT
3250.Pq i.e., substantial .
3251So when possible a rule like:
3252.Bd -literal -offset indent
%%
mouse|rat/(cat|dog)   run();
3255.Ed
3256.Pp
3257is better written:
3258.Bd -literal -offset indent
%%
mouse/cat|dog         run();
rat/cat|dog           run();
3262.Ed
3263.Pp
3264or as
3265.Bd -literal -offset indent
%%
mouse|rat/cat         run();
mouse|rat/dog         run();
3269.Ed
3270.Pp
3271Note that here the special
3272.Sq |\&
3273action does not provide any savings, and can even make things worse (see
3274.Sx BUGS
3275below).
3276.Pp
3277Another area where the user can increase a scanner's performance
3278.Pq and one that's easier to implement
3279arises from the fact that the longer the tokens matched,
3280the faster the scanner will run.
3281This is because with long tokens the processing of most input
3282characters takes place in the
3283.Pq short
3284inner scanning loop, and does not often have to go through the additional work
3285of setting up the scanning environment (e.g.,
3286.Fa yytext )
3287for the action.
3288Recall the scanner for C comments:
3289.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*
<comment>"*"+[^*/\en]*
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
3300.Ed
3301.Pp
3302This could be sped up by writing it as:
3303.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*
<comment>[^*\en]*\en      ++line_num;
<comment>"*"+[^*/\en]*
<comment>"*"+[^*/\en]*\en ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
3315.Ed
3316.Pp
3317Now instead of each newline requiring the processing of another action,
3318recognizing the newlines is
3319.Qq distributed
3320over the other rules to keep the matched text as long as possible.
3321Note that adding rules does
3322.Em not
3323slow down the scanner!
3324The speed of the scanner is independent of the number of rules or
3325(modulo the considerations given at the beginning of this section)
3326how complicated the rules are with regard to operators such as
3327.Sq *
3328and
3329.Sq |\& .
3330.Pp
3331A final example in speeding up a scanner:
3332scan through a file containing identifiers and keywords, one per line
3333and with no other extraneous characters, and recognize all the keywords.
3334A natural first approach is:
3335.Bd -literal -offset indent
%%
asm      |
auto     |
break    |
\&... etc ...
volatile |
while    /* it's a keyword */

\&.|\en     /* it's not a keyword */
3345.Ed
3346.Pp
3347To eliminate the back-tracking, introduce a catch-all rule:
3348.Bd -literal -offset indent
%%
asm      |
auto     |
break    |
\&... etc ...
volatile |
while    /* it's a keyword */

[a-z]+   |
\&.|\en     /* it's not a keyword */
3359.Ed
3360.Pp
3361Now, if it's guaranteed that there's exactly one word per line,
then we can reduce the total number of matches by half by
3363merging in the recognition of newlines with that of the other tokens:
3364.Bd -literal -offset indent
%%
asm\en      |
auto\en     |
break\en    |
\&... etc ...
volatile\en |
while\en    /* it's a keyword */

[a-z]+\en   |
\&.|\en       /* it's not a keyword */
3375.Ed
3376.Pp
3377One has to be careful here,
3378as we have now reintroduced backing up into the scanner.
3379In particular, while we know that there will never be any characters
3380in the input stream other than letters or newlines,
3381.Nm
3382can't figure this out, and it will plan for possibly needing to back up
3383when it has scanned a token like
3384.Qq auto
3385and then the next character is something other than a newline or a letter.
3386Previously it would then just match the
3387.Qq auto
3388rule and be done, but now it has no
3389.Qq auto
3390rule, only an
3391.Qq auto\en
3392rule.
3393To eliminate the possibility of backing up,
3394we could either duplicate all rules but without final newlines or,
since we never expect to encounter such an input and therefore don't care
3396how it's classified, we can introduce one more catch-all rule,
3397this one which doesn't include a newline:
3398.Bd -literal -offset indent
%%
asm\en      |
auto\en     |
break\en    |
\&... etc ...
volatile\en |
while\en    /* it's a keyword */

[a-z]+\en   |
[a-z]+     |
\&.|\en       /* it's not a keyword */
3410.Ed
3411.Pp
3412Compiled with
3413.Fl Cf ,
3414this is about as fast as one can get a
3415.Nm
3416scanner to go for this particular problem.
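.Pp
The halving of matches gained by folding the newline into the word rules
can be checked with a small counting model.
The C sketch below only counts how many times some rule would match; it
is not a scanner, and the function name and its flag are hypothetical.

```c
#include <stddef.h>

/*
 * Counts rule matches over input made of lower-case words and other
 * characters.  With merge_newline set, a newline directly following a
 * word is consumed by the same match, as with the "[a-z]+\n" rule;
 * otherwise every newline is a match of its own.
 */
size_t
count_matches(const char *s, int merge_newline)
{
	size_t n = 0;

	while (*s != '\0') {
		if (*s >= 'a' && *s <= 'z') {
			while (*s >= 'a' && *s <= 'z')
				s++;
			if (merge_newline && *s == '\n')
				s++;
		} else {
			s++;	/* the ".|\n" rule: one character */
		}
		n++;
	}
	return n;
}
```

For three words, each on its own line, the model reports six matches
without merging and three with it.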
3417.Pp
3418A final note:
3419.Nm
3420is slow when matching NUL's,
3421particularly when a token contains multiple NUL's.
3422It's best to write rules which match short
3423amounts of text if it's anticipated that the text will often include NUL's.
3424.Pp
3425Another final note regarding performance: as mentioned above in the section
3426.Sx HOW THE INPUT IS MATCHED ,
3427dynamically resizing
3428.Fa yytext
3429to accommodate huge tokens is a slow process because it presently requires that
3430the
3431.Pq huge
3432token be rescanned from the beginning.
3433Thus if performance is vital, it is better to attempt to match
3434.Qq large
3435quantities of text but not
3436.Qq huge
3437quantities, where the cutoff between the two is at about 8K characters/token.
3438.Sh GENERATING C++ SCANNERS
3439.Nm
3440provides two different ways to generate scanners for use with C++.
3441The first way is to simply compile a scanner generated by
3442.Nm
3443using a C++ compiler instead of a C compiler.
3444This should not generate any compilation errors
3445(please report any found to the email address given in the
3446.Sx AUTHORS
3447section below).
3448C++ code can then be used in rule actions instead of C code.
3449Note that the default input source for scanners remains
3450.Fa yyin ,
3451and default echoing is still done to
3452.Fa yyout .
3453Both of these remain
3454.Fa FILE *
3455variables and not C++ streams.
3456.Pp
3457.Nm
3458can also be used to generate a C++ scanner class, using the
3459.Fl +
3460option (or, equivalently,
3461.Dq %option c++ ) ,
3462which is automatically specified if the name of the flex executable ends in a
3463.Sq + ,
3464such as
3465.Nm flex++ .
3466When using this option,
3467.Nm
3468defaults to generating the scanner to the file
3469.Pa lex.yy.cc
3470instead of
3471.Pa lex.yy.c .
3472The generated scanner includes the header file
3473.In g++/FlexLexer.h ,
3474which defines the interface to two C++ classes.
3475.Pp
3476The first class,
3477.Em FlexLexer ,
3478provides an abstract base class defining the general scanner class interface.
3479It provides the following member functions:
3480.Bl -tag -width Ds
3481.It const char* YYText()
3482Returns the text of the most recently matched token, the equivalent of
3483.Fa yytext .
3484.It int YYLeng()
3485Returns the length of the most recently matched token, the equivalent of
3486.Fa yyleng .
3487.It int lineno() const
3488Returns the current input line number
3489(see
3490.Dq %option yylineno ) ,
3491or 1 if
3492.Dq %option yylineno
3493was not used.
3494.It void set_debug(int flag)
3495Sets the debugging flag for the scanner, equivalent to assigning to
3496.Fa yy_flex_debug
3497(see the
3498.Sx OPTIONS
3499section above).
3500Note that the scanner must be built using
3501.Dq %option debug
3502to include debugging information in it.
3503.It int debug() const
3504Returns the current setting of the debugging flag.
3505.El
3506.Pp
3507Also provided are member functions equivalent to
3508.Fn yy_switch_to_buffer ,
3509.Fn yy_create_buffer
3510(though the first argument is an
3511.Fa std::istream*
3512object pointer and not a
3513.Fa FILE* ) ,
3514.Fn yy_flush_buffer ,
3515.Fn yy_delete_buffer ,
3516and
3517.Fn yyrestart
3518(again, the first argument is an
3519.Fa std::istream*
3520object pointer).
3521.Pp
3522The second class defined in
3523.In g++/FlexLexer.h
3524is
3525.Fa yyFlexLexer ,
3526which is derived from
3527.Fa FlexLexer .
3528It defines the following additional member functions:
3529.Bl -tag -width Ds
3530.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
3531Constructs a
3532.Fa yyFlexLexer
3533object using the given streams for input and output.
3534If not specified, the streams default to
3535.Fa cin
3536and
3537.Fa cout ,
3538respectively.
3539.It virtual int yylex()
3540Performs the same role as
3541.Fn yylex
3542does for ordinary flex scanners: it scans the input stream, consuming
3543tokens, until a rule's action returns a value.
3544If subclass
3545.Sq S
3546is derived from
3547.Fa yyFlexLexer ,
3548in order to access the member functions and variables of
3549.Sq S
3550inside
3551.Fn yylex ,
3552use
3553.Dq %option yyclass="S"
3554to inform
3555.Nm
3556that the
3557.Sq S
3558subclass will be used instead of
3559.Fa yyFlexLexer .
3560In this case, rather than generating
3561.Dq yyFlexLexer::yylex() ,
3562.Nm
3563generates
3564.Dq S::yylex()
3565(and also generates a dummy
3566.Dq yyFlexLexer::yylex()
3567that calls
3568.Dq yyFlexLexer::LexerError()
3569if called).
3570.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
3571Reassigns
3572.Fa yyin
3573to
3574.Fa new_in
.Pq if non-null
3576and
3577.Fa yyout
3578to
3579.Fa new_out
3580.Pq ditto ,
3581deleting the previous input buffer if
3582.Fa yyin
3583is reassigned.
3584.It int yylex(std::istream* new_in, std::ostream* new_out = 0)
3585First switches the input streams via
3586.Dq switch_streams(new_in, new_out)
3587and then returns the value of
3588.Fn yylex .
3589.El
3590.Pp
3591In addition,
3592.Fa yyFlexLexer
3593defines the following protected virtual functions which can be redefined
3594in derived classes to tailor the scanner:
3595.Bl -tag -width Ds
3596.It virtual int LexerInput(char* buf, int max_size)
3597Reads up to
3598.Fa max_size
3599characters into
3600.Fa buf
3601and returns the number of characters read.
3602To indicate end-of-input, return 0 characters.
3603Note that
3604.Qq interactive
3605scanners (see the
3606.Fl B
3607and
3608.Fl I
3609flags) define the macro
3610.Dv YY_INTERACTIVE .
3611If
3612.Fn LexerInput
3613has been redefined, and it's necessary to take different actions depending on
3614whether or not the scanner might be scanning an interactive input source,
3615it's possible to test for the presence of this name via
3616.Dq #ifdef .
3617.It virtual void LexerOutput(const char* buf, int size)
3618Writes out
3619.Fa size
3620characters from the buffer
3621.Fa buf ,
3622which, while NUL-terminated, may also contain
3623.Qq internal
3624NUL's if the scanner's rules can match text with NUL's in them.
3625.It virtual void LexerError(const char* msg)
3626Reports a fatal error message.
3627The default version of this function writes the message to the stream
3628.Fa cerr
3629and exits.
3630.El
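.Pp
The
.Fn LexerInput
contract described above (fill the buffer with at most
.Fa max_size
characters, return the count, and return 0 at end-of-input)
can be sketched outside of a generated scanner.
The following is only an illustration and is not code taken from
.Nm :
the helper names
.Fn read_chunk
and
.Fn count_chunks
are hypothetical, and a
.Fa std::istream
is assumed as the input source.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of FlexLexer.h): fills buf with at most
// max_size characters and returns how many were stored; a return value of
// 0 signals end-of-input, mirroring the LexerInput() contract.
int read_chunk(std::istream& in, char* buf, int max_size)
{
	if (max_size <= 0 || !in.good())
		return 0;
	in.read(buf, max_size);
	return static_cast<int>(in.gcount());
}

// Hypothetical driver: drains a string through read_chunk() and returns
// the number of calls that actually delivered data.
int count_chunks(const std::string& text, int max_size)
{
	std::istringstream in(text);
	char buf[64];
	int calls = 0;

	if (max_size > 64)
		max_size = 64;	// never ask for more than buf can hold
	while (read_chunk(in, buf, max_size) > 0)
		++calls;
	return calls;
}
```

With a seven-character input and a
.Fa max_size
of 3, three calls deliver data and the next call returns 0.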
3631.Pp
3632Note that a
3633.Fa yyFlexLexer
3634object contains its entire scanning state.
3635Thus such objects can be used to create reentrant scanners.
3636Multiple instances of the same
3637.Fa yyFlexLexer
3638class can be instantiated, and multiple C++ scanner classes can be combined
3639in the same program using the
3640.Fl P
3641option discussed above.
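.Pp
The point that each object carries its own scanning state can be shown
with a toy class.
The names
.Fa WordScanner ,
.Fn next_token ,
and
.Fn scanners_are_independent
below are hypothetical stand-ins; the class is not generated by
.Nm
and only mimics the idea of keeping the input source and all counters
inside the object.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Toy stand-in for a scanner object: every piece of state it needs is a
// data member, so several instances can be used side by side.
class WordScanner {
public:
	explicit WordScanner(std::istream* in) : in_(in), words_(0) {}

	// Returns 1 after consuming one whitespace-delimited word and
	// 0 once the input is exhausted.
	int next_token()
	{
		std::string word;
		if (!(*in_ >> word))
			return 0;
		++words_;
		return 1;
	}

	int words() const { return words_; }

private:
	std::istream* in_;	// per-object input source
	int words_;		// per-object token counter
};

// Interleaves two scanners to show that neither disturbs the other.
bool scanners_are_independent()
{
	std::istringstream a("one two three"), b("four five");
	WordScanner sa(&a), sb(&b);
	bool more = true;

	while (more) {
		more = false;
		if (sa.next_token())
			more = true;
		if (sb.next_token())
			more = true;
	}
	return sa.words() == 3 && sb.words() == 2;
}
```

Because no state is shared, the interleaved calls leave each object with
its own correct count.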
3642.Pp
3643Finally, note that the
3644.Dq %array
3645feature is not available to C++ scanner classes;
3646.Dq %pointer
3647must be used
3648.Pq the default .
3649.Pp
3650Here is an example of a simple C++ scanner:
.Bd -literal -offset indent
%{
// An example of using the flex C++ scanner class.
#include <iostream>
int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        /* consume the comment, counting the lines it spans */
        while ((c = yyinput()) != 0) {
                if (c == '\en')
                        ++mylineno;
                else if (c == '*') {
                        if ((c = yyinput()) == '/')
                                break;
                        else
                                unput(c);
                }
        }
}

{number}  std::cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    std::cout << "name " << YYText() << '\en';

{string}  std::cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
	FlexLexer* lexer = new yyFlexLexer;
	while (lexer->yylex() != 0)
		;
	delete lexer;
	return 0;
}
.Ed
3707.Pp
3708To create multiple
3709.Pq different
3710lexer classes, use the
3711.Fl P
3712flag
3713(or the
3714.Dq prefix=
3715option)
3716to rename each
3717.Fa yyFlexLexer
3718to some other
3719.Fa xxFlexLexer .
3720.In g++/FlexLexer.h
3721can then be included in other sources once per lexer class, first renaming
3722.Fa yyFlexLexer
3723as follows:
.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
.Ed
3733.Pp
This example assumes that
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
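.Pp
The renaming works because the preprocessor substitutes the new name
wherever
.Fa yyFlexLexer
appears in the class declaration.
The following self-contained sketch shows only that mechanism:
.Dv DECLARE_LEXER_CLASS ,
.Dv STRINGIZE ,
and
.Fn class_name
are hypothetical stand-ins for what the header would supply, and do not
exist in
.Nm .

```cpp
#include <string>

#define STRINGIZE_(x) #x
#define STRINGIZE(x) STRINGIZE_(x)

// Stand-in for the class declaration that the real header would provide.
// Every use of yyFlexLexer inside it is subject to macro replacement.
#define DECLARE_LEXER_CLASS \
	class yyFlexLexer { \
	public: \
		std::string class_name() const \
		{ \
			return STRINGIZE(yyFlexLexer); \
		} \
	};

#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
DECLARE_LEXER_CLASS	// declares class xxFlexLexer

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
DECLARE_LEXER_CLASS	// declares class zzFlexLexer

#undef yyFlexLexer
```

After these lines, two distinct classes named
.Fa xxFlexLexer
and
.Fa zzFlexLexer
exist, and no class named
.Fa yyFlexLexer
has been declared.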
3739.Pp
3740.Sy IMPORTANT :
3741the present form of the scanning class is experimental
3742and may change considerably between major releases.
3743.Sh INCOMPATIBILITIES WITH LEX AND POSIX
3744.Nm
3745is a rewrite of the
3746.At
3747.Nm lex
3748tool
3749(the two implementations do not share any code, though),
3750with some extensions and incompatibilities, both of which are of concern
3751to those who wish to write scanners acceptable to either implementation.
3752.Nm
3753is fully compliant with the
3754.Tn POSIX
3755.Nm lex
3756specification, except that when using
3757.Dq %pointer
3758.Pq the default ,
3759a call to
3760.Fn unput
3761destroys the contents of
3762.Fa yytext ,
3763which is counter to the
3764.Tn POSIX
3765specification.
3766.Pp
3767In this section we discuss all of the known areas of incompatibility between
3768.Nm ,
3769.At
3770.Nm lex ,
3771and the
3772.Tn POSIX
3773specification.
3774.Pp
3775.Nm flex Ns 's
3776.Fl l
3777option turns on maximum compatibility with the original
3778.At
3779.Nm lex
3780implementation, at the cost of a major loss in the generated scanner's
3781performance.
3782We note below which incompatibilities can be overcome using the
3783.Fl l
3784option.
3785.Pp
3786.Nm
3787is fully compatible with
3788.Nm lex
3789with the following exceptions:
3790.Bl -dash
3791.It
3792The undocumented
3793.Nm lex
3794scanner internal variable
3795.Fa yylineno
3796is not supported unless
3797.Fl l
3798or
3799.Dq %option yylineno
3800is used.
3801.Pp
.Fa yylineno
should be maintained on a per-buffer basis, but is instead maintained on a per-scanner
.Pq single global variable
basis.
3806.Pp
3807.Fa yylineno
3808is not part of the
3809.Tn POSIX
3810specification.
3811.It
3812The
3813.Fn input
3814routine is not redefinable, though it may be called to read characters
3815following whatever has been matched by a rule.
3816If
3817.Fn input
3818encounters an end-of-file, the normal
3819.Fn yywrap
3820processing is done.
3821A
3822.Dq real
3823end-of-file is returned by
3824.Fn input
3825as
3826.Dv EOF .
3827.Pp
3828Input is instead controlled by defining the
3829.Dv YY_INPUT
3830macro.
3831.Pp
3832The
3833.Nm
3834restriction that
3835.Fn input
3836cannot be redefined is in accordance with the
3837.Tn POSIX
3838specification, which simply does not specify any way of controlling the
3839scanner's input other than by making an initial assignment to
3840.Fa yyin .
3841.It
3842The
3843.Fn unput
3844routine is not redefinable.
3845This restriction is in accordance with
3846.Tn POSIX .
3847.It
3848.Nm
3849scanners are not as reentrant as
3850.Nm lex
3851scanners.
3852In particular, if a scanner is interactive and
3853an interrupt handler long-jumps out of the scanner,
3854and the scanner is subsequently called again,
3855the following error message may be displayed:
3856.Pp
3857.D1 fatal flex scanner internal error--end of buffer missed
3858.Pp
3859To reenter the scanner, first use
3860.Pp
3861.Dl yyrestart(yyin);
3862.Pp
3863Note that this call will throw away any buffered input;
3864usually this isn't a problem with an interactive scanner.
3865.Pp
Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
3868See
3869.Sx GENERATING C++ SCANNERS
3870above for details.
3871.It
3872.Fn output
3873is not supported.
3874Output from the
3875.Em ECHO
3876macro is done to the file-pointer
3877.Fa yyout
3878.Pq default stdout .
3879.Pp
3880.Fn output
3881is not part of the
3882.Tn POSIX
3883specification.
3884.It
3885.Nm lex
3886does not support exclusive start conditions
3887.Pq %x ,
3888though they are in the
3889.Tn POSIX
3890specification.
3891.It
3892When definitions are expanded,
3893.Nm
3894encloses them in parentheses.
3895With
3896.Nm lex ,
3897the following:
.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
.Ed
3904.Pp
3905will not match the string
3906.Qq foo
3907because when the macro is expanded the rule is equivalent to
3908.Qq foo[A-Z][A-Z0-9]*?
3909and the precedence is such that the
3910.Sq ?\&
3911is associated with
3912.Qq [A-Z0-9]* .
3913With
3914.Nm ,
3915the rule will be expanded to
3916.Qq foo([A-Z][A-Z0-9]*)?
3917and so the string
3918.Qq foo
3919will match.
3920.Pp
3921Note that if the definition begins with
3922.Sq ^
3923or ends with
3924.Sq $
3925then it is not expanded with parentheses, to allow these operators to appear in
3926definitions without losing their special meanings.
3927But the
3928.Sq Aq s ,
3929.Sq / ,
3930and
3931.Aq Aq EOF
3932operators cannot be used in a
3933.Nm
3934definition.
3935.Pp
3936Using
3937.Fl l
3938results in the
3939.Nm lex
3940behavior of no parentheses around the definition.
3941.Pp
3942The
3943.Tn POSIX
3944specification is that the definition be enclosed in parentheses.
3945.It
3946Some implementations of
3947.Nm lex
3948allow a rule's action to begin on a separate line,
3949if the rule's pattern has trailing whitespace:
.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
.Ed
3955.Pp
3956.Nm
3957does not support this feature.
3958.It
3959The
3960.Nm lex
3961.Sq %r
3962.Pq generate a Ratfor scanner
3963option is not supported.
3964It is not part of the
3965.Tn POSIX
3966specification.
3967.It
3968After a call to
3969.Fn unput ,
3970.Fa yytext
3971is undefined until the next token is matched,
3972unless the scanner was built using
3973.Dq %array .
3974This is not the case with
3975.Nm lex
3976or the
3977.Tn POSIX
3978specification.
3979The
3980.Fl l
3981option does away with this incompatibility.
3982.It
3983The precedence of the
3984.Sq {}
3985.Pq numeric range
3986operator is different.
3987.Nm lex
3988interprets
3989.Qq abc{1,3}
3990as match one, two, or three occurrences of
3991.Sq abc ,
3992whereas
3993.Nm
3994interprets it as match
3995.Sq ab
3996followed by one, two, or three occurrences of
3997.Sq c .
3998The latter is in agreement with the
3999.Tn POSIX
4000specification.
4001.It
4002The precedence of the
4003.Sq ^
4004operator is different.
4005.Nm lex
4006interprets
4007.Qq ^foo|bar
4008as match either
4009.Sq foo
4010at the beginning of a line, or
4011.Sq bar
4012anywhere, whereas
4013.Nm
4014interprets it as match either
4015.Sq foo
4016or
4017.Sq bar
4018if they come at the beginning of a line.
4019The latter is in agreement with the
4020.Tn POSIX
4021specification.
4022.It
4023The special table-size declarations such as
4024.Sq %a
4025supported by
4026.Nm lex
4027are not required by
4028.Nm
4029scanners;
4030.Nm
4031ignores them.
4032.It
4033The name
4034.Dv FLEX_SCANNER
4035is #define'd so scanners may be written for use with either
4036.Nm
4037or
4038.Nm lex .
4039Scanners also include
4040.Dv YY_FLEX_MAJOR_VERSION
4041and
4042.Dv YY_FLEX_MINOR_VERSION
4043indicating which version of
4044.Nm
4045generated the scanner
4046(for example, for the 2.5 release, these defines would be 2 and 5,
4047respectively).
4048.El
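.Pp
The effect of parenthesizing definitions, described in the list above,
can be reproduced with plain string substitution.
The sketch below is an illustration under stated assumptions, not the
algorithm used by
.Nm :
.Fn expand_definition
and
.Fn matches
are hypothetical helpers, the
.Sq ^
and
.Sq $
exceptions are not handled, and
.Fa std::regex
stands in for a lex pattern matcher.
Note that
.Fa std::regex
reads
.Sq *?
as a non-greedy star rather than an optional star; in both readings the
leading
.Sq [A-Z]
remains mandatory, which is what matters here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every "{name}" in pattern with body,
// wrapped in parentheses when parenthesize is true.
std::string expand_definition(const std::string& pattern,
    const std::string& name, const std::string& body, bool parenthesize)
{
	const std::string ref = "{" + name + "}";
	const std::string repl = parenthesize ? "(" + body + ")" : body;
	std::string out = pattern;
	std::string::size_type pos = 0;

	while ((pos = out.find(ref, pos)) != std::string::npos) {
		out.replace(pos, ref.size(), repl);
		pos += repl.size();	// skip past the inserted text
	}
	return out;
}

// Hypothetical helper: true if the whole of text matches pattern.
bool matches(const std::string& pattern, const std::string& text)
{
	return std::regex_match(text, std::regex(pattern));
}
```

Expanding
.Qq foo{NAME}?
with parentheses yields a pattern that accepts
.Qq foo
on its own, while the unparenthesized expansion still requires at least
one capital letter after
.Qq foo .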
4049.Pp
4050The following
4051.Nm
4052features are not included in
4053.Nm lex
4054or the
4055.Tn POSIX
4056specification:
4057.Bd -unfilled -offset indent
4058C++ scanners
4059%option
4060start condition scopes
4061start condition stacks
4062interactive/non-interactive scanners
4063yy_scan_string() and friends
4064yyterminate()
4065yy_set_interactive()
4066yy_set_bol()
4067YY_AT_BOL()
4068<<EOF>>
4069<*>
4070YY_DECL
4071YY_START
4072YY_USER_ACTION
4073YY_USER_INIT
4074#line directives
4075%{}'s around actions
4076multiple actions on a line
4077.Ed
4078.Pp
4079plus almost all of the
4080.Nm
4081flags.
4082The last feature in the list refers to the fact that with
4083.Nm
4084multiple actions can be placed on the same line,
separated with semicolons, while with
4086.Nm lex ,
4087the following
4088.Pp
4089.Dl foo    handle_foo(); ++num_foos_seen;
4090.Pp
4091is
4092.Pq rather surprisingly
4093truncated to
4094.Pp
4095.Dl foo    handle_foo();
4096.Pp
4097.Nm
4098does not truncate the action.
4099Actions that are not enclosed in braces
4100are simply terminated at the end of the line.
4101.Sh FILES
4102.Bl -tag -width "<g++/FlexLexer.h>"
4103.It Pa flex.skl
4104Skeleton scanner.
4105This file is only used when building flex, not when
4106.Nm
4107executes.
4108.It Pa lex.backup
4109Backing-up information for the
4110.Fl b
4111flag (called
4112.Pa lex.bck
4113on some systems).
4114.It Pa lex.yy.c
4115Generated scanner
4116(called
4117.Pa lexyy.c
4118on some systems).
4119.It Pa lex.yy.cc
4120Generated C++ scanner class, when using
4121.Fl + .
4122.It In g++/FlexLexer.h
4123Header file defining the C++ scanner base class,
4124.Fa FlexLexer ,
4125and its derived class,
4126.Fa yyFlexLexer .
4127.It Pa /usr/lib/libl.*
4128.Nm
4129libraries.
4130The
4131.Pa /usr/lib/libfl.*\&
4132libraries are links to these.
4133Scanners must be linked using either
4134.Fl \&ll
4135or
4136.Fl lfl .
4137.El
4138.Sh EXIT STATUS
4139.Ex -std flex
4140.Sh DIAGNOSTICS
4141.Bl -diag
4142.It warning, rule cannot be matched
4143Indicates that the given rule cannot be matched because it follows other rules
4144that will always match the same text as it.
4145For example, in the following
4146.Dq foo
4147cannot be matched because it comes after an identifier
4148.Qq catch-all
4149rule:
.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
.Ed
4154.Pp
4155Using
4156.Em REJECT
4157in a scanner suppresses this warning.
4158.It "warning, \-s option given but default rule can be matched"
4159Means that it is possible
4160.Pq perhaps only in a particular start condition
4161that the default rule
4162.Pq match any single character
4163is the only one that will match a particular input.
4164Since
4165.Fl s
4166was given, presumably this is not intended.
4167.It reject_used_but_not_detected undefined
4168.It yymore_used_but_not_detected undefined
4169These errors can occur at compile time.
4170They indicate that the scanner uses
4171.Em REJECT
4172or
4173.Fn yymore
4174but that
4175.Nm
4176failed to notice the fact, meaning that
4177.Nm
4178scanned the first two sections looking for occurrences of these actions
4179and failed to find any, but somehow they snuck in
4180.Pq via an #include file, for example .
4181Use
4182.Dq %option reject
4183or
4184.Dq %option yymore
4185to indicate to
4186.Nm
4187that these features are really needed.
4188.It flex scanner jammed
4189A scanner compiled with
4190.Fl s
4191has encountered an input string which wasn't matched by any of its rules.
4192This error can also occur due to internal problems.
4193.It token too large, exceeds YYLMAX
4194The scanner uses
4195.Dq %array
4196and one of its rules matched a string longer than the
4197.Dv YYLMAX
4198constant
4199.Pq 8K bytes by default .
4200The value can be increased by #define'ing
4201.Dv YYLMAX
4202in the definitions section of
4203.Nm
4204input.
4205.It "scanner requires \-8 flag to use the character 'x'"
The scanner specification includes recognizing the 8-bit character
.Sq x ,
but the
.Fl 8
flag was not specified and the scanner defaulted to 7-bit because the
.Fl Cf
or
.Fl CF
table compression options were used.
4215See the discussion of the
4216.Fl 7
4217flag for details.
4218.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
could not hold both the pushed-back text and the current token in
.Fa yytext .
4222Ideally the scanner should dynamically resize the buffer in this case,
4223but at present it does not.
4224.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
4225The scanner was working on matching an extremely large token and needed
4226to expand the input buffer.
4227This doesn't work with scanners that use
4228.Em REJECT .
4229.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner which is reentered after a long-jump
4231has jumped out
4232.Pq or over
4233the scanner's activation frame.
4234Before reentering the scanner, use:
4235.Pp
4236.Dl yyrestart(yyin);
4237.Pp
4238or, as noted above, switch to using the C++ scanner class.
4239.It "too many start conditions in <> construct!"
4240More start conditions than exist were listed in a <> construct
4241(so at least one of them must have been listed twice).
4242.El
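.Pp
The
.Qq rule cannot be matched
warning follows from how a rule is chosen: the longest match wins, and
when two rules match the same text the one listed first is used.
The following sketch illustrates that selection only.
.Fn pick_rule
is a hypothetical helper, not part of
.Nm ,
and
.Fa std::regex
is assumed as a stand-in matcher that behaves like a scanner only for
simple patterns such as the ones shown.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the index of the rule that would fire at
// the start of input, or -1 if no rule matches there.
int pick_rule(const std::vector<std::string>& rules, const std::string& input)
{
	int best = -1;
	std::string::size_type best_len = 0;

	for (std::size_t i = 0; i < rules.size(); ++i) {
		const std::regex re(rules[i]);
		std::smatch m;

		// match_continuous anchors the match at the start of input.
		if (!std::regex_search(input, m, re,
		    std::regex_constants::match_continuous))
			continue;
		const std::string::size_type len =
		    static_cast<std::string::size_type>(m.length(0));
		// Strictly longer only: on a tie the earlier rule stays.
		if (len > best_len) {
			best = static_cast<int>(i);
			best_len = len;
		}
	}
	return best;
}
```

With the rules
.Qq [a-z]+
and
.Qq foo
in that order, the input
.Qq foo
always selects the first rule, so the second can never fire.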
4243.Sh SEE ALSO
4244.Xr awk 1 ,
4245.Xr sed 1 ,
4246.Xr yacc 1
4247.Rs
4248.%A John Levine
4249.%A Tony Mason
4250.%A Doug Brown
4251.%B Lex & Yacc
4252.%I O'Reilly and Associates
4253.%N 2nd edition
4254.Re
4255.Rs
4256.%A Alfred Aho
4257.%A Ravi Sethi
4258.%A Jeffrey Ullman
4259.%B Compilers: Principles, Techniques and Tools
4260.%I Addison-Wesley
4261.%D 1986
4262.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
4263.Re
4264.Sh STANDARDS
4265The
4266.Nm lex
4267utility is compliant with the
4268.St -p1003.1-2008
4269specification,
4270though its presence is optional.
4271.Pp
4272The flags
4273.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
4274.Op Fl -help ,
4275and
4276.Op Fl -version
4277are extensions to that specification.
4278.Pp
4279See also the
4280.Sx INCOMPATIBILITIES WITH LEX AND POSIX
4281section, above.
4282.Sh AUTHORS
4283Vern Paxson, with the help of many ideas and much inspiration from
4284Van Jacobson.
4285Original version by Jef Poskanzer.
4286The fast table representation is a partial implementation of a design done by
4287Van Jacobson.
4288The implementation was done by Kevin Gong and Vern Paxson.
4289.Pp
4290Thanks to the many
4291.Nm
4292beta-testers, feedbackers, and contributors, especially Francois Pinard,
4293Casey Leedom,
4294Robert Abramovitz,
4295Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
4296Neal Becker, Nelson H.F. Beebe,
4297.Mt benson@odi.com ,
4298Karl Berry, Peter A. Bigot, Simon Blanchard,
4299Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
4300Brian Clapper, J.T. Conklin,
4301Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
4302Daniels, Chris G. Demetriou, Theo de Raadt,
4303Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
4304Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
4305Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
4306Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
4307Jan Hajic, Charles Hemphill, NORO Hideo,
4308Jarkko Hietaniemi, Scott Hofmann,
4309Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
4310Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
4311Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
4312Amir Katz,
4313.Mt ken@ken.hilco.com ,
4314Kevin B. Kenny,
4315Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
4316Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
4317David Loffredo, Mike Long,
4318Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
4319Bengt Martensson, Chris Metcalf,
4320Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
4321G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
4322Richard Ohnemus, Karsten Pahnke,
4323Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
4324Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
4325Frederic Raimbault, Pat Rankin, Rick Richardson,
4326Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
4327Andreas Scherer, Darrell Schiebel, Raf Schietekat,
4328Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
4330Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
4331Chris Thewalt, Richard M. Timoney, Jodi Tsai,
4332Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
4333Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
4334and those whose names have slipped my marginal mail-archiving skills
4335but whose contributions are appreciated all the
4336same.
4337.Pp
4338Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
4339John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
4340Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
4341distribution headaches.
4342.Pp
4343Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
4344to Benson Margulies and Fred Burke for C++ support;
4345to Kent Williams and Tom Epperly for C++ class support;
4346to Ove Ewerlid for support of NUL's;
4347and to Eric Hughes for support of multiple buffers.
4348.Pp
4349This work was primarily done when I was with the Real Time Systems Group
4350at the Lawrence Berkeley Laboratory in Berkeley, CA.
4351Many thanks to all there for the support I received.
4352.Pp
4353Send comments to
4354.Aq Mt vern@ee.lbl.gov .
4355.Sh BUGS
4356Some trailing context patterns cannot be properly matched and generate
4357warning messages
4358.Pq "dangerous trailing context" .
4359These are patterns where the ending of the first part of the rule
4360matches the beginning of the second part, such as
4361.Qq zx*/xy* ,
4362where the
4363.Sq x*
4364matches the
4365.Sq x
4366at the beginning of the trailing context.
4367(Note that the POSIX draft states that the text matched by such patterns
4368is undefined.)
4369.Pp
4370For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above-mentioned performance loss.
4372In particular, parts using
4373.Sq |\&
4374or
4375.Sq {n}
4376(such as
4377.Qq foo{3} )
4378are always considered variable-length.
4379.Pp
4380Combining trailing context with the special
4381.Sq |\&
4382action can result in fixed trailing context being turned into
4383the more expensive variable trailing context.
This happens, for example, in the following:
.Bd -literal -offset indent
%%
abc      |
xyz/def
.Ed
4390.Pp
4391Use of
4392.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
4394.Dq %array
4395directive
4396or the
4397.Fl l
4398option has been used.
4399.Pp
4400Pattern-matching of NUL's is substantially slower than matching other
4401characters.
4402.Pp
4403Dynamic resizing of the input buffer is slow, as it entails rescanning
4404all the text matched so far by the current
4405.Pq generally huge
4406token.
4407.Pp
4408Due to both buffering of input and read-ahead,
4409it is not possible to intermix calls to
4410.In stdio.h
routines such as
.Fn getchar
4413with
4414.Nm
4415rules and expect it to work.
4416Call
4417.Fn input
4418instead.
4419.Pp
The total number of table entries listed by the
.Fl v
flag excludes the table entries needed to determine
which rule has been matched.
4424The number of entries is equal to the number of DFA states
4425if the scanner does not use
4426.Em REJECT ,
4427and somewhat greater than the number of states if it does.
4428.Pp
4429.Em REJECT
4430cannot be used with the
4431.Fl f
4432or
4433.Fl F
4434options.
4435.Pp
4436The
4437.Nm
4438internal algorithms need documentation.
4439