This is flex.info, produced by makeinfo version 4.13 from flex.texi.

INFO-DIR-SECTION Programming
START-INFO-DIR-ENTRY
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
END-INFO-DIR-ENTRY

   The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012 The
Flex Project.

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.

   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.


File: flex.info,  Node: Top,  Next: Copyright,  Prev: (dir),  Up: (dir)

flex
****

This manual describes `flex', a tool for generating programs that
perform pattern-matching on text.  The manual includes both tutorial and
reference sections.

   This edition of `The flex Manual' documents `flex' version 2.5.39.
It was last updated on 6 December 2012.

   This manual was written by Vern Paxson, Will Estes and John Millaway.

* Menu:

* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::

 --- The Detailed Node Listing ---

Format of the Input File

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::

Scanner Options

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

Reentrant C Scanners

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::

The Reentrant API in Detail

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::

Memory Management

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::

Serialized Tables

* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::

FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::

Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::

Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::


File: flex.info,  Node: Copyright,  Next: Reporting Bugs,  Prev: Top,  Up: Top

1 Copyright
***********

The flex manual is placed under the same licensing conditions as the
rest of flex:

   Copyright (C) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012 The
Flex Project.

   Copyright (C) 1990, 1997 The Regents of the University of California.
All rights reserved.

   This code is derived from software contributed to Berkeley by Vern
Paxson.

   The United States Government has rights in this work pursuant to
contract no. DE-AC03-76SF00098 between the United States Department of
Energy and the University of California.

   Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the
     distribution.

   Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.

   THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.


File: flex.info,  Node: Reporting Bugs,  Next: Introduction,  Prev: Copyright,  Up: Top

2 Reporting Bugs
****************

If you find a bug in `flex', please report it using the SourceForge Bug
Tracking facilities which can be found on flex's SourceForge Page
(http://sourceforge.net/projects/flex).


File: flex.info,  Node: Introduction,  Next: Simple Examples,  Prev: Reporting Bugs,  Up: Top

3 Introduction
**************

`flex' is a tool for generating "scanners".  A scanner is a program
which recognizes lexical patterns in text.  The `flex' program reads
the given input files, or its standard input if no file names are
given, for a description of a scanner to generate.  The description is
in the form of pairs of regular expressions and C code, called "rules".
`flex' generates as output a C source file, `lex.yy.c' by default,
which defines a routine `yylex()'.  This file can be compiled and
linked with the flex runtime library to produce an executable.  When
the executable is run, it analyzes its input for occurrences of the
regular expressions.  Whenever it finds one, it executes the
corresponding C code.


File: flex.info,  Node: Simple Examples,  Next: Format,  Prev: Introduction,  Up: Top

4 Some Simple Examples
**********************

First some simple examples to get the flavor of how one uses `flex'.

   The following `flex' input specifies a scanner which, when it
encounters the string `username' will replace it with the user's login
name:

     %%
     username    printf( "%s", getlogin() );

   By default, any text not matched by a `flex' scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of `username' expanded.  In this
input, there is just one rule.  `username' is the "pattern" and the
`printf' is the "action".
The `%%' symbol marks the beginning of the
rules.

   Here's another simple example:

             int num_lines = 0, num_chars = 0;

     %%
     \n      ++num_lines; ++num_chars;
     .       ++num_chars;

     %%

     int main()
             {
             yylex();
             printf( "# of lines = %d, # of chars = %d\n",
                     num_lines, num_chars );
             }

   This scanner counts the number of characters and the number of lines
in its input.  It produces no output other than the final report on the
character and line counts.  The first line declares two globals,
`num_lines' and `num_chars', which are accessible both inside `yylex()'
and in the `main()' routine declared after the second `%%'.  There are
two rules, one which matches a newline (`\n') and increments both the
line count and the character count, and one which matches any character
other than a newline (indicated by the `.' regular expression).

   A somewhat more complicated example:

     /* scanner for a toy Pascal-like language */

     %{
     /* need this for the calls to atoi() and atof() below */
     #include <stdlib.h>
     %}

     DIGIT    [0-9]
     ID       [a-z][a-z0-9]*

     %%

     {DIGIT}+    {
                 printf( "An integer: %s (%d)\n", yytext,
                         atoi( yytext ) );
                 }

     {DIGIT}+"."{DIGIT}*        {
                 printf( "A float: %s (%g)\n", yytext,
                         atof( yytext ) );
                 }

     if|then|begin|end|procedure|function        {
                 printf( "A keyword: %s\n", yytext );
                 }

     {ID}        printf( "An identifier: %s\n", yytext );

     "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

     "{"[^{}\n]*"}"     /* eat up one-line comments */

     [ \t\n]+          /* eat up whitespace */

     .           printf( "Unrecognized character: %s\n", yytext );

     %%

     int main( int argc, char **argv )
         {
         ++argv, --argc;  /* skip over program name */
         if ( argc > 0 )
                 yyin = fopen( argv[0], "r" );
         else
                 yyin = stdin;

         yylex();
         }

   This is the beginning of a simple scanner for a language like
Pascal.  It identifies different types of "tokens" and reports on what
it has seen.

   The details of this example will be explained in the following
sections.


File: flex.info,  Node: Format,  Next: Patterns,  Prev: Simple Examples,  Up: Top

5 Format of the Input File
**************************

The `flex' input file consists of three sections, separated by a line
containing only `%%'.

         definitions
         %%
         rules
         %%
         user code

* Menu:

* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::


File: flex.info,  Node: Definitions Section,  Next: Rules Section,  Prev: Format,  Up: Format

5.1 Format of the Definitions Section
=====================================

The "definitions section" contains declarations of simple "name"
definitions to simplify the scanner specification, and declarations of
"start conditions", which are explained in a later section.

   Name definitions have the form:

         name definition

   The `name' is a word beginning with a letter or an underscore (`_')
followed by zero or more letters, digits, `_', or `-' (dash).  The
definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.  The
definition can subsequently be referred to using `{name}', which will
expand to `(definition)'.
For example,

         DIGIT    [0-9]
         ID       [a-z][a-z0-9]*

   Defines `DIGIT' to be a regular expression which matches a single
digit, and `ID' to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.  A subsequent reference to

         {DIGIT}+"."{DIGIT}*

   is identical to

         ([0-9])+"."([0-9])*

   and matches one-or-more digits followed by a `.' followed by
zero-or-more digits.

   An unindented comment (i.e., a line beginning with `/*') is copied
verbatim to the output up to the next `*/'.

   Any _indented_ text or text enclosed in `%{' and `%}' is also copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.

   A `%top' block is similar to a `%{' ... `%}' block, except that the
code in a `%top' block is relocated to the _top_ of the generated file,
before any flex definitions (1).  The `%top' block is useful when you
want certain preprocessor macros to be defined or certain files to be
included before the generated code.  The single characters, `{' and
`}' are used to delimit the `%top' block, as shown in the example below:

         %top{
             /* This code goes at the "top" of the generated file. */
             #include <stdint.h>
             #include <inttypes.h>
         }

   Multiple `%top' blocks are allowed, and their order is preserved.

   ---------- Footnotes ----------

   (1) Actually, `yyIN_HEADER' is defined before the `%top' block.


File: flex.info,  Node: Rules Section,  Next: User Code Section,  Prev: Definitions Section,  Up: Format

5.2 Format of the Rules Section
===============================

The "rules" section of the `flex' input contains a series of rules of
the form:

         pattern   action

   where the pattern must be unindented and the action must begin on
the same line.  *Note Patterns::, for a further description of patterns
and actions.

   In the rules section, any indented or %{ %} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered.  Other indented or
%{ %} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for POSIX compliance.  *Note Lex and Posix::,
for other such features).

   Any _indented_ text or text enclosed in `%{' and `%}' is copied
verbatim to the output (with the %{ and %} symbols removed).  The %{
and %} symbols must appear unindented on lines by themselves.


File: flex.info,  Node: User Code Section,  Next: Comments in the Input,  Prev: Rules Section,  Up: Format

5.3 Format of the User Code Section
===================================

The user code section is simply copied to `lex.yy.c' verbatim.  It is
used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
`%%' in the input file may be skipped, too.


File: flex.info,  Node: Comments in the Input,  Prev: User Code Section,  Up: Format

5.4 Comments in the Input
=========================

Flex supports C-style comments, that is, anything between `/*' and `*/'
is considered a comment.  Whenever flex encounters a comment, it copies
the entire comment verbatim to the generated source code.  Comments may
appear just about anywhere, but with the following exceptions:

   * Comments may not appear in the Rules Section wherever flex is
     expecting a regular expression.  This means comments may not appear
     at the beginning of a line, or immediately following a list of
     scanner states.

   * Comments may not appear on an `%option' line in the Definitions
     Section.

   If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
`/*'.  This rule will work anywhere in the input file.

   All the comments in the following example are valid:

     %{
     /* code block */
     %}

     /* Definitions Section */
     %x STATE_X

     %%
         /* Rules Section */
     ruleA   /* after regex */ { /* code block */ } /* after code block */
             /* Rules Section (indented) */
     <STATE_X>{
     ruleC   ECHO;
     ruleD   ECHO;
     %{
     /* code block */
     %}
     }
     %%
     /* User Code Section */


File: flex.info,  Node: Patterns,  Next: Matching,  Prev: Format,  Up: Top

6 Patterns
**********

The patterns in the input (see *note Rules Section::) are written using
an extended set of regular expressions.  These are:

`x'
     match the character 'x'

`.'
     any character (byte) except newline

`[xyz]'
     a "character class"; in this case, the pattern matches either an
     'x', a 'y', or a 'z'

`[abj-oZ]'
     a "character class" with a range in it; matches an 'a', a 'b', any
     letter from 'j' through 'o', or a 'Z'

`[^A-Z]'
     a "negated character class", i.e., any character but those in the
     class.  In this case, any character EXCEPT an uppercase letter.

`[^A-Z\n]'
     any character EXCEPT an uppercase letter or a newline

`[a-z]{-}[aeiou]'
     the lowercase consonants

`r*'
     zero or more r's, where r is any regular expression

`r+'
     one or more r's

`r?'
     zero or one r's (that is, "an optional r")

`r{2,5}'
     anywhere from two to five r's

`r{2,}'
     two or more r's

`r{4}'
     exactly 4 r's

`{name}'
     the expansion of the `name' definition (*note Format::).

`"[xyz]\"foo"'
     the literal string: `[xyz]"foo'

`\X'
     if X is `a', `b', `f', `n', `r', `t', or `v', then the ANSI-C
     interpretation of `\x'.
     Otherwise, a literal `X' (used to escape
     operators such as `*')

`\0'
     a NUL character (ASCII code 0)

`\123'
     the character with octal value 123

`\x2a'
     the character with hexadecimal value 2a

`(r)'
     match an `r'; parentheses are used to override precedence (see
     below)

`(?r-s:pattern)'
     apply option `r' and omit option `s' while interpreting pattern.
     Options may be zero or more of the characters `i', `s', or `x'.

     `i' means case-insensitive.  `-i' means case-sensitive.

     `s' alters the meaning of the `.' syntax to match any single byte
     whatsoever.  `-s' alters the meaning of `.' to match any byte
     except `\n'.

     `x' ignores comments and whitespace in patterns.  Whitespace is
     ignored unless it is backslash-escaped, contained within `""'s, or
     appears inside a character class.

     The following are all valid:

          (?:foo)         same as  (foo)
          (?i:ab7)        same as  ([aA][bB]7)
          (?-i:ab)        same as  (ab)
          (?s:.)          same as  [\x00-\xFF]
          (?-s:.)         same as  [^\n]
          (?ix-s: a . b)  same as  ([Aa][^\n][bB])
          (?x:a  b)       same as  ("ab")
          (?x:a\ b)       same as  ("a b")
          (?x:a" "b)      same as  ("a b")
          (?x:a[ ]b)      same as  ("a b")
          (?x:a
              /* comment */
              b
              c)          same as  (abc)

`(?# comment )'
     omit everything within `()'.  The first `)' character encountered
     ends the pattern.  It is not possible for the comment to contain
     a `)' character.  The comment may span lines.

`rs'
     the regular expression `r' followed by the regular expression `s';
     called "concatenation"

`r|s'
     either an `r' or an `s'

`r/s'
     an `r' but only if it is followed by an `s'.  The text matched by
     `s' is included when determining whether this rule is the longest
     match, but is then returned to the input before the action is
     executed.  So the action only sees the text matched by `r'.  This
     type of pattern is called "trailing context".
     (There are some
     combinations of `r/s' that flex cannot match correctly.  *Note
     Limitations::, regarding dangerous trailing context.)

`^r'
     an `r', but only at the beginning of a line (i.e., when just
     starting to scan, or right after a newline has been scanned).

`r$'
     an `r', but only at the end of a line (i.e., just before a
     newline).  Equivalent to `r/\n'.

     Note that `flex''s notion of "newline" is exactly whatever the C
     compiler used to compile `flex' interprets `\n' as; in particular,
     on some DOS systems you must either filter out `\r's in the input
     yourself, or explicitly use `r/\r\n' for `r$'.

`<s>r'
     an `r', but only in start condition `s' (see *note Start
     Conditions:: for discussion of start conditions).

`<s1,s2,s3>r'
     same, but in any of start conditions `s1', `s2', or `s3'.

`<*>r'
     an `r' in any start condition, even an exclusive one.

`<<EOF>>'
     an end-of-file.

`<s1,s2><<EOF>>'
     an end-of-file when in start condition `s1' or `s2'

   Note that inside of a character class, all regular expression
operators lose their special meaning except escape (`\') and the
character class operators, `-', `]', and, at the beginning of the
class, `^'.

   The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, `{}', under the documentation for
the `--posix' POSIX compliance option).  For example,

     foo|bar*

   is the same as

     (foo)|(ba(r*))

   since the `*' operator has higher precedence than concatenation, and
concatenation higher than alternation (`|').  This pattern therefore
matches _either_ the string `foo' _or_ the string `ba' followed by
zero-or-more `r''s.
To match `foo' or zero-or-more repetitions of the
string `bar', use:

     foo|(bar)*

   And to match a sequence of zero or more repetitions of `foo' and
`bar':

     (foo|bar)*

   In addition to characters and ranges of characters, character classes
can also contain "character class expressions".  These are expressions
enclosed inside `[:' and `:]' delimiters (which themselves must appear
between the `[' and `]' of the character class.  Other elements may
occur inside the character class, too).  The valid expressions are:

         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

   These expressions all designate a set of characters equivalent to the
corresponding standard C `isXXX' function.  For example, `[:alnum:]'
designates those characters for which `isalnum()' returns true - i.e.,
any alphabetic or numeric character.  Some systems don't provide
`isblank()', so flex defines `[:blank:]' as a blank or a tab.

   For example, the following character classes are all equivalent:

         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:]0-9]
         [a-zA-Z0-9]

   A word of caution.  Character classes are expanded immediately when
seen in the `flex' input.  This means the character classes are
sensitive to the locale in which `flex' is executed, and the resulting
scanner will not be sensitive to the runtime locale.  This may or may
not be desirable.

   * If your scanner is case-insensitive (the `-i' flag), then
     `[:upper:]' and `[:lower:]' are equivalent to `[:alpha:]'.

   * Character classes with ranges, such as `[a-Z]', should be used with
     caution in a case-insensitive scanner if the range spans upper or
     lowercase characters.  Flex does not know if you want to fold all
     upper and lowercase characters together, or if you want the
     literal numeric range specified (with no case folding).
     When in
     doubt, flex will assume that you meant the literal numeric range,
     and will issue a warning.  The exception to this rule is a
     character range such as `[a-z]' or `[S-W]' where it is obvious
     that you want case-folding to occur.  Here are some examples with
     the `-i' flag enabled:

          Range        Result      Literal Range        Alternate Range
          `[a-t]'      ok          `[a-tA-T]'
          `[A-T]'      ok          `[a-tA-T]'
          `[A-t]'      ambiguous   `[A-Z\[\\\]_`a-t]'   `[a-tA-T]'
          `[_-{]'      ambiguous   `[_`a-z{]'           `[_`a-zA-Z{]'
          `[@-C]'      ambiguous   `[@ABC]'             `[@A-Z\[\\\]_`abc]'

   * A negated character class such as the example `[^A-Z]' above
     _will_ match a newline unless `\n' (or an equivalent escape
     sequence) is one of the characters explicitly present in the
     negated character class (e.g., `[^A-Z\n]').  This is unlike how
     many other regular expression tools treat negated character
     classes, but unfortunately the inconsistency is historically
     entrenched.  Matching newlines means that a pattern like `[^"]*'
     can match the entire input unless there's another quote in the
     input.

     Flex allows negation of character class expressions by prepending
     `^' to the POSIX character class name.

              [:^alnum:] [:^alpha:] [:^blank:]
              [:^cntrl:] [:^digit:] [:^graph:]
              [:^lower:] [:^print:] [:^punct:]
              [:^space:] [:^upper:] [:^xdigit:]

     Flex will issue a warning if the expressions `[:^upper:]' and
     `[:^lower:]' appear in a case-insensitive scanner, since their
     meaning is unclear.  The current behavior is to skip them entirely,
     but this may change without notice in future revisions of flex.

   * The `{-}' operator computes the difference of two character
     classes.  For example, `[a-c]{-}[b-z]' represents all the
     characters in the class `[a-c]' that are not in the class `[b-z]'
     (which in this case, is just the single character `a').  The `{-}'
     operator is left associative, so `[abc]{-}[b]{-}[c]' is the same
     as `[a]'.
     Be careful not to accidentally create an empty set,
     which will never match.

   * The `{+}' operator computes the union of two character classes.
     For example, `[a-z]{+}[0-9]' is the same as `[a-z0-9]'.  This
     operator is useful when preceded by the result of a difference
     operation, as in, `[[:alpha:]]{-}[[:lower:]]{+}[q]', which is
     equivalent to `[A-Zq]' in the "C" locale.

   * A rule can have at most one instance of trailing context (the `/'
     operator or the `$' operator).  The start condition, `^', and
     `<<EOF>>' patterns can only occur at the beginning of a pattern,
     and, as well as with `/' and `$', cannot be grouped inside
     parentheses.  A `^' which does not occur at the beginning of a
     rule or a `$' which does not occur at the end of a rule loses its
     special properties and is treated as a normal character.

   * The following are invalid:

          foo/bar$
          <sc1>foo<sc2>bar

     Note that the first of these can be written `foo/bar\n'.

   * The following will result in `$' or `^' being treated as a normal
     character:

          foo|(bar$)
          foo|^bar

     If the desired meaning is a `foo' or a
     `bar'-followed-by-a-newline, the following could be used (the
     special `|' action is explained below, *note Actions::):

          foo      |
          bar$     /* action goes here */

     A similar trick will work for matching a `foo' or a
     `bar'-at-the-beginning-of-a-line.


File: flex.info,  Node: Matching,  Next: Actions,  Prev: Patterns,  Up: Top

7 How the Input Is Matched
**************************

When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns.  If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input).
If it finds two or more matches of
the same length, the rule listed first in the `flex' input file is
chosen.

   Once the match is determined, the text corresponding to the match
(called the "token") is made available in the global character pointer
`yytext', and its length in the global integer `yyleng'.  The "action"
corresponding to the matched pattern is then executed (*note
Actions::), and then the remaining input is scanned for another match.

   If no match is found, then the "default rule" is executed: the next
character in the input is considered matched and copied to the standard
output.  Thus, the simplest valid `flex' input is:

         %%

   which generates a scanner that simply copies its input (one
character at a time) to its output.

   Note that `yytext' can be defined in two different ways: either as a
character _pointer_ or as a character _array_.  You can control which
definition `flex' uses by including one of the special directives
`%pointer' or `%array' in the first (definitions) section of your flex
input.  The default is `%pointer', unless you use the `-l' lex
compatibility option, in which case `yytext' will be an array.  The
advantage of using `%pointer' is substantially faster scanning and no
buffer overflow when matching very large tokens (unless you run out of
dynamic memory).  The disadvantage is that you are restricted in how
your actions can modify `yytext' (*note Actions::), and calls to the
`unput()' function destroy the present contents of `yytext', which can
be a considerable porting headache when moving between different `lex'
versions.

   The advantage of `%array' is that you can then modify `yytext' to
your heart's content, and calls to `unput()' do not destroy `yytext'
(*note Actions::).
Furthermore, existing `lex' programs sometimes access `yytext'
externally using declarations of the form:

     extern char yytext[];

   This definition is erroneous when used with `%pointer', but correct
for `%array'.

   The `%array' declaration defines `yytext' to be an array of `YYLMAX'
characters, which defaults to a fairly large value. You can change the
size by simply #define'ing `YYLMAX' to a different value in the first
section of your `flex' input. As mentioned above, with `%pointer'
`yytext' grows dynamically to accommodate large tokens. While this means
your `%pointer' scanner can accommodate very large tokens (such as
matching entire blocks of comments), bear in mind that each time the
scanner must resize `yytext' it also must rescan the entire token from
the beginning, so matching such tokens can prove slow. `yytext'
presently does _not_ dynamically grow if a call to `unput()' results in
too much text being pushed back; instead, a run-time error results.

   Also note that you cannot use `%array' with C++ scanner classes
(*note Cxx::).


File: flex.info, Node: Actions, Next: Generated Scanner, Prev: Matching, Up: Top

8 Actions
*********

Each pattern in a rule has a corresponding "action", which can be any
arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of `zap me' from its input:

     %%
     "zap me"

   This example will copy all other characters in the input to the
output since they will be matched by the default rule.
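
   The combined effect of an empty action and the default rule can be
modelled in plain C. The following is only a sketch of the observable
behavior of the `zap me' scanner on an in-memory string; it is not the
code that `flex' generates, and the name `zap_filter' is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of flex): mimics the "zap me" scanner
 * on a NUL-terminated string.  `out' must have room for strlen(in)+1
 * bytes.  Returns the number of characters written, excluding the NUL.
 */
size_t zap_filter(const char *in, char *out)
{
    static const char pat[] = "zap me";
    const size_t patlen = sizeof pat - 1;
    size_t n = 0;

    while (*in != '\0') {
        if (strncmp(in, pat, patlen) == 0)
            in += patlen;       /* rule with empty action: token discarded */
        else
            out[n++] = *in++;   /* default rule: copy one character */
    }
    out[n] = '\0';
    return n;
}
```

   Called on the string `a zap me b', the function leaves `a  b' (with
two blanks) in `out', just as the scanner would write it to its output.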

   Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:

     %%
     [ \t]+        putchar( ' ' );
     [ \t]+$       /* ignore this token */

   If the action contains a `{', then the action spans till the
balancing `}' is found, and the action may cross multiple lines.
`flex' knows about C strings and comments and won't be fooled by braces
found within them, but also allows actions to begin with `%{' and will
consider the action to be all the text up to the next `%}' (regardless
of ordinary braces inside the action).

   An action consisting solely of a vertical bar (`|') means "same as
the action for the next rule". See below for an illustration.

   Actions can include arbitrary C code, including `return' statements
to return a value to whatever routine called `yylex()'. Each time
`yylex()' is called it continues processing tokens from where it last
left off until it either reaches the end of the file or executes a
return.

   Actions are free to modify `yytext' except for lengthening it
(adding characters to its end would overwrite later characters in the
input stream). This restriction does not apply when using `%array'
(*note Matching::). In that case, `yytext' may be freely modified in
any way.

   Actions are free to modify `yyleng' except they should not do so if
the action also includes use of `yymore()' (see below).

   There are a number of special directives which can be included
within an action:

`ECHO'
     copies `yytext' to the scanner's output.

`BEGIN'
     followed by the name of a start condition places the scanner in the
     corresponding start condition (see below).

`REJECT'
     directs the scanner to proceed on to the "second best" rule which
     matched the input (or a prefix of the input).
     The rule is chosen as described above in *note Matching::, and
     `yytext' and `yyleng' set up appropriately. It may either be one
     which matched as much text as the originally chosen rule but came
     later in the `flex' input file, or one which matched less text.
     For example, the following will both count the words in the input
     and call the routine `special()' whenever `frob' is seen:

                  int word_count = 0;
          %%

          frob        special(); REJECT;
          [^ \t\n]+   ++word_count;

     Without the `REJECT', any occurrences of `frob' in the input would
     not be counted as words, since the scanner normally executes only
     one action per token. Multiple uses of `REJECT' are allowed, each
     one finding the next best choice to the currently active rule. For
     example, when the following scanner scans the token `abcd', it will
     write `abcdabcaba' to the output:

          %%
          a        |
          ab       |
          abc      |
          abcd     ECHO; REJECT;
          .|\n     /* eat up any unmatched character */

     The first three rules share the fourth's action since they use the
     special `|' action.

     `REJECT' is a particularly expensive feature in terms of scanner
     performance; if it is used in _any_ of the scanner's actions it
     will slow down _all_ of the scanner's matching. Furthermore,
     `REJECT' cannot be used with the `-Cf' or `-CF' options (*note
     Scanner Options::).

     Note also that unlike the other special actions, `REJECT' is a
     _branch_. Code immediately following it in the action will _not_
     be executed.

`yymore()'
     tells the scanner that the next time it matches a rule, the
     corresponding token should be _appended_ onto the current value of
     `yytext' rather than replacing it.
     For example, given the input `mega-kludge' the following will
     write `mega-mega-kludge' to the output:

          %%
          mega-    ECHO; yymore();
          kludge   ECHO;

     First `mega-' is matched and echoed to the output. Then `kludge'
     is matched, but the previous `mega-' is still hanging around at the
     beginning of `yytext' so the `ECHO' for the `kludge' rule will
     actually write `mega-kludge'.

   Two notes regarding use of `yymore()'. First, `yymore()' depends on
the value of `yyleng' correctly reflecting the size of the current
token, so you must not modify `yyleng' if you are using `yymore()'.
Second, the presence of `yymore()' in the scanner's action entails a
minor performance penalty in the scanner's matching speed.

   `yyless(n)' returns all but the first `n' characters of the current
token back to the input stream, where they will be rescanned when the
scanner looks for the next match. `yytext' and `yyleng' are adjusted
appropriately (e.g., `yyleng' will now be equal to `n'). For example,
on the input `foobar' the following will write out `foobarbar':

     %%
     foobar    ECHO; yyless(3);
     [a-z]+    ECHO;

   An argument of 0 to `yyless()' will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using `BEGIN', for example), this will
result in an endless loop.

   Note that `yyless()' is a macro and can only be used in the flex
input file, not from other source files.

   `unput(c)' puts the character `c' back onto the input stream. It
will be the next character scanned. The following action will take the
current token and cause it to be rescanned enclosed in parentheses.

     {
     int i;
     /* Copy yytext because unput() trashes yytext */
     char *yycopy = strdup( yytext );
     unput( ')' );
     for ( i = yyleng - 1; i >= 0; --i )
         unput( yycopy[i] );
     unput( '(' );
     free( yycopy );
     }

   Note that since each `unput()' puts the given character back at the
_beginning_ of the input stream, pushing back strings must be done
back-to-front.

   An important potential problem when using `unput()' is that if you
are using `%pointer' (the default), a call to `unput()' _destroys_ the
contents of `yytext', starting with its rightmost character and
devouring one character to the left with each call. If you need the
value of `yytext' preserved after a call to `unput()' (as in the above
example), you must either first copy it elsewhere, or build your
scanner using `%array' instead (*note Matching::).

   Finally, note that you cannot put back `EOF' to attempt to mark the
input stream with an end-of-file.

   `input()' reads the next character from the input stream. For
example, the following is one way to eat up C comments:

     %%
     "/*"        {
                 register int c;

                 for ( ; ; )
                     {
                     while ( (c = input()) != '*' &&
                             c != EOF )
                         ;    /* eat up text of comment */

                     if ( c == '*' )
                         {
                         while ( (c = input()) == '*' )
                             ;
                         if ( c == '/' )
                             break;    /* found the end */
                         }

                     if ( c == EOF )
                         {
                         error( "EOF in comment" );
                         break;
                         }
                     }
                 }

   (Note that if the scanner is compiled using `C++', then `input()' is
instead referred to as `yyinput()', in order to avoid a name clash with
the `C++' stream by the name of `input'.)

   `YY_FLUSH_BUFFER;' flushes the scanner's internal buffer so that the
next time the scanner attempts to match a token, it will first refill
the buffer using `YY_INPUT()' (*note Generated Scanner::).
This action is a special case of the more general `yy_flush_buffer()'
function, described below (*note Multiple Input Buffers::).

   `yyterminate()' can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating "all done". By default, `yyterminate()' is also
called when an end-of-file is encountered. It is a macro and may be
redefined.


File: flex.info, Node: Generated Scanner, Next: Start Conditions, Prev: Actions, Up: Top

9 The Generated Scanner
***********************

The output of `flex' is the file `lex.yy.c', which contains the
scanning routine `yylex()', a number of tables used by it for matching
tokens, and a number of auxiliary routines and macros. By default,
`yylex()' is declared as follows:

     int yylex()
         {
         ... various definitions and the actions in here ...
         }

   (If your environment supports function prototypes, then it will be
`int yylex( void )'.) This definition may be changed by defining the
`YY_DECL' macro. For example, you could use:

     #define YY_DECL float lexscan( a, b ) float a, b;

   to give the scanning routine the name `lexscan', returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).

   `flex' generates `C99' function definitions by default. However,
`flex' does have the ability to generate obsolete, er, `traditional',
function definitions. This is to support bootstrapping gcc on old
systems. Unfortunately, traditional definitions prevent us from using
any standard data types smaller than int (such as short, char, or bool)
as function arguments.
For this reason, future versions of `flex' may generate standard C99
code only, leaving K&R-style functions to the historians. Currently, if
you do *not* want `C99' definitions, then you must use `%option
noansi-definitions'.

   Whenever `yylex()' is called, it scans tokens from the global input
file `yyin' (which defaults to stdin). It continues until it either
reaches an end-of-file (at which point it returns the value 0) or one
of its actions executes a `return' statement.

   If the scanner reaches an end-of-file, subsequent calls are undefined
unless either `yyin' is pointed at a new input file (in which case
scanning continues from that file), or `yyrestart()' is called.
`yyrestart()' takes one argument, a `FILE *' pointer (which can be
NULL, if you've set up `YY_INPUT' to scan from a source other than
`yyin'), and initializes `yyin' for scanning from that file.
Essentially there is no difference between just assigning `yyin' to a
new input file or using `yyrestart()' to do so; the latter is available
for compatibility with previous versions of `flex', and because it can
be used to switch input files in the middle of scanning. It can also
be used to throw away the current input buffer, by calling it with an
argument of `yyin'; but it would be better to use `YY_FLUSH_BUFFER'
(*note Actions::). Note that `yyrestart()' does _not_ reset the start
condition to `INITIAL' (*note Start Conditions::).

   If `yylex()' stops scanning due to executing a `return' statement in
one of the actions, the scanner may then be called again and it will
resume scanning where it left off.

   By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple `getc()' calls to read characters from
`yyin'. The nature of how it gets its input can be controlled by
defining the `YY_INPUT' macro.
The calling sequence for `YY_INPUT()' is
`YY_INPUT(buf,result,max_size)'. Its action is to place up to
`max_size' characters in the character array `buf' and return in the
integer variable `result' either the number of characters read or the
constant `YY_NULL' (0 on Unix systems) to indicate `EOF'. The default
`YY_INPUT' reads from the global file-pointer `yyin'.

   Here is a sample definition of `YY_INPUT' (in the definitions
section of the input file):

     %{
     #define YY_INPUT(buf,result,max_size) \
         { \
         int c = getchar(); \
         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
         }
     %}

   This definition will change the input processing to occur one
character at a time.

   When the scanner receives an end-of-file indication from `YY_INPUT',
it then checks the `yywrap()' function. If `yywrap()' returns false
(zero), then it is assumed that the function has gone ahead and set up
`yyin' to point to another input file, and scanning continues. If it
returns true (non-zero), then the scanner terminates, returning 0 to
its caller. Note that in either case, the start condition remains
unchanged; it does _not_ revert to `INITIAL'.

   If you do not supply your own version of `yywrap()', then you must
either use `%option noyywrap' (in which case the scanner behaves as
though `yywrap()' returned 1), or you must link with `-lfl' to obtain
the default version of the routine, which always returns 1.

   For scanning from in-memory buffers (e.g., scanning strings), see
*note Scanning Strings::. *Note Multiple Input Buffers::.

   The scanner writes its `ECHO' output to the `yyout' global (default,
`stdout'), which may be redefined by the user simply by assigning it to
some other `FILE' pointer.
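
   As a further illustration of the `YY_INPUT' contract described
earlier in this chapter (fill `buf' with at most `max_size' characters
and report the count, or 0 at end of input), here is a minimal sketch of
a block reader that serves characters from a string. The names
`set_source' and `read_source' are hypothetical and are not provided by
`flex'; the macro shown in the trailing comment is one possible way to
wire such a reader in, under the assumption that `result' can hold the
returned count.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: a string plus a read cursor. */
static const char *src_text = "";
static size_t src_pos = 0;

/* Start reading from the beginning of `text'. */
void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/* Copy up to max_size bytes into buf.  Returns the number of bytes
 * copied, or 0 once the source is exhausted (0 plays the role of
 * YY_NULL here).
 */
size_t read_source(char *buf, size_t max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return n;
}

/* In the definitions section of a flex input file this could be used as:
 *
 *   #define YY_INPUT(buf,result,max_size) \
 *       { result = read_source( buf, max_size ); }
 */
```

   Unlike the `getchar()' example, this reader hands over a whole block
per call, which keeps the efficiency benefit of block reads.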


File: flex.info, Node: Start Conditions, Next: Multiple Input Buffers, Prev: Generated Scanner, Up: Top

10 Start Conditions
*******************

`flex' provides a mechanism for conditionally activating rules. Any
rule whose pattern is prefixed with `<sc>' will only be active when the
scanner is in the "start condition" named `sc'. For example,

     <STRING>[^"]*        { /* eat up the string body ... */
                 ...
                 }

   will be active only when the scanner is in the `STRING' start
condition, and

     <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                 ...
                 }

   will be active only when the current start condition is either
`INITIAL', `STRING', or `QUOTE'.

   Start conditions are declared in the definitions (first) section of
the input using unindented lines beginning with either `%s' or `%x'
followed by a list of names. The former declares "inclusive" start
conditions, the latter "exclusive" start conditions. A start condition
is activated using the `BEGIN' action. Until the next `BEGIN' action
is executed, rules with the given start condition will be active and
rules with other start conditions will be inactive. If the start
condition is inclusive, then rules with no start conditions at all will
also be active. If it is exclusive, then _only_ rules qualified with
the start condition will be active. A set of rules contingent on the
same exclusive start condition describe a scanner which is independent
of any of the other rules in the `flex' input. Because of this,
exclusive start conditions make it easy to specify "mini-scanners"
which scan portions of the input that are syntactically different from
the rest (e.g., comments).

   If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two.
The set of rules:

     %s example
     %%

     <example>foo   do_something();

     bar            something_else();

   is equivalent to

     %x example
     %%

     <example>foo   do_something();

     <INITIAL,example>bar    something_else();

   Without the `<INITIAL,example>' qualifier, the `bar' pattern in the
second example wouldn't be active (i.e., couldn't match) when in start
condition `example'. If we just used `<example>' to qualify `bar',
though, then it would only be active in `example' and not in `INITIAL',
while in the first example it's active in both, because in the first
example the `example' start condition is an inclusive `(%s)' start
condition.

   Also note that the special start-condition specifier `<*>' matches
every start condition. Thus, the above example could also have been
written:

     %x example
     %%

     <example>foo   do_something();

     <*>bar    something_else();

   The default rule (to `ECHO' any unmatched character) remains active
in start conditions. It is equivalent to:

     <*>.|\n     ECHO;

   `BEGIN(0)' returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition `INITIAL', so `BEGIN(INITIAL)' is equivalent to
`BEGIN(0)'. (The parentheses around the start condition name are not
required but are considered good style.)

   `BEGIN' actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the `SPECIAL' start condition whenever `yylex()' is called and
the global variable `enter_special' is true:

             int enter_special;

     %x SPECIAL
     %%
             if ( enter_special )
                 BEGIN(SPECIAL);

     <SPECIAL>blahblahblah
     ...more rules follow...
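
   The activation rule for inclusive and exclusive start conditions
described earlier in this chapter can be written down as a small
predicate. The sketch below is a model for reasoning about which rules
are active; it is not how `flex' implements start conditions, and all
of the names in it (`SC_BIT', `SC_ANY', `rule_is_active' and the enum
constants) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical numbering for the `example' scanners shown earlier. */
enum { SC_INITIAL = 0, SC_EXAMPLE = 1, SC_COUNT = 2 };

#define SC_BIT(sc) (1u << (sc))
#define SC_ANY     (SC_BIT(SC_COUNT) - 1u)   /* models the <*> specifier */

/* rule_mask:    bit set for each start condition named in the rule's
 *               <...> prefix; 0 for a rule with no prefix.
 * current:      the start condition the scanner is in.
 * is_exclusive: is_exclusive[sc] is true if sc was declared with %x.
 */
bool rule_is_active(unsigned rule_mask, int current,
                    const bool is_exclusive[])
{
    if (rule_mask != 0)
        return (rule_mask & SC_BIT(current)) != 0;

    /* Unqualified rules are active unless the condition is exclusive. */
    return !is_exclusive[current];
}
```

   With `example' declared inclusive, an unqualified `bar' rule is
active in both `INITIAL' and `example'; with `example' declared
exclusive, the same rule is active only in `INITIAL' unless it is
qualified with `<INITIAL,example>' or `<*>'.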

   To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like `123.456'. By
default it will treat it as three tokens, the integer `123', a dot
(`.'), and the integer `456'. But if the string is preceded earlier in
the line by the string `expect-floats' it will treat it as a single
token, the floating-point number `123.456':

     %{
     #include <math.h>
     #include <stdlib.h>   /* atof(), atoi() */
     %}
     %s expect

     %%
     expect-floats        BEGIN(expect);

     <expect>[0-9]+"."[0-9]+      {
                 printf( "found a float, = %f\n",
                         atof( yytext ) );
                 }
     <expect>\n           {
                 /* that's the end of the line, so
                  * we need another "expect-floats"
                  * before we'll recognize any more
                  * numbers
                  */
                 BEGIN(INITIAL);
                 }

     [0-9]+      {
                 printf( "found an integer, = %d\n",
                         atoi( yytext ) );
                 }

     "."         printf( "found a dot\n" );

   Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*        /* eat anything that's not a '*' */
     <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   This scanner goes to a bit of trouble to match as much text as
possible with each rule. In general, when attempting to write a
high-speed scanner try to match as much as possible in each rule, as
it's a big win.

   Note that start condition names are really integer values and can
be stored as such. Thus, the above could be extended in the following
fashion:

     %x comment foo
     %%
             int line_num = 1;
             int comment_caller;

     "/*"         {
                  comment_caller = INITIAL;
                  BEGIN(comment);
                  }

     ...

     <foo>"/*"    {
                  comment_caller = foo;
                  BEGIN(comment);
                  }

     <comment>[^*\n]*        /* eat anything that's not a '*' */
     <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(comment_caller);

   Furthermore, you can access the current start condition using the
integer-valued `YY_START' macro. For example, the above assignments to
`comment_caller' could instead be written

     comment_caller = YY_START;

   Flex provides `YYSTATE' as an alias for `YY_START' (since that is
what's used by AT&T `lex').

   For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header. *Note
option-header::. *Note option-prefix::.

   Finally, here's an example of how to match C-style quoted strings
using exclusive start conditions, including expanded escape sequences
(but not including checking for a string that's too long):

     %x str

     %%
             char string_buf[MAX_STR_CONST];
             char *string_buf_ptr;


     \"      string_buf_ptr = string_buf; BEGIN(str);

     <str>\"        { /* saw closing quote - all done */
             BEGIN(INITIAL);
             *string_buf_ptr = '\0';
             /* return string constant token type and
              * value to parser
              */
             }

     <str>\n        {
             /* error - unterminated string constant */
             /* generate error message */
             }

     <str>\\[0-7]{1,3} {
             /* octal escape sequence */
             unsigned int result;

             (void) sscanf( yytext + 1, "%o", &result );

             if ( result > 0xff )
                     {
                     /* error, constant is out-of-bounds */
                     }
             else
                     *string_buf_ptr++ = result;
             }

     <str>\\[0-9]+ {
             /* generate error - bad escape sequence; something
              * like '\48' or '\0777777'
              */
             }

     <str>\\n  *string_buf_ptr++ = '\n';
     <str>\\t  *string_buf_ptr++ = '\t';
     <str>\\r  *string_buf_ptr++ = '\r';
     <str>\\b  *string_buf_ptr++ = '\b';
     <str>\\f  *string_buf_ptr++ = '\f';

     <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

     <str>[^\\\n\"]+        {
             char *yptr = yytext;

             while ( *yptr )
                     *string_buf_ptr++ = *yptr++;
             }

   Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition "scope". A start condition scope is begun with:

     <SCs>{

   where `SCs' is a list of one or more start conditions. Inside the
start condition scope, every rule automatically has the prefix `<SCs>'
applied to it, until a `}' which matches the initial `{'. So, for
example,

     <ESC>{
         "\\n"   return '\n';
         "\\r"   return '\r';
         "\\f"   return '\f';
         "\\0"   return '\0';
     }

   is equivalent to:

     <ESC>"\\n"  return '\n';
     <ESC>"\\r"  return '\r';
     <ESC>"\\f"  return '\f';
     <ESC>"\\0"  return '\0';

   Start condition scopes may be nested.

   The following routines are available for manipulating stacks of
start conditions:

 -- Function: void yy_push_state ( int `new_state' )
     pushes the current start condition onto the top of the start
     condition stack and switches to `new_state' as though you had used
     `BEGIN new_state' (recall that start condition names are also
     integers).

 -- Function: void yy_pop_state ()
     pops the top of the stack and switches to it via `BEGIN'.

 -- Function: int yy_top_state ()
     returns the top of the stack without altering the stack's contents.

   The start condition stack grows dynamically and so has no built-in
size limitation. If memory is exhausted, program execution aborts.
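
   The semantics of the three stack routines can be summarized by the
following stand-alone model. It is a sketch written for this
explanation, not the code `flex' generates: the `model_' names are
hypothetical, and the underflow check is an assumption of the model.
Note in particular that pushing saves the _current_ condition, not the
new one.

```c
#include <stdio.h>
#include <stdlib.h>

static int    current_state = 0;   /* 0 plays the role of INITIAL */
static int   *stack = NULL;        /* saved start conditions */
static size_t depth = 0, capacity = 0;

int model_current_state(void) { return current_state; }

/* Save the current condition, then switch to new_state. */
void model_push_state(int new_state)
{
    if (depth == capacity) {
        size_t new_cap = capacity ? capacity * 2 : 8;
        int *p = realloc(stack, new_cap * sizeof *p);
        if (p == NULL) {            /* memory exhausted: abort */
            fputs("out of memory\n", stderr);
            abort();
        }
        stack = p;
        capacity = new_cap;
    }
    stack[depth++] = current_state;
    current_state = new_state;
}

/* Switch back to the most recently saved condition. */
void model_pop_state(void)
{
    if (depth == 0) {
        fputs("start-condition stack underflow\n", stderr);
        abort();
    }
    current_state = stack[--depth];
}

/* Report the most recently saved condition without popping it. */
int model_top_state(void)
{
    if (depth == 0) {
        fputs("start-condition stack is empty\n", stderr);
        abort();
    }
    return stack[depth - 1];
}
```

   After pushing state 1 and then state 2 from the initial state, the
current state is 2 and the top of the stack is 1; two pops return the
model to state 0.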

   To use start condition stacks, your scanner must include a `%option
stack' directive (*note Scanner Options::).


File: flex.info, Node: Multiple Input Buffers, Next: EOF, Prev: Start Conditions, Up: Top

11 Multiple Input Buffers
*************************

Some scanners (such as those which support "include" files) require
reading from several input streams. As `flex' scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a `YY_INPUT()' which is sensitive to the
scanning context. `YY_INPUT()' is only called when the scanner reaches
the end of its buffer, which may be a long time after scanning a
statement such as an `include' statement which requires switching the
input source.

   To negotiate these sorts of problems, `flex' provides a mechanism
for creating and switching between multiple input buffers. An input
buffer is created by using:

 -- Function: YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )

   which takes a `FILE' pointer and a size and creates a buffer
associated with the given file and large enough to hold `size'
characters (when in doubt, use `YY_BUF_SIZE' for the size). It returns
a `YY_BUFFER_STATE' handle, which may then be passed to other routines
(see below). The `YY_BUFFER_STATE' type is a pointer to an opaque
`struct yy_buffer_state' structure, so you may safely initialize
`YY_BUFFER_STATE' variables to `((YY_BUFFER_STATE) 0)' if you wish, and
also refer to the opaque structure in order to correctly declare input
buffers in source files other than that of your scanner. Note that the
`FILE' pointer in the call to `yy_create_buffer' is only used as the
value of `yyin' seen by `YY_INPUT'. If you redefine `YY_INPUT()' so it
no longer uses `yyin', then you can safely pass a NULL `FILE' pointer to
`yy_create_buffer'.
You select a particular buffer to scan from using:

 -- Function: void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )

   The above function switches the scanner's input buffer so subsequent
tokens will come from `new_buffer'. Note that `yy_switch_to_buffer()'
may be used by `yywrap()' to set things up for continued scanning,
instead of opening a new file and pointing `yyin' at it. If you are
looking for a stack of input buffers, then you want to use
`yypush_buffer_state()' instead of this function. Note also that
switching input sources via either `yy_switch_to_buffer()' or
`yywrap()' does _not_ change the start condition.

 -- Function: void yy_delete_buffer ( YY_BUFFER_STATE buffer )

   is used to reclaim the storage associated with a buffer. (`buffer'
can be NULL, in which case the routine does nothing.)

 -- Function: void yypush_buffer_state ( YY_BUFFER_STATE buffer )

   This function pushes the new buffer state onto an internal stack.
The pushed state becomes the new current state. The stack is maintained
by flex and will grow as required. This function is intended to be used
instead of `yy_switch_to_buffer', when you want to change states, but
preserve the current state for later use.

 -- Function: void yypop_buffer_state ( )

   This function removes the current state from the top of the stack,
and deletes it by calling `yy_delete_buffer'. The next state on the
stack, if any, becomes the new current state.

   You can also clear the current contents of a buffer using:

 -- Function: void yy_flush_buffer ( YY_BUFFER_STATE buffer )

   This function discards the buffer's contents, so the next time the
scanner attempts to match a token from the buffer, it will first fill
the buffer anew using `YY_INPUT()'.

 -- Function: YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )

   is an alias for `yy_create_buffer()', provided for compatibility
with the C++ use of `new' and `delete' for creating and destroying
dynamic objects.

   The `YY_CURRENT_BUFFER' macro returns a `YY_BUFFER_STATE' handle to
the current buffer. It should not be used as an lvalue.

   Here are two examples of using these features for writing a scanner
which expands include files (the `<<EOF>>' feature is discussed below).

   This first example uses `yypush_buffer_state' and
`yypop_buffer_state'. Flex maintains the stack internally.

     /* the "incl" state is used for picking up the name
      * of an include file
      */
     %x incl
     %%
     include             BEGIN(incl);

     [a-z]+              ECHO;
     [^a-z\n]*\n?        ECHO;

     <incl>[ \t]*      /* eat the whitespace */
     <incl>[^ \t\n]+   { /* got the include file name */
             yyin = fopen( yytext, "r" );

             if ( ! yyin )
                 error( ... );

             yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

             BEGIN(INITIAL);
             }

     <<EOF>> {
             yypop_buffer_state();

             if ( !YY_CURRENT_BUFFER )
                 {
                 yyterminate();
                 }
             }

   The second example, below, does the same thing as the previous
example did, but manages its own input buffer stack manually (instead
of letting flex do it).

     /* the "incl" state is used for picking up the name
      * of an include file
      */
     %x incl

     %{
     #define MAX_INCLUDE_DEPTH 10
     YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
     int include_stack_ptr = 0;
     %}

     %%
     include             BEGIN(incl);

     [a-z]+              ECHO;
     [^a-z\n]*\n?        ECHO;

     <incl>[ \t]*      /* eat the whitespace */
     <incl>[^ \t\n]+   { /* got the include file name */
             if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                 {
                 fprintf( stderr, "Includes nested too deeply\n" );
                 exit( 1 );
                 }

             include_stack[include_stack_ptr++] =
                 YY_CURRENT_BUFFER;

             yyin = fopen( yytext, "r" );

             if ( ! yyin )
                 error( ... );

             yy_switch_to_buffer(
                 yy_create_buffer( yyin, YY_BUF_SIZE ) );

             BEGIN(INITIAL);
             }

     <<EOF>> {
             if ( --include_stack_ptr < 0 )
                 {
                 yyterminate();
                 }

             else
                 {
                 yy_delete_buffer( YY_CURRENT_BUFFER );
                 yy_switch_to_buffer(
                      include_stack[include_stack_ptr] );
                 }
             }

   The following routines are available for setting up input buffers for
scanning in-memory strings instead of files. All of them create a new
input buffer for scanning the string, and return a corresponding
`YY_BUFFER_STATE' handle (which you should delete with
`yy_delete_buffer()' when done with it). They also switch to the new
buffer using `yy_switch_to_buffer()', so the next call to `yylex()'
will start scanning the string.

 -- Function: YY_BUFFER_STATE yy_scan_string ( const char *str )
     scans a NUL-terminated string.

 -- Function: YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int
          len )
     scans `len' bytes (including possibly `NUL's) starting at location
     `bytes'.

   Note that both of these functions create and scan a _copy_ of the
string or bytes. (This may be desirable, since `yylex()' modifies the
contents of the buffer it is scanning.) You can avoid the copy by
using:

 -- Function: YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t
          size)
     which scans in place the buffer starting at `base', consisting of
     `size' bytes, the last two bytes of which _must_ be
     `YY_END_OF_BUFFER_CHAR' (ASCII NUL).
     These last two bytes are not scanned; thus, scanning consists of
     `base[0]' through `base[size-3]', inclusive.

   If you fail to set up `base' in this manner (i.e., forget the final
two `YY_END_OF_BUFFER_CHAR' bytes), then `yy_scan_buffer()' returns a
NULL pointer instead of creating a new input buffer.

 -- Data type: yy_size_t
     is an integral type to which you can cast an integer expression
     reflecting the size of the buffer.


File: flex.info,  Node: EOF,  Next: Misc Macros,  Prev: Multiple Input Buffers,  Up: Top

12 End-of-File Rules
********************

The special rule `<<EOF>>' indicates actions which are to be taken when
an end-of-file is encountered and `yywrap()' returns non-zero (i.e.,
indicates no further files to process).  The action must finish by
doing one of the following things:

   * assigning `yyin' to a new input file (in previous versions of
     `flex', after doing the assignment you had to call the special
     action `YY_NEW_FILE'; this is no longer necessary);

   * executing a `return' statement;

   * executing the special `yyterminate()' action;

   * or, switching to a new buffer using `yy_switch_to_buffer()' as
     shown in the example above.

   <<EOF>> rules may not be used with other patterns; they may only be
qualified with a list of start conditions.  If an unqualified <<EOF>>
rule is given, it applies to _all_ start conditions which do not
already have <<EOF>> actions.  To specify an <<EOF>> rule for only the
initial start condition, use:

     <INITIAL><<EOF>>

   These rules are useful for catching things like unclosed comments.
An example:

     %x quote
     %%

     ...other rules for dealing with quotes...

     <quote><<EOF>>   {
              error( "unterminated quote" );
              yyterminate();
              }
     <<EOF>>  {
              if ( *++filelist )
                  yyin = fopen( *filelist, "r" );
              else
                  yyterminate();
              }


File: flex.info,  Node: Misc Macros,  Next: User Values,  Prev: EOF,  Up: Top

13 Miscellaneous Macros
***********************

The macro `YY_USER_ACTION' can be defined to provide an action which is
always executed prior to the matched rule's action.  For example, it
could be #define'd to call a routine to convert yytext to lower-case.
When `YY_USER_ACTION' is invoked, the variable `yy_act' gives the
number of the matched rule (rules are numbered starting with 1).
Suppose you want to profile how often each of your rules is matched.
The following would do the trick:

     #define YY_USER_ACTION ++ctr[yy_act]

   where `ctr' is an array to hold the counts for the different rules.
Note that the macro `YY_NUM_RULES' gives the total number of rules
(including the default rule), even if you use `-s', so a correct
declaration for `ctr' is:

     int ctr[YY_NUM_RULES];

   The macro `YY_USER_INIT' may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done).  For example, it could be used to
call a routine to read in a data table or open a logging file.

   The macro `yy_set_interactive(is_interactive)' can be used to
control whether the current buffer is considered "interactive".  An
interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the `-I' flag in *note
Scanner Options::).  A non-zero value in the macro invocation marks the
buffer as interactive, a zero value as non-interactive.
Note that use of this macro overrides `%option always-interactive' or
`%option never-interactive' (*note Scanner Options::).
`yy_set_interactive()' must be invoked prior to beginning to scan the
buffer that is (or is not) to be considered interactive.

   The macro `yy_set_bol(at_bol)' can be used to control whether the
current buffer's scanning context for the next token match is done as
though at the beginning of a line.  A non-zero macro argument makes
rules anchored with `^' active, while a zero argument makes `^' rules
inactive.

   The macro `YY_AT_BOL()' returns true if the next token scanned from
the current buffer will have `^' rules active, false otherwise.

   In the generated scanner, the actions are all gathered in one large
switch statement and separated using `YY_BREAK', which may be
redefined.  By default, it is simply a `break', to separate each rule's
action from the following rule's.  Redefining `YY_BREAK' allows, for
example, C++ users to #define YY_BREAK to do nothing (while being very
careful that every rule ends with a `break' or a `return'!) to avoid
unreachable-statement warnings in cases where a rule's action ends
with `return' and the `YY_BREAK' that follows can never be reached.


File: flex.info,  Node: User Values,  Next: Yacc,  Prev: Misc Macros,  Up: Top

14 Values Available To the User
*******************************

This chapter summarizes the various values available to the user in the
rule actions.

`char *yytext'
     holds the text of the current token.  It may be modified but not
     lengthened (you cannot append characters to the end).

     If the special directive `%array' appears in the first section of
     the scanner description, then `yytext' is instead declared `char
     yytext[YYLMAX]', where `YYLMAX' is a macro definition that you can
     redefine in the first section if you don't like the default value
     (generally 8KB).  Using `%array' results in somewhat slower
     scanners, but the value of `yytext' becomes immune to calls to
     `unput()', which potentially destroy its value when `yytext' is a
     character pointer.  The opposite of `%array' is `%pointer', which
     is the default.

     You cannot use `%array' when generating C++ scanner classes (the
     `-+' flag).

`int yyleng'
     holds the length of the current token.

`FILE *yyin'
     is the file which by default `flex' reads from.  It may be
     reassigned, but doing so only makes sense before scanning begins
     or after an EOF has been encountered.  Changing it in the midst of
     scanning will have unexpected results since `flex' buffers its
     input; use `yyrestart()' instead.  Once scanning terminates
     because an end-of-file has been seen, you can point `yyin' at the
     new input file and then call the scanner again to continue
     scanning.

`void yyrestart( FILE *new_file )'
     may be called to point `yyin' at the new input file.  The
     switch-over to the new file is immediate (any previously
     buffered-up input is lost).  Note that calling `yyrestart()' with
     `yyin' as an argument thus throws away the current input buffer
     and continues scanning the same input file.

`FILE *yyout'
     is the file to which `ECHO' actions are done.  It can be reassigned
     by the user.

`YY_CURRENT_BUFFER'
     returns a `YY_BUFFER_STATE' handle to the current buffer.

`YY_START'
     returns an integer value corresponding to the current start
     condition.  You can subsequently use this value with `BEGIN' to
     return to that start condition.
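
   As a rough illustration of the rule that `yytext' may be modified
but not lengthened, the following C sketch folds a token to lower case
in place.  The helper name `fold_token_lower' is hypothetical and is
not part of flex; its two arguments merely stand in for `yytext' and
`yyleng'.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: lower-cases the first `len'
 * bytes of `text' in place.  It never writes past text[len - 1], so
 * the token is modified but not lengthened. */
static void fold_token_lower(char *text, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        text[i] = (char)tolower((unsigned char)text[i]);
}
```

   Under these assumptions, a rule action could call
`fold_token_lower( yytext, (size_t) yyleng )' before using the token.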


File: flex.info,  Node: Yacc,  Next: Scanner Options,  Prev: User Values,  Up: Top

15 Interfacing with Yacc
************************

One of the main uses of `flex' is as a companion to the `yacc'
parser-generator.  `yacc' parsers expect to call a routine named
`yylex()' to find the next input token.  The routine is supposed to
return the type of the next token as well as putting any associated
value in the global `yylval'.  To use `flex' with `yacc', one specifies
the `-d' option to `yacc' to instruct it to generate the file `y.tab.h'
containing definitions of all the `%token's appearing in the `yacc'
input.  This file is then included in the `flex' scanner.  For example,
if one of the tokens is `TOK_NUMBER', part of the scanner might look
like:

     %{
     #include "y.tab.h"
     %}

     %%

     [0-9]+          yylval = atoi( yytext ); return TOK_NUMBER;


File: flex.info,  Node: Scanner Options,  Next: Performance,  Prev: Yacc,  Up: Top

16 Scanner Options
******************

The various `flex' options are categorized by function in the following
menu.  If you want to look up a particular option by name, *Note Index
of Scanner Options::.

* Menu:

* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::

   Even though there are many scanner options, a typical scanner might
only specify the following options:

     %option   8bit reentrant bison-bridge
     %option   warn nodefault
     %option   yylineno
     %option   outfile="scanner.c" header-file="scanner.h"

   The first line specifies the general type of scanner we want.  The
second line specifies that we are being careful.  The third line asks
flex to track line numbers.
The last line tells flex what to name the files.  (The options can be
specified in any order.  We just divided them.)

   `flex' also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including `%option' directives in the first section of
the scanner specification.  You can specify multiple options with a
single `%option' directive, and multiple directives in the first
section of your flex input file.

   Most options are given simply as names, optionally preceded by the
word `no' (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading `--').

   `flex' scans your rule actions to determine whether you use the
`REJECT' or `yymore()' features.  The `reject' and `yymore' options are
available to override its decision as to whether you use these
features, either by setting them (e.g., `%option reject') to indicate
the feature is indeed used, or unsetting them to indicate it actually
is not used (e.g., `%option noyymore').

   A number of options are available for lint purists who want to
suppress the appearance of unneeded routines in the generated scanner.
Each of the following, if unset (e.g., `%option nounput'), results in
the corresponding routine not appearing in the generated scanner:

     input, unput
     yy_push_state, yy_pop_state, yy_top_state
     yy_scan_buffer, yy_scan_bytes, yy_scan_string

     yyget_extra, yyset_extra, yyget_leng, yyget_text,
     yyget_lineno, yyset_lineno, yyget_in, yyset_in,
     yyget_out, yyset_out, yyget_lval, yyset_lval,
     yyget_lloc, yyset_lloc, yyget_debug, yyset_debug

   (though `yy_push_state()' and friends won't appear anyway unless you
use `%option stack').


File: flex.info,  Node: Options for Specifying Filenames,  Next: Options Affecting Scanner Behavior,  Prev: Scanner Options,  Up: Scanner Options

16.1 Options for Specifying Filenames
=====================================

`--header-file=FILE, `%option header-file="FILE"''
     instructs flex to write a C header to `FILE'.  This file contains
     function prototypes, extern variables, and types used by the
     scanner.  Only the external API is exported by the header file.
     Many macros that are usable from within scanner actions are not
     exported to the header file.  This is due to namespace problems and
     the goal of a clean external API.

     While in the header, the macro `yyIN_HEADER' is defined, where `yy'
     is substituted with the appropriate prefix.

     The `--header-file' option is not compatible with the `--c++'
     option, since the C++ scanner provides its own header in
     `yyFlexLexer.h'.

`-oFILE, --outfile=FILE, `%option outfile="FILE"''
     directs flex to write the scanner to the file `FILE' instead of
     `lex.yy.c'.  If you combine `--outfile' with the `--stdout' option,
     then the scanner is written to `stdout' but its `#line' directives
     (see the `-L' option below) refer to the file `FILE'.

`-t, --stdout, `%option stdout''
     instructs `flex' to write the scanner it generates to standard
     output instead of `lex.yy.c'.

`-SFILE, --skel=FILE'
     overrides the default skeleton file from which `flex' constructs
     its scanners.  You'll never need this option unless you are doing
     `flex' maintenance or development.

`--tables-file=FILE'
     Write serialized scanner DFA tables to FILE.  The generated scanner
     will not contain the tables, and requires them to be loaded at
     runtime.  *Note serialization::.

`--tables-verify'
     This option is for flex development.
     We document it here in case
     you stumble upon it by accident or in case you suspect some
     inconsistency in the serialized tables.  Flex will serialize the
     scanner dfa tables but will also generate the in-code tables as it
     normally does.  At runtime, the scanner will verify that the
     serialized tables match the in-code tables, instead of loading
     them.



File: flex.info,  Node: Options Affecting Scanner Behavior,  Next: Code-Level And API Options,  Prev: Options for Specifying Filenames,  Up: Scanner Options

16.2 Options Affecting Scanner Behavior
=======================================

`-i, --case-insensitive, `%option case-insensitive''
     instructs `flex' to generate a "case-insensitive" scanner.  The
     case of letters given in the `flex' input patterns will be ignored,
     and tokens in the input will be matched regardless of case.  The
     matched text given in `yytext' will have the preserved case (i.e.,
     it will not be folded).  For tricky behavior, see *note case and
     character ranges::.

`-l, --lex-compat, `%option lex-compat''
     turns on maximum compatibility with the original AT&T `lex'
     implementation.  Note that this does not mean _full_ compatibility.
     Use of this option costs a considerable amount of performance, and
     it cannot be used with the `--c++', `--full', `--fast', `-Cf', or
     `-CF' options.  For details on the compatibilities it provides, see
     *note Lex and Posix::.  This option also results in the name
     `YY_FLEX_LEX_COMPAT' being `#define''d in the generated scanner.

`-B, --batch, `%option batch''
     instructs `flex' to generate a "batch" scanner, the opposite of
     _interactive_ scanners generated by `--interactive' (see below).
     In general, you use `-B' when you are _certain_ that your scanner
     will never be used interactively, and you want to squeeze a
     _little_ more performance out of it.
     If your goal is instead to
     squeeze out a _lot_ more performance, you should be using the
     `-Cf' or `-CF' options, which turn on `--batch' automatically
     anyway.

`-I, --interactive, `%option interactive''
     instructs `flex' to generate an interactive scanner.  An
     interactive scanner is one that only looks ahead to decide what
     token has been matched if it absolutely must.  It turns out that
     always looking one extra character ahead, even if the scanner has
     already seen enough text to disambiguate the current token, is a
     bit faster than only looking ahead when necessary.  But scanners
     that always look ahead give dreadful interactive performance; for
     example, when a user types a newline, it is not recognized as a
     newline token until they enter _another_ token, which often means
     typing in another whole line.

     `flex' scanners default to `interactive' unless you use the `-Cf'
     or `-CF' table-compression options (*note Performance::).  That's
     because if you're looking for high performance you should be using
     one of these options, so if you didn't, `flex' assumes you'd
     rather trade off a bit of run-time performance for intuitive
     interactive behavior.  Note also that you _cannot_ use
     `--interactive' in conjunction with `-Cf' or `-CF'.  Thus, this
     option is not really needed; it is on by default for all those
     cases in which it is allowed.

     You can force a scanner to _not_ be interactive by using `--batch'.

`-7, --7bit, `%option 7bit''
     instructs `flex' to generate a 7-bit scanner, i.e., one which can
     only recognize 7-bit characters in its input.  The advantage of
     using `--7bit' is that the scanner's tables can be up to half the
     size of those generated using `--8bit'.  The disadvantage is
     that such scanners often hang or crash if their input contains an
     8-bit character.

     Note, however, that unless you generate your scanner using the
     `-Cf' or `-CF' table compression options, use of `--7bit' will
     save only a small amount of table space, and make your scanner
     considerably less portable.  `Flex''s default behavior is to
     generate an 8-bit scanner unless you use `-Cf' or `-CF', in
     which case `flex' defaults to generating 7-bit scanners unless
     your site was always configured to generate 8-bit scanners (as will
     often be the case with non-USA sites).  You can tell whether flex
     generated a 7-bit or an 8-bit scanner by inspecting the flag
     summary in the `--verbose' output (*note Debugging Options::).

     Note that if you use `-Cfe' or `-CFe' `flex' still defaults to
     generating an 8-bit scanner, since usually with these compression
     options full 8-bit tables are not much more expensive than 7-bit
     tables.

`-8, --8bit, `%option 8bit''
     instructs `flex' to generate an 8-bit scanner, i.e., one which can
     recognize 8-bit characters.  This flag is only needed for scanners
     generated using `-Cf' or `-CF', as otherwise flex defaults to
     generating an 8-bit scanner anyway.

     See the discussion of `--7bit' above for `flex''s default behavior
     and the tradeoffs between 7-bit and 8-bit scanners.

`--default, `%option default''
     generates the default rule.

`--always-interactive, `%option always-interactive''
     instructs flex to generate a scanner which always considers its
     input _interactive_.  Normally, on each new input file the scanner
     calls `isatty()' in an attempt to determine whether the scanner's
     input source is interactive and thus should be read a character at
     a time.  When this option is used, however, then no such call is
     made.

`--never-interactive, `%option never-interactive''
     instructs flex to generate a scanner which never considers its
     input interactive.
     This is the opposite of `always-interactive'.

`-X, --posix, `%option posix''
     turns on maximum compatibility with the POSIX 1003.2-1992
     definition of `lex'.  Since `flex' was originally designed to
     implement the POSIX definition of `lex' this generally involves
     very few changes in behavior.  At the current writing the known
     differences between `flex' and the POSIX standard are:

        * In POSIX and AT&T `lex', the repeat operator, `{}', has lower
          precedence than concatenation (thus `ab{3}' yields `ababab').
          Most POSIX utilities use an Extended Regular Expression (ERE)
          precedence that has the precedence of the repeat operator
          higher than concatenation (which causes `ab{3}' to yield
          `abbb').  By default, `flex' places the precedence of the
          repeat operator higher than concatenation which matches the
          ERE processing of other POSIX utilities.  When either
          `--posix' or `-l' is specified, `flex' will use the
          traditional AT&T and POSIX-compliant precedence for the
          repeat operator where concatenation has higher precedence
          than the repeat operator.

`--stack, `%option stack''
     enables the use of start condition stacks (*note Start
     Conditions::).

`--stdinit, `%option stdinit''
     if set (i.e., `%option stdinit') initializes `yyin' and `yyout' to
     `stdin' and `stdout', instead of the default of `NULL'.  Some
     existing `lex' programs depend on this behavior, even though it is
     not compliant with ANSI C, which does not require `stdin' and
     `stdout' to be compile-time constant.  In a reentrant scanner,
     however, this is not a problem since initialization is performed
     in `yylex_init' at runtime.

`--yylineno, `%option yylineno''
     directs `flex' to generate a scanner that maintains the number of
     the current line read from its input in the global variable
     `yylineno'.  This option is implied by `%option lex-compat'.
     In a reentrant C scanner, the macro `yylineno' is accessible
     regardless of the value of `%option yylineno'; however, its value
     is not modified by `flex' unless `%option yylineno' is enabled.

`--yywrap, `%option yywrap''
     if unset (i.e., `--noyywrap'), makes the scanner not call
     `yywrap()' upon an end-of-file, but simply assume that there are no
     more files to scan (until the user points `yyin' at a new file and
     calls `yylex()' again).



File: flex.info,  Node: Code-Level And API Options,  Next: Options for Scanner Speed and Size,  Prev: Options Affecting Scanner Behavior,  Up: Scanner Options

16.3 Code-Level And API Options
===============================

`--ansi-definitions, `%option ansi-definitions''
     instructs flex to generate ANSI C99 definitions for functions.
     This option is enabled by default.  If `%option
     noansi-definitions' is specified, then the obsolete style is
     generated.

`--ansi-prototypes, `%option ansi-prototypes''
     instructs flex to generate ANSI C99 prototypes for functions.
     This option is enabled by default.  If `noansi-prototypes' is
     specified, then prototypes will have empty parameter lists.

`--bison-bridge, `%option bison-bridge''
     instructs flex to generate a C scanner that is meant to be called
     by a `GNU bison' parser.  The scanner has minor API changes for
     `bison' compatibility.  In particular, the declaration of `yylex'
     is modified to take an additional parameter, `yylval'.  *Note
     Bison Bridge::.

`--bison-locations, `%option bison-locations''
     informs flex that `GNU bison' `%locations' are being used.  This
     means `yylex' will be passed an additional parameter, `yylloc'.
     This option implies `%option bison-bridge'.  *Note Bison Bridge::.

`-L, --noline, `%option noline''
     instructs `flex' not to generate `#line' directives.
     Without this option, `flex' peppers the generated scanner with
     `#line' directives so error messages in the actions will be
     correctly located with respect to either the original `flex' input
     file (if the errors are due to code in the input file), or
     `lex.yy.c' (if the errors are `flex''s fault - you should report
     these sorts of errors to the email address given in *note
     Reporting Bugs::).

`-R, --reentrant, `%option reentrant''
     instructs flex to generate a reentrant C scanner.  The generated
     scanner may safely be used in a multi-threaded environment.  The
     API for a reentrant scanner is different than for a non-reentrant
     scanner (*note Reentrant::).  Because of the API difference between
     reentrant and non-reentrant `flex' scanners, non-reentrant flex
     code must be modified before it is suitable for use with this
     option.  This option is not compatible with the `--c++' option.

     The option `--reentrant' does not affect the performance of the
     scanner.

`-+, --c++, `%option c++''
     specifies that you want flex to generate a C++ scanner class.
     *Note Cxx::, for details.

`--array, `%option array''
     specifies that you want `yytext' to be an array instead of a
     `char *'.

`--pointer, `%option pointer''
     specifies that `yytext' should be a `char *', not an array.  The
     default is `char *'.

`-PPREFIX, --prefix=PREFIX, `%option prefix="PREFIX"''
     changes the default `yy' prefix used by `flex' for all
     globally-visible variable and function names to instead be
     `PREFIX'.  For example, `--prefix=foo' changes the name of
     `yytext' to `footext'.  It also changes the name of the default
     output file from `lex.yy.c' to `lex.foo.c'.
     Here is a partial list of the names affected:

          yy_create_buffer
          yy_delete_buffer
          yy_flex_debug
          yy_init_buffer
          yy_flush_buffer
          yy_load_buffer_state
          yy_switch_to_buffer
          yyin
          yyleng
          yylex
          yylineno
          yyout
          yyrestart
          yytext
          yywrap
          yyalloc
          yyrealloc
          yyfree

     (If you are using a C++ scanner, then only `yywrap' and
     `yyFlexLexer' are affected.)  Within your scanner itself, you can
     still refer to the global variables and functions using either
     version of their name; but externally, they have the modified name.

     This option lets you easily link together multiple `flex' programs
     into the same executable.  Note, though, that using this option
     also renames `yywrap()', so you now _must_ either provide your own
     (appropriately-named) version of the routine for your scanner, or
     use `%option noyywrap', as linking with `-lfl' no longer provides
     one for you by default.

`--main, `%option main''
     directs flex to provide a default `main()' program for the
     scanner, which simply calls `yylex()'.  This option implies
     `noyywrap' (see above).

`--nounistd, `%option nounistd''
     suppresses inclusion of the non-ANSI header file `unistd.h'.  This
     option is meant to target environments in which `unistd.h' does
     not exist.  Be aware that certain options may cause flex to
     generate code that relies on functions normally found in
     `unistd.h' (e.g., `isatty()', `read()').  If you wish to use these
     functions, you will have to inform your compiler where to find
     them.  *Note option-always-interactive::.  *Note option-read::.

`--yyclass=NAME, `%option yyclass="NAME"''
     only applies when generating a C++ scanner (the `--c++' option).
     It informs `flex' that you have derived `NAME' as a subclass of
     `yyFlexLexer', so `flex' will place your actions in the member
     function `NAME::yylex()' instead of `yyFlexLexer::yylex()'.  It
     also generates a `yyFlexLexer::yylex()' member function that emits
     a run-time error (by invoking `yyFlexLexer::LexerError()') if
     called.  *Note Cxx::.



File: flex.info,  Node: Options for Scanner Speed and Size,  Next: Debugging Options,  Prev: Code-Level And API Options,  Up: Scanner Options

16.4 Options for Scanner Speed and Size
=======================================

`-C[aefFmr]'
     controls the degree of table compression and, more generally,
     trade-offs between small scanners and fast scanners.

    `-C'
          A lone `-C' specifies that the scanner tables should be
          compressed but neither equivalence classes nor
          meta-equivalence classes should be used.

    `-Ca, --align, `%option align''
          ("align") instructs flex to trade off larger tables in the
          generated scanner for faster performance because the elements
          of the tables are better aligned for memory access and
          computation.  On some RISC architectures, fetching and
          manipulating longwords is more efficient than with
          smaller-sized units such as shortwords.  This option can
          quadruple the size of the tables used by your scanner.

    `-Ce, --ecs, `%option ecs''
          directs `flex' to construct "equivalence classes", i.e., sets
          of characters which have identical lexical properties (for
          example, if the only appearance of digits in the `flex' input
          is in the character class "[0-9]" then the digits '0', '1',
          ..., '9' will all be put in the same equivalence class).
          Equivalence classes usually give dramatic reductions in the
          final table/object file sizes (typically a factor of 2-5) and
          are pretty cheap performance-wise (one array look-up per
          character scanned).

    `-Cf'
          specifies that the "full" scanner tables should be generated -
          `flex' should not compress the tables by taking advantage of
          similar transition functions for different states.

    `-CF'
          specifies that the alternate fast scanner representation
          (described below under the `--fast' flag) should be used.
          This option cannot be used with `--c++'.

    `-Cm, --meta-ecs, `%option meta-ecs''
          directs `flex' to construct "meta-equivalence classes", which
          are sets of equivalence classes (or characters, if equivalence
          classes are not being used) that are commonly used together.
          Meta-equivalence classes are often a big win when using
          compressed tables, but they have a moderate performance
          impact (one or two `if' tests and one array look-up per
          character scanned).

    `-Cr, --read, `%option read''
          causes the generated scanner to _bypass_ use of the standard
          I/O library (`stdio') for input.  Instead of calling
          `fread()' or `getc()', the scanner will use the `read()'
          system call, resulting in a performance gain which varies
          from system to system, but in general is probably negligible
          unless you are also using `-Cf' or `-CF'.  Using `-Cr' can
          cause strange behavior if, for example, you read from `yyin'
          using `stdio' prior to calling the scanner (because the
          scanner will miss whatever text your previous reads left in
          the `stdio' input buffer).  `-Cr' has no effect if you define
          `YY_INPUT()' (*note Generated Scanner::).

     The options `-Cf' or `-CF' and `-Cm' do not make sense together -
     there is no opportunity for meta-equivalence classes if the table
     is not being compressed.  Otherwise the options may be freely
     mixed, and are cumulative.

     The default setting is `-Cem', which specifies that `flex' should
     generate equivalence classes and meta-equivalence classes.
     This setting provides the highest degree of table compression.
     You can trade off faster-executing scanners at the cost of larger
     tables with the following generally being true:

          slowest & smallest
                -Cem
                -Cm
                -Ce
                -C
                -C{f,F}e
                -C{f,F}
                -C{f,F}a
          fastest & largest

     Note that scanners with the smallest tables are usually generated
     and compiled the quickest, so during development you will usually
     want to use the default, maximal compression.

     `-Cfe' is often a good compromise between speed and size for
     production scanners.

`-f, --full, `%option full''
     specifies "fast scanner".  No table compression is done and
     `stdio' is bypassed.  The result is large but fast.  This option
     is equivalent to `-Cfr'.

`-F, --fast, `%option fast''
     specifies that the _fast_ scanner table representation should be
     used (and `stdio' bypassed).  This representation is about as fast
     as the full table representation `--full', and for some sets of
     patterns will be considerably smaller (and for others, larger).  In
     general, if the pattern set contains both _keywords_ and a
     catch-all, _identifier_ rule, such as in the set:

          "case"    return TOK_CASE;
          "switch"  return TOK_SWITCH;
          ...
          "default" return TOK_DEFAULT;
          [a-z]+    return TOK_ID;

     then you're better off using the full table representation.  If
     only the _identifier_ rule is present and you then use a hash
     table or some such to detect the keywords, you're better off using
     `--fast'.

     This option is equivalent to `-CFr'.  It cannot be used with
     `--c++'.
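
   The keyword-lookup approach mentioned above for `--fast' (a single
_identifier_ rule plus a table consulted from its action) might be
sketched in C as follows.  For brevity this sketch swaps in a sorted
table searched with `bsearch()' instead of a real hash table.  The
function name `keyword_token' and the numeric token codes are
assumptions made for illustration; in a real scanner the codes would
normally come from the parser's header.

```c
#include <stdlib.h>
#include <string.h>

/* Token codes are assumptions for this sketch only. */
enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int         token;
};

/* Must stay sorted by name, because bsearch() requires it. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

/* bsearch() passes the search key first and a table element second. */
static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key,
                  ((const struct keyword *)elem)->name);
}

/* Returns the keyword's token code, or TOK_ID for any other text. */
static int keyword_token(const char *text)
{
    const struct keyword *kw =
        bsearch(text, keywords,
                sizeof keywords / sizeof keywords[0],
                sizeof keywords[0], compare_keyword);

    return kw ? kw->token : TOK_ID;
}
```

   With such a helper the scanner would need only the catch-all rule,
for example `[a-z]+ return keyword_token( yytext );'.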


File: flex.info,  Node: Debugging Options,  Next: Miscellaneous Options,  Prev: Options for Scanner Speed and Size,  Up: Scanner Options

16.5 Debugging Options
======================

`-b, --backup, `%option backup''
     Generate backing-up information to `lex.backup'.  This is a list of
     scanner states which require backing up and the input characters on
     which they do so.  By adding rules one can remove backing-up
     states.  If _all_ backing-up states are eliminated and `-Cf' or
     `-CF' is used, the generated scanner will run faster (see the
     `--perf-report' flag).  Only users who wish to squeeze every last
     cycle out of their scanners need worry about this option.  (*note
     Performance::).

`-d, --debug, `%option debug''
     makes the generated scanner run in "debug" mode.  Whenever a
     pattern is recognized and the global variable `yy_flex_debug' is
     non-zero (which is the default), the scanner will write to
     `stderr' a line of the form:

          --accepting rule at line 53 ("the matched text")

     The line number refers to the location of the rule in the file
     defining the scanner (i.e., the file that was fed to flex).
     Messages are also generated when the scanner backs up, accepts the
     default rule, reaches the end of its input buffer (or encounters a
     NUL; at this point, the two look the same as far as the scanner's
     concerned), or reaches an end-of-file.

`-p, --perf-report, `%option perf-report''
     generates a performance report to `stderr'.  The report consists of
     comments regarding features of the `flex' input file which will
     cause a serious loss of performance in the resulting scanner.  If
     you give the flag twice, you will also get comments regarding
     features that lead to minor performance losses.

     Note that the use of `REJECT' and variable trailing context
     (*note Limitations::) entails a substantial performance penalty;
     use of `yymore()', the `^' operator, and the `--interactive' flag
     entail minor performance penalties.

`-s, --nodefault, `%option nodefault''
     causes the _default rule_ (that unmatched scanner input is echoed
     to `stdout') to be suppressed.  If the scanner encounters input
     that does not match any of its rules, it aborts with an error.
     This option is useful for finding holes in a scanner's rule set.

`-T, --trace, `%option trace''
     makes `flex' run in "trace" mode.  It will generate a lot of
     messages to `stderr' concerning the form of the input and the
     resultant non-deterministic and deterministic finite automata.
     This option is mostly for use in maintaining `flex'.

`-w, --nowarn, `%option nowarn''
     suppresses warning messages.

`-v, --verbose, `%option verbose''
     specifies that `flex' should write to `stderr' a summary of
     statistics regarding the scanner it generates.  Most of the
     statistics are meaningless to the casual `flex' user, but the
     first line identifies the version of `flex' (same as reported by
     `--version'), and the next line the flags used when generating the
     scanner, including those that are on by default.

`--warn, `%option warn''
     warn about certain things.  In particular, if the default rule can
     be matched but no default rule has been given, `flex' will warn
     you.  We recommend using this option always.


File: flex.info,  Node: Miscellaneous Options,  Prev: Debugging Options,  Up: Scanner Options

16.6 Miscellaneous Options
==========================

`-c'
     A do-nothing option included for POSIX compliance.

`-h, -?, --help'
     generates a "help" summary of `flex''s options to `stdout' and
     then exits.

`-n'
     Another do-nothing option included for POSIX compliance.

`-V, --version'
     prints the version number to `stdout' and exits.


File: flex.info,  Node: Performance,  Next: Cxx,  Prev: Scanner Options,  Up: Top

17 Performance Considerations
*****************************

The main design goal of `flex' is that it generate high-performance
scanners.  It has been optimized for dealing well with large sets of
rules.  Aside from the effects on scanner speed of the table compression
`-C' options outlined above, there are a number of options/actions
which degrade performance.  These are, from most expensive to least:

         REJECT
         arbitrary trailing context

         pattern sets that require backing up
         %option yylineno
         %array

         %option interactive
         %option always-interactive

         ^ beginning-of-line operator
         yymore()

   with the first two being quite expensive and the last two being
quite cheap.  Note also that `unput()' is implemented as a routine call
that potentially does quite a bit of work, while `yyless()' is a
quite-cheap macro.  So if you are just putting back some excess text you
scanned, use `yyless()'.

   `REJECT' should be avoided at all costs when performance is
important.  It is a particularly expensive option.

   There is one case when `%option yylineno' can be expensive.  That is
when your patterns match long tokens that could _possibly_ contain a
newline character.  There is no performance penalty for rules that
cannot possibly match newlines, since flex does not need to check them
for newlines.  In general, you should avoid rules such as `[^f]+', which
match very long tokens, including newlines, and may possibly match your
entire file!
A better approach is to separate `[^f]+' into two rules:

     %option yylineno
     %%
         [^f\n]+
         \n+

   The above scanner does not incur a performance penalty.

   Getting rid of backing up is messy and often may be an enormous
amount of work for a complicated scanner.  In principle, one begins by
using the `-b' flag to generate a `lex.backup' file.  For example, on
the input:

     %%
     foo        return TOK_KEYWORD;
     foobar     return TOK_KEYWORD;

   the file looks like:

     State #6 is non-accepting -
      associated rule line numbers:
            2       3
      out-transitions: [ o ]
      jam-transitions: EOF [ \001-n  p-\177 ]

     State #8 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ a ]
      jam-transitions: EOF [ \001-`  b-\177 ]

     State #9 is non-accepting -
      associated rule line numbers:
            3
      out-transitions: [ r ]
      jam-transitions: EOF [ \001-q  s-\177 ]

     Compressed tables always back up.

   The first few lines tell us that there's a scanner state in which it
can make a transition on an 'o' but not on any other character, and
that in that state the currently scanned text does not match any rule.
The state occurs when trying to match the rules found at lines 2 and 3
in the input file.  If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched.  With a bit of headscratching one can see that this
must be the state it's in when it has seen `fo'.  When this has
happened, if anything other than another `o' is seen, the scanner will
have to back up to simply match the `f' (by the default rule).

   The comment regarding State #8 indicates there's a problem when
`foob' has been scanned.  Indeed, on any character other than an `a',
the scanner will have to back up to accept "foo".
Similarly, the comment for State #9 concerns when `fooba' has been
scanned and an `r' does not follow.

   The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using `-Cf'
or `-CF', since there's no performance gain doing so with compressed
scanners.

   The way to remove the backing up is to add "error" rules:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     fooba       |
     foob        |
     fo          {
                 /* false alarm, not really a keyword */
                 return TOK_ID;
                 }

   Eliminating backing up among a list of keywords can also be done
using a "catch-all" rule:

     %%
     foo         return TOK_KEYWORD;
     foobar      return TOK_KEYWORD;

     [a-z]+      return TOK_ID;

   This is usually the best solution when appropriate.

   Backing up messages tend to cascade.  With a complicated set of rules
it's not uncommon to get hundreds of messages.  If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up (though it's easy to make a mistake and have an error rule
accidentally match a valid token).  A possible future `flex' feature
will be to automatically add rules to eliminate backing up.

   It's important to keep in mind that you gain the benefits of
eliminating backing up only if you eliminate _every_ instance of
backing up.  Leaving just one means you gain nothing.

   _Variable_ trailing context (where both the leading and trailing
parts do not have a fixed length) entails almost the same performance
loss as `REJECT' (i.e., substantial).
So when possible a rule like:

     %%
     mouse|rat/(cat|dog)   run();

   is better written:

     %%
     mouse/cat|dog         run();
     rat/cat|dog           run();

   or as

     %%
     mouse|rat/cat         run();
     mouse|rat/dog         run();

   Note that here the special '|' action does _not_ provide any
savings, and can even make things worse (*note Limitations::).

   Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run.  This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g., `yytext')
for the action.  Recall the scanner for C comments:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>"*"+[^*/\n]*
     <comment>\n             ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   This could be sped up by writing it as:

     %x comment
     %%
             int line_num = 1;

     "/*"         BEGIN(comment);

     <comment>[^*\n]*
     <comment>[^*\n]*\n      ++line_num;
     <comment>"*"+[^*/\n]*
     <comment>"*"+[^*/\n]*\n ++line_num;
     <comment>"*"+"/"        BEGIN(INITIAL);

   Now instead of each newline requiring the processing of another
action, recognizing the newlines is distributed over the other rules to
keep the matched text as long as possible.  Note that _adding_ rules
does _not_ slow down the scanner!  The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as `*' and `|'.

   A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line and
with no other extraneous characters, and recognize all the keywords.  A
natural first approach is:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     .|\n     /* it's not a keyword */

   To eliminate the back-tracking, introduce a catch-all rule:

     %%
     asm      |
     auto     |
     break    |
     ... etc ...
     volatile |
     while    /* it's a keyword */

     [a-z]+   |
     .|\n     /* it's not a keyword */

   Now, if it's guaranteed that there's exactly one word per line, then
we can reduce the total number of matches by half by merging in the
recognition of newlines with that of the other tokens:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */

     [a-z]+\n |
     .|\n     /* it's not a keyword */

   One has to be careful here, as we have now reintroduced backing up
into the scanner.  In particular, while _we_ know that there will never
be any characters in the input stream other than letters or newlines,
`flex' can't figure this out, and it will plan for possibly needing to
back up when it has scanned a token like `auto' and then the next
character is something other than a newline or a letter.  Previously it
would then just match the `auto' rule and be done, but now it has no
`auto' rule, only an `auto\n' rule.  To eliminate the possibility of
backing up, we could either duplicate all rules but without final
newlines, or, since we never expect to encounter such an input and
therefore don't care how it's classified, we can introduce one more
catch-all rule, this one which doesn't include a newline:

     %%
     asm\n    |
     auto\n   |
     break\n  |
     ... etc ...
     volatile\n |
     while\n  /* it's a keyword */

     [a-z]+\n |
     [a-z]+   |
     .|\n     /* it's not a keyword */

   Compiled with `-Cf', this is about as fast as one can get a `flex'
scanner to go for this particular problem.

   A final note: `flex' is slow when matching `NUL's, particularly when
a token contains multiple `NUL's.  It's best to write rules which match
_short_ amounts of text if it's anticipated that the text will often
include `NUL's.

   Another final note regarding performance: as mentioned in *note
Matching::, dynamically resizing `yytext' to accommodate huge tokens is
a slow process because it presently requires that the (huge) token be
rescanned from the beginning.  Thus if performance is vital, you should
attempt to match "large" quantities of text but not "huge" quantities,
where the cutoff between the two is at about 8K characters per token.


File: flex.info,  Node: Cxx,  Next: Reentrant,  Prev: Performance,  Up: Top

18 Generating C++ Scanners
**************************

*IMPORTANT*: the present form of the scanning class is _experimental_
and may change considerably between major releases.

   `flex' provides two different ways to generate scanners for use with
C++.  The first way is to simply compile a scanner generated by `flex'
using a C++ compiler instead of a C compiler.  You should not encounter
any compilation errors (*note Reporting Bugs::).  You can then use C++
code in your rule actions instead of C code.  Note that the default
input source for your scanner remains `yyin', and default echoing is
still done to `yyout'.  Both of these remain `FILE *' variables and not
C++ _streams_.

   You can also use `flex' to generate a C++ scanner class, using the
`-+' option (or, equivalently, `%option c++'), which is automatically
specified if the name of the `flex' executable ends in a '+', such as
`flex++'.  When using this option, `flex' defaults to generating the
scanner to the file `lex.yy.cc' instead of `lex.yy.c'.  The generated
scanner includes the header file `FlexLexer.h', which defines the
interface to two C++ classes.

   The first class, `FlexLexer', provides an abstract base class
defining the general scanner class interface.  It provides the
following member functions:

`const char* YYText()'
     returns the text of the most recently matched token, the
     equivalent of `yytext'.

`int YYLeng()'
     returns the length of the most recently matched token, the
     equivalent of `yyleng'.

`int lineno() const'
     returns the current input line number (see `%option yylineno'), or
     `1' if `%option yylineno' was not used.

`void set_debug( int flag )'
     sets the debugging flag for the scanner, equivalent to assigning to
     `yy_flex_debug' (*note Scanner Options::).  Note that you must
     build the scanner using `%option debug' to include debugging
     information in it.

`int debug() const'
     returns the current setting of the debugging flag.

   Also provided are member functions equivalent to
`yy_switch_to_buffer()', `yy_create_buffer()' (though the first
argument is an `istream*' object pointer and not a `FILE*'),
`yy_flush_buffer()', `yy_delete_buffer()', and `yyrestart()' (again,
the first argument is an `istream*' object pointer).

   The second class defined in `FlexLexer.h' is `yyFlexLexer', which is
derived from `FlexLexer'.
It defines the following additional member functions:

`yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )'
     constructs a `yyFlexLexer' object using the given streams for input
     and output.  If not specified, the streams default to `cin' and
     `cout', respectively.

`virtual int yylex()'
     performs the same role as `yylex()' does for ordinary `flex'
     scanners: it scans the input stream, consuming tokens, until a
     rule's action returns a value.  If you derive a subclass `S' from
     `yyFlexLexer' and want to access the member functions and variables
     of `S' inside `yylex()', then you need to use `%option
     yyclass="S"' to inform `flex' that you will be using that subclass
     instead of `yyFlexLexer'.  In this case, rather than generating
     `yyFlexLexer::yylex()', `flex' generates `S::yylex()' (and also
     generates a dummy `yyFlexLexer::yylex()' that calls
     `yyFlexLexer::LexerError()' if called).

`virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0)'
     reassigns `yyin' to `new_in' (if non-null) and `yyout' to
     `new_out' (if non-null), deleting the previous input buffer if
     `yyin' is reassigned.

`int yylex( istream* new_in, ostream* new_out = 0 )'
     first switches the input streams via `switch_streams( new_in,
     new_out )' and then returns the value of `yylex()'.

   In addition, `yyFlexLexer' defines the following protected virtual
functions which you can redefine in derived classes to tailor the
scanner:

`virtual int LexerInput( char* buf, int max_size )'
     reads up to `max_size' characters into `buf' and returns the
     number of characters read.  To indicate end-of-input, return 0
     characters.  Note that `interactive' scanners (see the `-B' and
     `-I' flags in *note Scanner Options::) define the macro
     `YY_INTERACTIVE'.
     If you redefine `LexerInput()' and need to take
     different actions depending on whether or not the scanner might be
     scanning an interactive input source, you can test for the
     presence of this name via `#ifdef' statements.

`virtual void LexerOutput( const char* buf, int size )'
     writes out `size' characters from the buffer `buf', which, while
     `NUL'-terminated, may also contain internal `NUL's if the
     scanner's rules can match text with `NUL's in them.

`virtual void LexerError( const char* msg )'
     reports a fatal error message.  The default version of this
     function writes the message to the stream `cerr' and exits.

   Note that a `yyFlexLexer' object contains its _entire_ scanning
state.  Thus you can use such objects to create reentrant scanners, but
see also *note Reentrant::.  You can instantiate multiple instances of
the same `yyFlexLexer' class, and you can also combine multiple C++
scanner classes together in the same program using the `-P' option
discussed above.

   Finally, note that the `%array' feature is not available to C++
scanner classes; you must use `%pointer' (the default).

   Here is an example of a simple C++ scanner:

          // An example of using the flex C++ scanner class.

     %{
     #include <iostream>
     using namespace std;
     int mylineno = 0;
     %}

     %option noyywrap

     string  \"[^\n"]+\"

     ws      [ \t]+

     alpha   [A-Za-z]
     dig     [0-9]
     name    ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
     num1    [-+]?{dig}+\.?([eE][-+]?{dig}+)?
     num2    [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
     number  {num1}|{num2}

     %%

     {ws}    /* skip blanks and tabs */

     "/*"    {
             int c;

             while((c = yyinput()) != 0)
                 {
                 if(c == '\n')
                     ++mylineno;

                 else if(c == '*')
                     {
                     if((c = yyinput()) == '/')
                         break;
                     else
                         unput(c);
                     }
                 }
             }

     {number}  cout << "number " << YYText() << '\n';

     \n        mylineno++;

     {name}    cout << "name " << YYText() << '\n';

     {string}  cout << "string " << YYText() << '\n';

     %%

     int main( int /* argc */, char** /* argv */ )
     {
         FlexLexer* lexer = new yyFlexLexer;
         while(lexer->yylex() != 0)
             ;
         return 0;
     }

   If you want to create multiple (different) lexer classes, you use the
`-P' flag (or the `prefix=' option) to rename each `yyFlexLexer' to
some other `xxFlexLexer'.  You then can include `<FlexLexer.h>' in your
other sources once per lexer class, first renaming `yyFlexLexer' as
follows:

     #undef yyFlexLexer
     #define yyFlexLexer xxFlexLexer
     #include <FlexLexer.h>

     #undef yyFlexLexer
     #define yyFlexLexer zzFlexLexer
     #include <FlexLexer.h>

   if, for example, you used `%option prefix="xx"' for one of your
scanners and `%option prefix="zz"' for the other.


File: flex.info,  Node: Reentrant,  Next: Lex and Posix,  Prev: Cxx,  Up: Top

19 Reentrant C Scanners
***********************

`flex' has the ability to generate a reentrant C scanner.  This is
accomplished by specifying `%option reentrant' (`-R').  The generated
scanner is both portable and safe to use in one or more separate
threads of control.  The most common use for reentrant scanners is from
within multi-threaded applications.  Any thread may create and execute
a reentrant `flex' scanner without the need for synchronization with
other threads.

* Menu:

* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::


File: flex.info,  Node: Reentrant Uses,  Next: Reentrant Overview,  Prev: Reentrant,  Up: Reentrant

19.1 Uses for Reentrant Scanners
================================

However, there are other uses for a reentrant scanner.  For example, you
could scan two or more files simultaneously to implement a `diff' at
the token level (i.e., instead of at the character level):

         /* Example of maintaining more than one active scanner. */

         /* Declared outside the loop so the `while' test can see them. */
         int tok1, tok2;

         do {
             tok1 = yylex( scanner_1 );
             tok2 = yylex( scanner_2 );

             if( tok1 != tok2 )
                 printf("Files are different.\n");

        } while ( tok1 && tok2 );

   Another use for a reentrant scanner is recursion.  (Note that a
recursive scanner can also be created using a non-reentrant scanner and
buffer states.  *Note Multiple Input Buffers::.)

   The following crude scanner supports the `eval' command by invoking
another instance of itself.

         /* Example of recursive invocation. */

         %option reentrant

         %%
         "eval(".+")"  {
                           yyscan_t scanner;
                           YY_BUFFER_STATE buf;

                           yylex_init( &scanner );
                           yytext[yyleng-1] = ' ';

                           buf = yy_scan_string( yytext + 5, scanner );
                           yylex( scanner );

                           yy_delete_buffer(buf,scanner);
                           yylex_destroy( scanner );
                      }
         ...
         %%


File: flex.info,  Node: Reentrant Overview,  Next: Reentrant Example,  Prev: Reentrant Uses,  Up: Reentrant

19.2 An Overview of the Reentrant API
=====================================

The API for reentrant scanners is different from that for non-reentrant
scanners.  Here is a quick overview of the API:

   `%option reentrant' must be specified.

   * All functions take one additional argument: `yyscanner'

   * All global variables are replaced by their macro equivalents.  (We
     tell you this because it may be important to you during debugging.)

   * `yylex_init' and `yylex_destroy' must be called before and after
     `yylex', respectively.

   * Accessor methods (get/set functions) provide access to common
     `flex' variables.

   * User-specific data can be stored in `yyextra'.


File: flex.info,  Node: Reentrant Example,  Next: Reentrant Detail,  Prev: Reentrant Overview,  Up: Reentrant

19.3 Reentrant Example
======================

First, an example of a reentrant scanner:

         /* This scanner prints "//" comments. */

         %option reentrant stack noyywrap
         %x COMMENT

         %%

         "//"                 yy_push_state( COMMENT, yyscanner);
         .|\n

         <COMMENT>\n          yy_pop_state( yyscanner );
         <COMMENT>[^\n]+      fprintf( yyout, "%s\n", yytext);

         %%

         int main ( int argc, char * argv[] )
         {
             yyscan_t scanner;

             yylex_init ( &scanner );
             yylex ( scanner );
             yylex_destroy ( scanner );
         return 0;
        }


File: flex.info,  Node: Reentrant Detail,  Next: Reentrant Functions,  Prev: Reentrant Example,  Up: Reentrant

19.4 The Reentrant API in Detail
================================

Here are the things you need to do or know to use the reentrant C API of
`flex'.

* Menu:

* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::


File: flex.info,  Node: Specify Reentrant,  Next: Extra Reentrant Argument,  Prev: Reentrant Detail,  Up: Reentrant Detail

19.4.1 Declaring a Scanner As Reentrant
---------------------------------------

`%option reentrant' (`--reentrant') must be specified.

   Notice that `%option reentrant' is specified in the above example
(*note Reentrant Example::).  Had this option not been specified, `flex'
would have happily generated a non-reentrant scanner without
complaining.  You may explicitly specify `%option noreentrant' if you
do _not_ want a reentrant scanner, although it is not necessary.  The
default is to generate a non-reentrant scanner.


File: flex.info,  Node: Extra Reentrant Argument,  Next: Global Replacement,  Prev: Specify Reentrant,  Up: Reentrant Detail

19.4.2 The Extra Argument
-------------------------

All functions take one additional argument: `yyscanner'.

   Notice that the calls to `yy_push_state' and `yy_pop_state' both
have an argument, `yyscanner', that is not present in a non-reentrant
scanner.  Here are the declarations of `yy_push_state' and
`yy_pop_state' in the reentrant scanner:

         static void yy_push_state  ( int new_state , yyscan_t yyscanner ) ;
         static void yy_pop_state  ( yyscan_t yyscanner  ) ;

   Notice that the argument `yyscanner' appears in the declaration of
both functions.  In fact, all `flex' functions in a reentrant scanner
have this additional argument.  It is always the last argument in the
argument list, it is always of type `yyscan_t' (which is typedef'd to
`void *') and it is always named `yyscanner'.  As you may have guessed,
`yyscanner' is a pointer to an opaque data structure encapsulating the
current state of the scanner.  For a list of function declarations, see
*note Reentrant Functions::.  Note that preprocessor macros, such as
`BEGIN', `ECHO', and `REJECT', do not take this additional argument.
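
   The idea of threading all scanner state through a trailing opaque
argument can be sketched in plain C.  This is a simplified illustration
only: `demo_scan_t', `struct demo_guts', `demo_push_state()' and
`demo_pop_state()' are hypothetical stand-ins, and the structure shown
is not the real internal layout used by `flex'.

```c
#include <stddef.h>

typedef void *demo_scan_t;          /* stands in for yyscan_t */

/* Hypothetical per-scanner state; not flex's real internal structure. */
struct demo_guts {
    int    start_state;             /* current start condition */
    int    stack[8];                /* saved start conditions */
    size_t depth;                   /* number of saved entries */
};

/* All state arrives through the trailing argument; nothing is global. */
void demo_push_state(int new_state, demo_scan_t scanner)
{
    struct demo_guts *g = (struct demo_guts *)scanner;

    if (g->depth == sizeof g->stack / sizeof g->stack[0])
        return;                     /* stack full: leave the state unchanged */
    g->stack[g->depth++] = g->start_state;
    g->start_state = new_state;
}

void demo_pop_state(demo_scan_t scanner)
{
    struct demo_guts *g = (struct demo_guts *)scanner;

    if (g->depth > 0)
        g->start_state = g->stack[--g->depth];
}
```

   Because each call names the scanner it operates on, two scanners in
the same program (or in different threads) never touch each other's
state.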


File: flex.info,  Node: Global Replacement,  Next: Init and Destroy Functions,  Prev: Extra Reentrant Argument,  Up: Reentrant Detail

19.4.3 Global Variables Replaced By Macros
------------------------------------------

All global variables in traditional flex have been replaced by macro
equivalents.

   Note that in the above example, `yyout' and `yytext' are not plain
variables.  These are macros that will expand to their equivalent lvalue.
All of the familiar `flex' globals have been replaced by their macro
equivalents.  In particular, `yytext', `yyleng', `yylineno', `yyin',
`yyout', `yyextra', `yylval', and `yylloc' are macros.  You may safely
use these macros in actions as if they were plain variables.  We only
tell you this so you don't expect to link to these variables
externally.  Currently, each macro expands to a member of an internal
struct, e.g.,

     #define yytext (((struct yyguts_t*)yyscanner)->yytext_r)

   One important thing to remember about `yytext' and friends is that
they are not global variables in a reentrant scanner, so you cannot
access them directly from outside an action or from other functions.
You must use an accessor method, e.g., `yyget_text', to accomplish
this.  (See below).


File: flex.info,  Node: Init and Destroy Functions,  Next: Accessor Methods,  Prev: Global Replacement,  Up: Reentrant Detail

19.4.4 Init and Destroy Functions
---------------------------------

`yylex_init' and `yylex_destroy' must be called before and after
`yylex', respectively.

            int yylex_init ( yyscan_t * ptr_yy_globals ) ;
            int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ;
            int yylex ( yyscan_t yyscanner ) ;
            int yylex_destroy ( yyscan_t yyscanner ) ;

   The function `yylex_init' must be called before calling any other
function.
The argument to `yylex_init' is the address of an
uninitialized pointer to be filled in by `yylex_init', overwriting any
previous contents.  The function `yylex_init_extra' may be used instead,
taking as its first argument a variable of type `YY_EXTRA_TYPE'.  See
the section on yyextra, below, for more details.

   The value stored in `ptr_yy_globals' should thereafter be passed to
`yylex' and `yylex_destroy'.  Flex does not save the argument passed to
`yylex_init', so it is safe to pass the address of a local pointer to
`yylex_init' so long as it remains in scope for the duration of all
calls to the scanner, up to and including the call to `yylex_destroy'.

   The function `yylex' should be familiar to you by now.  The reentrant
version takes one argument, which is the value returned (via an
argument) by `yylex_init'.  Otherwise, it behaves the same as the
non-reentrant version of `yylex'.

   Both `yylex_init' and `yylex_init_extra' return 0 (zero) on success,
or non-zero on failure, in which case errno is set to one of the
following values:

   * ENOMEM Memory allocation error.  *Note memory-management::.

   * EINVAL Invalid argument.

   The function `yylex_destroy' should be called to free resources used
by the scanner.  After `yylex_destroy' is called, the contents of
`yyscanner' should not be used.  Of course, there is no need to destroy
a scanner if you plan to reuse it.  A `flex' scanner (both reentrant
and non-reentrant) may be restarted by calling `yyrestart'.
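
   Checking the result of the init call might look like the sketch
below.  So that the fragment compiles without a generated scanner, the
init function is passed in as a pointer; `init_scanner_checked()' and
`failing_init()' are hypothetical names, and in a real program you
would simply call `yylex_init' directly and test its return value in
the same way.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Same shape as yylex_init, given that yyscan_t is a typedef for void *. */
typedef int (*init_fn)(void **);

/* Returns 0 on success; on failure reports the errno value and returns -1. */
int init_scanner_checked(init_fn init, void **scanner)
{
    if (init(scanner) != 0) {
        fprintf(stderr, "scanner init failed: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}

/* Hypothetical stand-in that fails the way an out-of-memory init would. */
int failing_init(void **scanner)
{
    (void)scanner;
    errno = ENOMEM;
    return 1;
}
```

   A caller that receives -1 must not go on to call `yylex' or
`yylex_destroy' with the uninitialized pointer.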

   Below is an example of a program that creates a scanner, uses it,
then destroys it when done:

         int main ()
         {
             yyscan_t scanner;
             int tok;

             yylex_init(&scanner);

             while ((tok=yylex(scanner)) > 0)
                 printf("tok=%d  yytext=%s\n", tok, yyget_text(scanner));

             yylex_destroy(scanner);
             return 0;
         }


File: flex.info,  Node: Accessor Methods,  Next: Extra Data,  Prev: Init and Destroy Functions,  Up: Reentrant Detail

19.4.5 Accessing Variables with Reentrant Scanners
--------------------------------------------------

Accessor methods (get/set functions) provide access to common `flex'
variables.

   Many scanners that you build will be part of a larger project.
Portions of your project will need access to `flex' values, such as
`yytext'.  In a non-reentrant scanner, these values are global, so
there is no problem accessing them.  However, in a reentrant scanner,
there are no global `flex' values.  You cannot access them directly.
Instead, you must access `flex' values using accessor methods (get/set
functions).  Each accessor method is named `yyget_NAME' or `yyset_NAME',
where `NAME' is the name of the `flex' variable you want.  For example:

         /* Set the last character of yytext to NUL. */
         void chop ( yyscan_t scanner )
         {
             int len = yyget_leng( scanner );
             yyget_text( scanner )[len - 1] = '\0';
         }

   The above code may be called from within an action like this:

         %%
         .+\n    { chop( yyscanner );}

   You may find that `%option header-file' is particularly useful for
generating prototypes of all the accessor functions.  *Note
option-header::.


File: flex.info,  Node: Extra Data,  Next: About yyscan_t,  Prev: Accessor Methods,  Up: Reentrant Detail

19.4.6 Extra Data
-----------------

User-specific data can be stored in `yyextra'.
3426 3427 In a reentrant scanner, it is unwise to use global variables to 3428communicate with or maintain state between different pieces of your 3429program. However, you may need access to external data or invoke 3430external functions from within the scanner actions. Likewise, you may 3431need to pass information to your scanner (e.g., open file descriptors, 3432or database connections). In a non-reentrant scanner, the only way to 3433do this would be through the use of global variables. `Flex' allows 3434you to store arbitrary, "extra" data in a scanner. This data is 3435accessible through the accessor methods `yyget_extra' and `yyset_extra' 3436from outside the scanner, and through the shortcut macro `yyextra' from 3437within the scanner itself. They are defined as follows: 3438 3439 #define YY_EXTRA_TYPE void* 3440 YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); 3441 void yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner); 3442 3443 In addition, an extra form of `yylex_init' is provided, 3444`yylex_init_extra'. This function is provided so that the yyextra value 3445can be accessed from within the very first yyalloc, used to allocate 3446the scanner itself. 3447 3448 By default, `YY_EXTRA_TYPE' is defined as type `void *'. You may 3449redefine this type using `%option extra-type="your_type"' in the 3450scanner: 3451 3452 /* An example of overriding YY_EXTRA_TYPE. 
*/
         %{
         #include <sys/stat.h>
         #include <unistd.h>
         %}
         %option reentrant
         %option extra-type="struct stat *"
         %%

         __filesize__     printf( "%ld", (long) yyextra->st_size  );
         __lastmod__      printf( "%ld", (long) yyextra->st_mtime );
         %%
         void scan_file( char* filename )
         {
             yyscan_t scanner;
             struct stat buf;
             FILE *in;

             in = fopen( filename, "r" );
             stat( filename, &buf );

             /* yyextra is a `struct stat *', so pass the address of buf. */
             yylex_init_extra( &buf, &scanner );
             yyset_in( in, scanner );
             yylex( scanner );
             yylex_destroy( scanner );

             fclose( in );
         }


File: flex.info,  Node: About yyscan_t,  Prev: Extra Data,  Up: Reentrant Detail

19.4.7 About yyscan_t
---------------------

`yyscan_t' is defined as:

          typedef void* yyscan_t;

   It is initialized by `yylex_init()' to point to an internal
structure.  You should never access this value directly.  In particular,
you should never attempt to free it (use `yylex_destroy()' instead).
3494 3495 3496File: flex.info, Node: Reentrant Functions, Prev: Reentrant Detail, Up: Reentrant 3497 349819.5 Functions and Macros Available in Reentrant C Scanners 3499=========================================================== 3500 3501The following Functions are available in a reentrant scanner: 3502 3503 char *yyget_text ( yyscan_t scanner ); 3504 int yyget_leng ( yyscan_t scanner ); 3505 FILE *yyget_in ( yyscan_t scanner ); 3506 FILE *yyget_out ( yyscan_t scanner ); 3507 int yyget_lineno ( yyscan_t scanner ); 3508 YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); 3509 int yyget_debug ( yyscan_t scanner ); 3510 3511 void yyset_debug ( int flag, yyscan_t scanner ); 3512 void yyset_in ( FILE * in_str , yyscan_t scanner ); 3513 void yyset_out ( FILE * out_str , yyscan_t scanner ); 3514 void yyset_lineno ( int line_number , yyscan_t scanner ); 3515 void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner ); 3516 3517 There are no "set" functions for yytext and yyleng. This is 3518intentional. 3519 3520 The following Macro shortcuts are available in actions in a reentrant 3521scanner: 3522 3523 yytext 3524 yyleng 3525 yyin 3526 yyout 3527 yylineno 3528 yyextra 3529 yy_flex_debug 3530 3531 In a reentrant C scanner, support for yylineno is always present 3532(i.e., you may access yylineno), but the value is never modified by 3533`flex' unless `%option yylineno' is enabled. This is to allow the user 3534to maintain the line count independently of `flex'. 
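
   As a sketch of what maintaining the line count yourself might look
like, the helper below counts the newlines in a matched token.  The name
`count_newlines' is hypothetical and not part of `flex'; the sketch
assumes `%option yylineno' is not enabled, so that `flex' leaves the
stored value alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the newline characters in a matched token.
 * `text' and `leng' stand for the values returned by yyget_text() and
 * yyget_leng(); only the first `leng' characters are examined. */
static int count_newlines(const char *text, int leng)
{
    int i, n = 0;

    assert(text != NULL && leng >= 0);
    for (i = 0; i < leng; i++)
        if (text[i] == '\n')
            n++;
    return n;
}
```

   After each token, code outside the scanner could then update the
count with `yyset_lineno( yyget_lineno( scanner ) + count_newlines(
yyget_text( scanner ), yyget_leng( scanner ) ), scanner );'.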

   The following functions and macros are made available when `%option
bison-bridge' (`--bison-bridge') is specified:

         YYSTYPE * yyget_lval ( yyscan_t scanner );
         void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner );
         yylval

   The following functions and macros are made available when `%option
bison-locations' (`--bison-locations') is specified:

         YYLTYPE *yyget_lloc ( yyscan_t scanner );
         void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner );
         yylloc

   Support for yylval assumes that `YYSTYPE' is a valid type.  Support
for yylloc assumes that `YYLTYPE' is a valid type.  Typically, these
types are generated by `bison', and are included in section 1 of the
`flex' input.


File: flex.info,  Node: Lex and Posix,  Next: Memory Management,  Prev: Reentrant,  Up: Top

20 Incompatibilities with Lex and Posix
***************************************

`flex' is a rewrite of the AT&T Unix _lex_ tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations.  `flex' is fully
compliant with the POSIX `lex' specification, except that when using
`%pointer' (the default), a call to `unput()' destroys the contents of
`yytext', which is counter to the POSIX specification.  In this section
we discuss all of the known areas of incompatibility between `flex',
AT&T `lex', and the POSIX specification.  `flex''s `-l' option turns on
maximum compatibility with the original AT&T `lex' implementation, at
the cost of a major loss in the generated scanner's performance.  We
note below which incompatibilities can be overcome using the `-l'
option.
`flex' is fully compatible with `lex' with the following 3574exceptions: 3575 3576 * The undocumented `lex' scanner internal variable `yylineno' is not 3577 supported unless `-l' or `%option yylineno' is used. 3578 3579 * `yylineno' should be maintained on a per-buffer basis, rather than 3580 a per-scanner (single global variable) basis. 3581 3582 * `yylineno' is not part of the POSIX specification. 3583 3584 * The `input()' routine is not redefinable, though it may be called 3585 to read characters following whatever has been matched by a rule. 3586 If `input()' encounters an end-of-file the normal `yywrap()' 3587 processing is done. A "real" end-of-file is returned by `input()' 3588 as `EOF'. 3589 3590 * Input is instead controlled by defining the `YY_INPUT()' macro. 3591 3592 * The `flex' restriction that `input()' cannot be redefined is in 3593 accordance with the POSIX specification, which simply does not 3594 specify any way of controlling the scanner's input other than by 3595 making an initial assignment to `yyin'. 3596 3597 * The `unput()' routine is not redefinable. This restriction is in 3598 accordance with POSIX. 3599 3600 * `flex' scanners are not as reentrant as `lex' scanners. In 3601 particular, if you have an interactive scanner and an interrupt 3602 handler which long-jumps out of the scanner, and the scanner is 3603 subsequently called again, you may get the following message: 3604 3605 fatal flex scanner internal error--end of buffer missed 3606 3607 To reenter the scanner, first use: 3608 3609 yyrestart( yyin ); 3610 3611 Note that this call will throw away any buffered input; usually 3612 this isn't a problem with an interactive scanner. *Note 3613 Reentrant::, for `flex''s reentrant API. 3614 3615 * Also note that `flex' C++ scanner classes _are_ reentrant, so if 3616 using C++ is an option for you, you should use them instead. 3617 *Note Cxx::, and *note Reentrant:: for details. 3618 3619 * `output()' is not supported. 
     Output from the ECHO macro is done to the file-pointer `yyout'
     (default `stdout').

   * `output()' is not part of the POSIX specification.

   * `lex' does not support exclusive start conditions (%x), though they
     are in the POSIX specification.

   * When definitions are expanded, `flex' encloses them in parentheses.
     With `lex', the following:

              NAME    [A-Z][A-Z0-9]*
              %%
              foo{NAME}?      printf( "Found it\n" );
              %%

     will not match the string `foo' because when the macro is expanded
     the rule is equivalent to `foo[A-Z][A-Z0-9]*?' and the precedence
     is such that the `?' is associated with `[A-Z0-9]*'.  With `flex',
     the rule will be expanded to `foo([A-Z][A-Z0-9]*)?' and so the
     string `foo' will match.

   * Note that if the definition begins with `^' or ends with `$' then
     it is _not_ expanded with parentheses, to allow these operators to
     appear in definitions without losing their special meanings.  But
     the `<s>', `/', and `<<EOF>>' operators cannot be used in a `flex'
     definition.

   * Using `-l' results in the `lex' behavior of no parentheses around
     the definition.

   * The POSIX specification is that the definition be enclosed in
     parentheses.

   * Some implementations of `lex' allow a rule's action to begin on a
     separate line, if the rule's pattern has trailing whitespace:

              %%
              foo|bar<space here>
                { foobar_action();}

     `flex' does not support this feature.

   * The `lex' `%r' (generate a Ratfor scanner) option is not
     supported.  It is not part of the POSIX specification.

   * After a call to `unput()', _yytext_ is undefined until the next
     token is matched, unless the scanner was built using `%array'.
     This is not the case with `lex' or the POSIX specification.  The
     `-l' option does away with this incompatibility.

   * The precedence of the `{,}' (numeric range) operator is different.
     The AT&T and POSIX specifications of `lex' interpret `abc{1,3}' as
     "match one, two, or three occurrences of `abc'", whereas `flex'
     interprets it as "match `ab' followed by one, two, or three
     occurrences of `c'".  The `-l' and `--posix' options do away with
     this incompatibility.

   * The precedence of the `^' operator is different.  `lex' interprets
     `^foo|bar' as "match either `foo' at the beginning of a line, or
     `bar' anywhere", whereas `flex' interprets it as "match either
     `foo' or `bar' if they come at the beginning of a line".  The
     latter is in agreement with the POSIX specification.

   * The special table-size declarations such as `%a' supported by
     `lex' are not required by `flex' scanners.  `flex' ignores them.

   * The name `FLEX_SCANNER' is `#define''d so scanners may be written
     for use with either `flex' or `lex'.  Scanners also include
     `YY_FLEX_MAJOR_VERSION', `YY_FLEX_MINOR_VERSION' and
     `YY_FLEX_SUBMINOR_VERSION' indicating which version of `flex'
     generated the scanner.  For example, for the 2.5.22 release, these
     defines would be 2, 5 and 22 respectively.  If the version of
     `flex' being used is a beta version, then the symbol `FLEX_BETA'
     is defined.

   * The symbols `[[' and `]]' in the code sections of the input may
     conflict with the m4 delimiters.  *Note M4 Dependency::.
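
   The `FLEX_SCANNER' and version symbols mentioned in the list above
can be tested with the preprocessor.  The following is a minimal sketch;
the function name `describe_generator' and the macro `GENERATOR_SUFFIX'
are hypothetical, and the code is meant to sit in the user code section
of a scanner (compiled anywhere else, it takes the fallback branch).

```c
#include <assert.h>
#include <stdio.h>

/* FLEX_BETA is only defined for beta releases of flex. */
#ifdef FLEX_BETA
#define GENERATOR_SUFFIX " (beta)"
#else
#define GENERATOR_SUFFIX ""
#endif

/* Hypothetical helper: write a description of the scanner generator
 * into `buf'.  Inside a flex-generated scanner FLEX_SCANNER and the
 * three version macros are defined; otherwise the fallback is used. */
static void describe_generator(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
#ifdef FLEX_SCANNER
    snprintf(buf, size, "flex %d.%d.%d%s",
             YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION,
             YY_FLEX_SUBMINOR_VERSION, GENERATOR_SUFFIX);
#else
    snprintf(buf, size, "lex or another generator");
#endif
}
```

   The same `#ifdef FLEX_SCANNER' test can guard any code that relies on
a `flex' extension, so that one input file remains acceptable to both
tools.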


   The following `flex' features are not included in `lex' or the POSIX
specification:

   * C++ scanners

   * %option

   * start condition scopes

   * start condition stacks

   * interactive/non-interactive scanners

   * yy_scan_string() and friends

   * yyterminate()

   * yy_set_interactive()

   * yy_set_bol()

   * YY_AT_BOL()

   * <<EOF>>

   * <*>

   * YY_DECL

   * YY_START

   * YY_USER_ACTION

   * YY_USER_INIT

   * #line directives

   * %{}'s around actions

   * reentrant C API

   * multiple actions on a line

   * almost all of the `flex' command-line options

   The feature "multiple actions on a line" refers to the fact that
with `flex' you can put multiple actions on the same line, separated
with semi-colons, while with `lex', the following:

         foo    handle_foo(); ++num_foos_seen;

   is (rather surprisingly) truncated to

         foo    handle_foo();

   `flex' does not truncate the action.  Actions that are not enclosed
in braces are simply terminated at the end of the line.


File: flex.info,  Node: Memory Management,  Next: Serialized Tables,  Prev: Lex and Posix,  Up: Top

21 Memory Management
********************

This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.

* Menu:

* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::


File: flex.info,  Node: The Default Memory Management,  Next: Overriding The Default Memory Management,  Prev: Memory Management,  Up: Memory Management

21.1 The Default Memory Management
==================================

Flex allocates dynamic memory during initialization, and once in a
while from within a call to yylex().
Initialization takes place during 3778the first call to yylex(). Thereafter, flex may reallocate more memory 3779if it needs to enlarge a buffer. As of version 2.5.9 Flex will clean up 3780all memory when you call `yylex_destroy' *Note faq-memory-leak::. 3781 3782 Flex allocates dynamic memory for four purposes, listed below (1) 3783 378416kB for the input buffer. 3785 Flex allocates memory for the character buffer used to perform 3786 pattern matching. Flex must read ahead from the input stream and 3787 store it in a large character buffer. This buffer is typically 3788 the largest chunk of dynamic memory flex consumes. This buffer 3789 will grow if necessary, doubling the size each time. Flex frees 3790 this memory when you call yylex_destroy(). The default size of 3791 this buffer (16384 bytes) is almost always too large. The ideal 3792 size for this buffer is the length of the longest token expected, 3793 in bytes, plus a little more. Flex will allocate a few extra 3794 bytes for housekeeping. Currently, to override the size of the 3795 input buffer you must `#define YY_BUF_SIZE' to whatever number of 3796 bytes you want. We don't plan to change this in the near future, 3797 but we reserve the right to do so if we ever add a more robust 3798 memory management API. 3799 380064kb for the REJECT state. This will only be allocated if you use REJECT. 3801 The size is large enough to hold the same number of states as 3802 characters in the input buffer. If you override the size of the 3803 input buffer (via `YY_BUF_SIZE'), then you automatically override 3804 the size of this buffer as well. 3805 3806100 bytes for the start condition stack. 3807 Flex allocates memory for the start condition stack. This is the 3808 stack used for pushing start states, i.e., with yy_push_state(). 3809 It will grow if necessary. Since the states are simply integers, 3810 this stack doesn't consume much memory. This stack is not present 3811 if `%option stack' is not specified. 
You will rarely need to tune 3812 this buffer. The ideal size for this stack is the maximum depth 3813 expected. The memory for this stack is automatically destroyed 3814 when you call yylex_destroy(). *Note option-stack::. 3815 381640 bytes for each YY_BUFFER_STATE. 3817 Flex allocates memory for each YY_BUFFER_STATE. The buffer state 3818 itself is about 40 bytes, plus an additional large character 3819 buffer (described above.) The initial buffer state is created 3820 during initialization, and with each call to yy_create_buffer(). 3821 You can't tune the size of this, but you can tune the character 3822 buffer as described above. Any buffer state that you explicitly 3823 create by calling yy_create_buffer() is _NOT_ destroyed 3824 automatically. You must call yy_delete_buffer() to free the 3825 memory. The exception to this rule is that flex will delete the 3826 current buffer automatically when you call yylex_destroy(). If you 3827 delete the current buffer, be sure to set it to NULL. That way, 3828 flex will not try to delete the buffer a second time (possibly 3829 crashing your program!) At the time of this writing, flex does not 3830 provide a growable stack for the buffer states. You have to 3831 manage that yourself. *Note Multiple Input Buffers::. 3832 383384 bytes for the reentrant scanner guts 3834 Flex allocates about 84 bytes for the reentrant scanner structure 3835 when you call yylex_init(). It is destroyed when the user calls 3836 yylex_destroy(). 3837 3838 3839 ---------- Footnotes ---------- 3840 3841 (1) The quantities given here are approximate, and may vary due to 3842host architecture, compiler configuration, or due to future 3843enhancements to flex. 
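
   The input buffer growth described in this section (start at the
default size and double as needed) can be modelled with a few lines of
C.  This is only a sketch of the arithmetic, useful when picking a
value for `YY_BUF_SIZE'; the function name `grown_buffer_size' is
hypothetical and the code is not taken from `flex' itself.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the doubling policy: starting from `initial' bytes, double
 * the buffer until it can hold `needed' bytes.  Illustrative only; the
 * few housekeeping bytes flex adds are ignored. */
static size_t grown_buffer_size(size_t initial, size_t needed)
{
    size_t size = initial;

    assert(initial > 0);
    /* Guard against overflow while doubling. */
    assert(needed <= (size_t)-1 / 2);

    while (size < needed)
        size *= 2;
    return size;
}
```

   For instance, under this model a 40000-byte token forces the default
16384-byte buffer to double twice, while a scanner whose tokens never
exceed a few hundred bytes could `#define YY_BUF_SIZE' to a much
smaller value.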
3844 3845 3846File: flex.info, Node: Overriding The Default Memory Management, Next: A Note About yytext And Memory, Prev: The Default Memory Management, Up: Memory Management 3847 384821.2 Overriding The Default Memory Management 3849============================================= 3850 3851Flex calls the functions `yyalloc', `yyrealloc', and `yyfree' when it 3852needs to allocate or free memory. By default, these functions are 3853wrappers around the standard C functions, `malloc', `realloc', and 3854`free', respectively. You can override the default implementations by 3855telling flex that you will provide your own implementations. 3856 3857 To override the default implementations, you must do two things: 3858 3859 1. Suppress the default implementations by specifying one or more of 3860 the following options: 3861 3862 * `%option noyyalloc' 3863 3864 * `%option noyyrealloc' 3865 3866 * `%option noyyfree'. 3867 3868 2. Provide your own implementation of the following functions: (1) 3869 3870 // For a non-reentrant scanner 3871 void * yyalloc (size_t bytes); 3872 void * yyrealloc (void * ptr, size_t bytes); 3873 void yyfree (void * ptr); 3874 3875 // For a reentrant scanner 3876 void * yyalloc (size_t bytes, void * yyscanner); 3877 void * yyrealloc (void * ptr, size_t bytes, void * yyscanner); 3878 void yyfree (void * ptr, void * yyscanner); 3879 3880 3881 In the following example, we will override all three memory 3882routines. We assume that there is a custom allocator with garbage 3883collection. In order to make this example interesting, we will use a 3884reentrant scanner, passing a pointer to the custom allocator through 3885`yyextra'. 3886 3887 %{ 3888 #include "some_allocator.h" 3889 %} 3890 3891 /* Suppress the default implementations. */ 3892 %option noyyalloc noyyrealloc noyyfree 3893 %option reentrant 3894 3895 /* Initialize the allocator. 
*/
         #define YY_EXTRA_TYPE  struct allocator*
         #define YY_USER_INIT  yyextra = allocator_create();

         %%
         .|\n   ;
         %%

         /* Provide our own implementations. */
         void * yyalloc (size_t bytes, void* yyscanner) {
             return allocator_alloc (yyextra, bytes);
         }

         void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) {
             /* Pass ptr through so the allocator knows which block to resize. */
             return allocator_realloc (yyextra, ptr, bytes);
         }

         void yyfree (void * ptr, void * yyscanner) {
             /* Do nothing -- we leave it to the garbage collector. */
         }

   ---------- Footnotes ----------

   (1) It is not necessary to override all (or any) of the memory
management routines.  You may, for example, override `yyrealloc', but
not `yyfree' or `yyalloc'.


File: flex.info,  Node: A Note About yytext And Memory,  Prev: Overriding The Default Memory Management,  Up: Memory Management

21.3 A Note About yytext And Memory
===================================

When flex finds a match, `yytext' points to the first character of the
match in the input buffer.  The string itself is part of the input
buffer, and is _NOT_ allocated separately.  The value of yytext will be
overwritten the next time yylex() is called.  In short, the value of
yytext is only valid from within the matched rule's action.

   Often, you want the value of yytext to persist for later processing,
e.g., by a parser with non-zero lookahead.  In order to preserve yytext,
you will have to copy it with strdup() or a similar function.  But this
introduces some headache because your parser is now responsible for
freeing the copy of yytext.  If you use a yacc or bison parser
(commonly used with flex), you will discover that the error recovery
mechanisms can cause memory to be leaked.

   To prevent memory leaks from strdup'd yytext, you will have to track
the memory somehow.
Our experience has shown that a garbage collection 3944mechanism or a pooled memory mechanism will save you a lot of grief 3945when writing parsers. 3946 3947 3948File: flex.info, Node: Serialized Tables, Next: Diagnostics, Prev: Memory Management, Up: Top 3949 395022 Serialized Tables 3951******************** 3952 3953A `flex' scanner has the ability to save the DFA tables to a file, and 3954load them at runtime when needed. The motivation for this feature is 3955to reduce the runtime memory footprint. Traditionally, these tables 3956have been compiled into the scanner as C arrays, and are sometimes 3957quite large. Since the tables are compiled into the scanner, the 3958memory used by the tables can never be freed. This is a waste of 3959memory, especially if an application uses several scanners, but none of 3960them at the same time. 3961 3962 The serialization feature allows the tables to be loaded at runtime, 3963before scanning begins. The tables may be discarded when scanning is 3964finished. 3965 3966* Menu: 3967 3968* Creating Serialized Tables:: 3969* Loading and Unloading Serialized Tables:: 3970* Tables File Format:: 3971 3972 3973File: flex.info, Node: Creating Serialized Tables, Next: Loading and Unloading Serialized Tables, Prev: Serialized Tables, Up: Serialized Tables 3974 397522.1 Creating Serialized Tables 3976=============================== 3977 3978You may create a scanner with serialized tables by specifying: 3979 3980 %option tables-file=FILE 3981 or 3982 --tables-file=FILE 3983 3984 These options instruct flex to save the DFA tables to the file FILE. 3985The tables will _not_ be embedded in the generated scanner. The scanner 3986will not function on its own. The scanner will be dependent upon the 3987serialized tables. You must load the tables from this file at runtime 3988before you can scan anything. 
3989 3990 If you do not specify a filename to `--tables-file', the tables will 3991be saved to `lex.yy.tables', where `yy' is the appropriate prefix. 3992 3993 If your project uses several different scanners, you can concatenate 3994the serialized tables into one file, and flex will find the correct set 3995of tables, using the scanner prefix as part of the lookup key. An 3996example follows: 3997 3998 $ flex --tables-file --prefix=cpp cpp.l 3999 $ flex --tables-file --prefix=c c.l 4000 $ cat lex.cpp.tables lex.c.tables > all.tables 4001 4002 The above example created two scanners, `cpp', and `c'. Since we did 4003not specify a filename, the tables were serialized to `lex.c.tables' and 4004`lex.cpp.tables', respectively. Then, we concatenated the two files 4005together into `all.tables', which we will distribute with our project. 4006At runtime, we will open the file and tell flex to load the tables from 4007it. Flex will find the correct tables automatically. (See next 4008section). 4009 4010 4011File: flex.info, Node: Loading and Unloading Serialized Tables, Next: Tables File Format, Prev: Creating Serialized Tables, Up: Serialized Tables 4012 401322.2 Loading and Unloading Serialized Tables 4014============================================ 4015 4016If you've built your scanner with `%option tables-file', then you must 4017load the scanner tables at runtime. This can be accomplished with the 4018following function: 4019 4020 -- Function: int yytables_fload (FILE* FP [, yyscan_t SCANNER]) 4021 Locates scanner tables in the stream pointed to by FP and loads 4022 them. Memory for the tables is allocated via `yyalloc'. You must 4023 call this function before the first call to `yylex'. The argument 4024 SCANNER only appears in the reentrant scanner. This function 4025 returns `0' (zero) on success, or non-zero on error. 4026 4027 The loaded tables are *not* automatically destroyed (unloaded) when 4028you call `yylex_destroy'. 
The reason is that you may create several 4029scanners of the same type (in a reentrant scanner), each of which needs 4030access to these tables. To avoid a nasty memory leak, you must call 4031the following function: 4032 4033 -- Function: int yytables_destroy ([yyscan_t SCANNER]) 4034 Unloads the scanner tables. The tables must be loaded again before 4035 you can scan any more data. The argument SCANNER only appears in 4036 the reentrant scanner. This function returns `0' (zero) on 4037 success, or non-zero on error. 4038 4039 *The functions `yytables_fload' and `yytables_destroy' are not 4040thread-safe.* You must ensure that these functions are called exactly 4041once (for each scanner type) in a threaded program, before any thread 4042calls `yylex'. After the tables are loaded, they are never written to, 4043and no thread protection is required thereafter - until you destroy 4044them. 4045 4046 4047File: flex.info, Node: Tables File Format, Prev: Loading and Unloading Serialized Tables, Up: Serialized Tables 4048 404922.3 Tables File Format 4050======================= 4051 4052This section defines the file format of serialized `flex' tables. 4053 4054 The tables format allows for one or more sets of tables to be 4055specified, where each set corresponds to a given scanner. Scanners are 4056indexed by name, as described below. The file format is as follows: 4057 4058 TABLE SET 1 4059 +-------------------------------+ 4060 Header | uint32 th_magic; | 4061 | uint32 th_hsize; | 4062 | uint32 th_ssize; | 4063 | uint16 th_flags; | 4064 | char th_version[]; | 4065 | char th_name[]; | 4066 | uint8 th_pad64[]; | 4067 +-------------------------------+ 4068 Table 1 | uint16 td_id; | 4069 | uint16 td_flags; | 4070 | uint32 td_hilen; | 4071 | uint32 td_lolen; | 4072 | void td_data[]; | 4073 | uint8 td_pad64[]; | 4074 +-------------------------------+ 4075 Table 2 | | 4076 . . . 4077 . . . 4078 . . . 4079 . . . 
4080 Table n | | 4081 +-------------------------------+ 4082 TABLE SET 2 4083 . 4084 . 4085 . 4086 TABLE SET N 4087 4088 The above diagram shows that a complete set of tables consists of a 4089header followed by multiple individual tables. Furthermore, multiple 4090complete sets may be present in the same file, each set with its own 4091header and tables. The sets are contiguous in the file. The only way to 4092know if another set follows is to check the next four bytes for the 4093magic number (or check for EOF). The header and tables sections are 4094padded to 64-bit boundaries. Below we describe each field in detail. 4095This format does not specify how the scanner will expand the given 4096data, i.e., data may be serialized as int8, but expanded to an int32 4097array at runtime. This is to reduce the size of the serialized data 4098where possible. Remember, _all integer values are in network byte 4099order_. 4100 4101Fields of a table header: 4102 4103`th_magic' 4104 Magic number, always 0xF13C57B1. 4105 4106`th_hsize' 4107 Size of this entire header, in bytes, including all fields plus 4108 any padding. 4109 4110`th_ssize' 4111 Size of this entire set, in bytes, including the header, all 4112 tables, plus any padding. 4113 4114`th_flags' 4115 Bit flags for this table set. Currently unused. 4116 4117`th_version[]' 4118 Flex version in NULL-terminated string format. e.g., `2.5.13a'. 4119 This is the version of flex that was used to create the serialized 4120 tables. 4121 4122`th_name[]' 4123 Contains the name of this table set. The default is `yytables', 4124 and is prefixed accordingly, e.g., `footables'. Must be 4125 NULL-terminated. 4126 4127`th_pad64[]' 4128 Zero or more NULL bytes, padding the entire header to the next 4129 64-bit boundary as calculated from the beginning of the header. 4130 4131Fields of a table: 4132 4133`td_id' 4134 Specifies the table identifier. 
Possible values are: 4135 `YYTD_ID_ACCEPT (0x01)' 4136 `yy_accept' 4137 4138 `YYTD_ID_BASE (0x02)' 4139 `yy_base' 4140 4141 `YYTD_ID_CHK (0x03)' 4142 `yy_chk' 4143 4144 `YYTD_ID_DEF (0x04)' 4145 `yy_def' 4146 4147 `YYTD_ID_EC (0x05)' 4148 `yy_ec ' 4149 4150 `YYTD_ID_META (0x06)' 4151 `yy_meta' 4152 4153 `YYTD_ID_NUL_TRANS (0x07)' 4154 `yy_NUL_trans' 4155 4156 `YYTD_ID_NXT (0x08)' 4157 `yy_nxt'. This array may be two dimensional. See the 4158 `td_hilen' field below. 4159 4160 `YYTD_ID_RULE_CAN_MATCH_EOL (0x09)' 4161 `yy_rule_can_match_eol' 4162 4163 `YYTD_ID_START_STATE_LIST (0x0A)' 4164 `yy_start_state_list'. This array is handled specially 4165 because it is an array of pointers to structs. See the 4166 `td_flags' field below. 4167 4168 `YYTD_ID_TRANSITION (0x0B)' 4169 `yy_transition'. This array is handled specially because it 4170 is an array of structs. See the `td_lolen' field below. 4171 4172 `YYTD_ID_ACCLIST (0x0C)' 4173 `yy_acclist' 4174 4175`td_flags' 4176 Bit flags describing how to interpret the data in `td_data'. The 4177 data arrays are one-dimensional by default, but may be two 4178 dimensional as specified in the `td_hilen' field. 4179 4180 `YYTD_DATA8 (0x01)' 4181 The data is serialized as an array of type int8. 4182 4183 `YYTD_DATA16 (0x02)' 4184 The data is serialized as an array of type int16. 4185 4186 `YYTD_DATA32 (0x04)' 4187 The data is serialized as an array of type int32. 4188 4189 `YYTD_PTRANS (0x08)' 4190 The data is a list of indexes of entries in the expanded 4191 `yy_transition' array. Each index should be expanded to a 4192 pointer to the corresponding entry in the `yy_transition' 4193 array. We count on the fact that the `yy_transition' array 4194 has already been seen. 4195 4196 `YYTD_STRUCT (0x10)' 4197 The data is a list of yy_trans_info structs, each of which 4198 consists of two integers. There is no padding between struct 4199 elements or between structs. The type of each member is 4200 determined by the `YYTD_DATA*' bits. 
4201 4202`td_hilen' 4203 If `td_hilen' is non-zero, then the data is a two-dimensional 4204 array. Otherwise, the data is a one-dimensional array. `td_hilen' 4205 contains the number of elements in the higher dimensional array, 4206 and `td_lolen' contains the number of elements in the lowest 4207 dimension. 4208 4209 Conceptually, `td_data' is either `sometype td_data[td_lolen]', or 4210 `sometype td_data[td_hilen][td_lolen]', where `sometype' is 4211 specified by the `td_flags' field. It is possible for both 4212 `td_lolen' and `td_hilen' to be zero, in which case `td_data' is a 4213 zero length array, and no data is loaded, i.e., this table is 4214 simply skipped. Flex does not currently generate tables of zero 4215 length. 4216 4217`td_lolen' 4218 Specifies the number of elements in the lowest dimension array. If 4219 this is a one-dimensional array, then it is simply the number of 4220 elements in this array. The element size is determined by the 4221 `td_flags' field. 4222 4223`td_data[]' 4224 The table data. This array may be a one- or two-dimensional array, 4225 of type `int8', `int16', `int32', `struct yy_trans_info', or 4226 `struct yy_trans_info*', depending upon the values in the 4227 `td_flags', `td_hilen', and `td_lolen' fields. 4228 4229`td_pad64[]' 4230 Zero or more NULL bytes, padding the entire table to the next 4231 64-bit boundary as calculated from the beginning of this table. 4232 4233 4234File: flex.info, Node: Diagnostics, Next: Limitations, Prev: Serialized Tables, Up: Top 4235 423623 Diagnostics 4237************** 4238 4239The following is a list of `flex' diagnostic messages: 4240 4241 * `warning, rule cannot be matched' indicates that the given rule 4242 cannot be matched because it follows other rules that will always 4243 match the same text as it. 
     For example, in the following `foo'
     cannot be matched because it comes after an identifier "catch-all"
     rule:

          [a-z]+    got_identifier();
          foo       got_foo();

     Using `REJECT' in a scanner suppresses this warning.

   * `warning, -s option given but default rule can be matched' means
     that it is possible (perhaps only in a particular start condition)
     that the default rule (match any single character) is the only one
     that will match a particular input. Since `-s' was given,
     presumably this is not intended.

   * `reject_used_but_not_detected undefined' or
     `yymore_used_but_not_detected undefined'. These errors can occur
     at compile time. They indicate that the scanner uses `REJECT' or
     `yymore()' but that `flex' failed to notice the fact, meaning that
     `flex' scanned the first two sections looking for occurrences of
     these actions and failed to find any, but somehow you snuck some in
     (via a #include file, for example). Use `%option reject' or
     `%option yymore' to indicate to `flex' that you really do use
     these features.

   * `flex scanner jammed'. A scanner compiled with `-s' has
     encountered an input string which wasn't matched by any of its
     rules. This error can also occur due to internal problems.

   * `token too large, exceeds YYLMAX'. Your scanner uses `%array' and
     one of its rules matched a string longer than the `YYLMAX'
     constant (8K bytes by default). You can increase the value by
     #define'ing `YYLMAX' in the definitions section of your `flex'
     input.

   * `scanner requires -8 flag to use the character 'x''. Your scanner
     specification includes recognizing the 8-bit character `'x'' and
     you did not specify the -8 flag, and your scanner defaulted to
     7-bit because you used the `-Cf' or `-CF' table compression
     options. See the discussion of the `-7' flag, *note Scanner
     Options::, for details.

   * `flex scanner push-back overflow'. You used `unput()' to push back
     so much text that the scanner's buffer could not hold both the
     pushed-back text and the current token in `yytext'. Ideally the
     scanner should dynamically resize the buffer in this case, but at
     present it does not.

   * `input buffer overflow, can't enlarge buffer because scanner uses
     REJECT'. The scanner was working on matching an extremely large
     token and needed to expand the input buffer. This doesn't work
     with scanners that use `REJECT'.

   * `fatal flex scanner internal error--end of buffer missed'. This can
     occur in a scanner which is reentered after a long-jump has jumped
     out (or over) the scanner's activation frame. Before reentering
     the scanner, use:
              yyrestart( yyin );
     or, as noted above, switch to using the C++ scanner class.

   * `too many start conditions in <> construct!' You listed more start
     conditions in a <> construct than exist (so you must have listed at
     least one of them twice).


File: flex.info, Node: Limitations, Next: Bibliography, Prev: Diagnostics, Up: Top

24 Limitations
**************

Some trailing context patterns cannot be properly matched and generate
warning messages (`dangerous trailing context'). These are patterns
where the ending of the first part of the rule matches the beginning of
the second part, such as `zx*/xy*', where the 'x*' matches the 'x' at
the beginning of the trailing context. (Note that the POSIX draft
states that the text matched by such patterns is undefined.) For some
trailing context rules, parts which are actually fixed-length are not
recognized as such, leading to the above-mentioned performance loss. In
particular, parts using `|' or `{n}' (such as `foo{3}') are always
considered variable-length.
Combining trailing context with the
special `|' action can result in _fixed_ trailing context being turned
into the more expensive _variable_ trailing context. For example, this
can happen in the following:

     %%
     abc      |
     xyz/def

   Other limitations:

   * Use of `unput()' invalidates `yytext' and `yyleng', unless the
     `%array' directive or the `-l' option has been used.

   * Pattern-matching of `NUL's is substantially slower than matching
     other characters.

   * Dynamic resizing of the input buffer is slow, as it entails
     rescanning all the text matched so far by the current (generally
     huge) token.

   * Due to both buffering of input and read-ahead, you cannot intermix
     calls to `<stdio.h>' routines, such as `getchar()', with `flex'
     rules and expect it to work. Call `input()' instead.

   * The total number of table entries listed by the `-v' flag excludes
     the entries needed to determine which rule has been matched. The
     number of those entries is equal to the number of DFA states if
     the scanner does not use `REJECT', and somewhat greater than the
     number of states if it does.

   * `REJECT' cannot be used with the `-f' or `-F' options.

   The `flex' internal algorithms need documentation.


File: flex.info, Node: Bibliography, Next: FAQ, Prev: Limitations, Up: Top

25 Additional Reading
*********************

You may wish to read more about the following programs:
   * lex

   * yacc

   * sed

   * awk

   The following books may contain material of interest:

   John Levine, Tony Mason, and Doug Brown, _Lex & Yacc_, O'Reilly and
Associates. Be sure to get the 2nd edition.

   M. E. Lesk and E. Schmidt, _LEX - Lexical Analyzer Generator_

   Alfred Aho, Ravi Sethi and Jeffrey Ullman, _Compilers: Principles,
Techniques and Tools_, Addison-Wesley (1986). Describes the
pattern-matching techniques used by `flex' (deterministic finite
automata).


File: flex.info, Node: FAQ, Next: Appendices, Prev: Bibliography, Up: Top

FAQ
***

From time to time, the `flex' maintainer receives certain questions.
Rather than repeat answers to well-understood problems, we publish them
here.

* Menu:

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::


File: flex.info, Node: When was flex born?, Next: How do I expand backslash-escape sequences in C-style quoted strings?, Up: FAQ

When was flex born?
===================

Vern Paxson took over the `Software Tools' lex project from Jef
Poskanzer in 1982. At that point it was written in Ratfor. Around
1987 or so, Paxson translated it into C, and a legend was born :-).


File: flex.info, Node: How do I expand backslash-escape sequences in C-style quoted strings?, Next: Why do flex scanners call fileno if it is not ANSI compatible?, Prev: When was flex born?, Up: FAQ

How do I expand backslash-escape sequences in C-style quoted strings?
=====================================================================

A key point when scanning quoted strings is that you cannot (easily)
write a single rule that will precisely match the string if you allow
things like embedded escape sequences and newlines. If you try to
match strings with a single rule then you'll wind up having to rescan
the string anyway to find any escape sequences.

   Instead you can use exclusive start conditions and a set of rules,
one for matching non-escaped text, one for matching a single escape,
one for matching an embedded newline, and one for recognizing the end
of the string. Each of these rules is then faced with the question of
where to put its intermediary results. The best solution is for the
rules to append their local value of `yytext' to the end of a "string
literal" buffer. A rule like the escape-matcher will append to the
buffer the meaning of the escape sequence rather than the literal text
in `yytext'. In this way, `yytext' does not need to be modified at all.


File: flex.info, Node: Why do flex scanners call fileno if it is not ANSI compatible?, Next: Does flex support recursive pattern definitions?, Prev: How do I expand backslash-escape sequences in C-style quoted strings?, Up: FAQ

Why do flex scanners call fileno if it is not ANSI compatible?
==============================================================

Flex scanners call `fileno()' in order to get the file descriptor
corresponding to `yyin'. The file descriptor may be passed to
`isatty()' or `read()', depending upon which `%options' you specified.
If your system does not have `fileno()' support, to get rid of the
`read()' call, do not specify `%option read'. To get rid of the
`isatty()' call, you must specify one of `%option always-interactive' or
`%option never-interactive'.


File: flex.info, Node: Does flex support recursive pattern definitions?, Next: How do I skip huge chunks of input (tens of megabytes) while using flex?, Prev: Why do flex scanners call fileno if it is not ANSI compatible?, Up: FAQ

Does flex support recursive pattern definitions?
================================================

e.g.,

     %%
     block   "{"({block}|{statement})*"}"

   No. You cannot have recursive definitions.
The pattern-matching
power of regular expressions in general (and therefore flex scanners,
too) is limited. In particular, regular expressions cannot "balance"
parentheses to an arbitrary degree. For example, it's impossible to
write a regular expression that matches exactly those strings
containing the same number of '{'s as '}'s. For more powerful pattern
matching, you need a parser, such as `GNU bison'.


File: flex.info, Node: How do I skip huge chunks of input (tens of megabytes) while using flex?, Next: Flex is not matching my patterns in the same order that I defined them., Prev: Does flex support recursive pattern definitions?, Up: FAQ

How do I skip huge chunks of input (tens of megabytes) while using flex?
========================================================================

Use `fseek()' (or `lseek()') to position `yyin', then call `yyrestart()'.


File: flex.info, Node: Flex is not matching my patterns in the same order that I defined them., Next: My actions are executing out of order or sometimes not at all., Prev: How do I skip huge chunks of input (tens of megabytes) while using flex?, Up: FAQ

Flex is not matching my patterns in the same order that I defined them.
=======================================================================

`flex' picks the rule that matches the most text (i.e., the longest
possible input string). This is because `flex' uses an entirely
different matching technique ("deterministic finite automata") that
actually does all of the matching simultaneously, in parallel. (Seems
impossible, but it's actually a fairly simple technique once you
understand the principles.)

   A side-effect of this parallel matching is that when the input
matches more than one rule, `flex' scanners pick the rule that matched
the _most_ text. This is explained further in the manual, in the
section *Note Matching::.
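
   The selection rule can be illustrated with a small C simulation.
This is only a sketch of the outcome, not of how `flex' works
internally (a generated scanner runs one DFA rather than trying each
rule in turn). The two hand-written matchers stand in for a
hypothetical keyword rule `data_' followed by a hypothetical identifier
rule `[a-z_]+'; all function names here are assumptions made for the
illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Number of characters matched by the (hypothetical) rule:  data_  */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "data_", 5) == 0 ? 5 : 0;
}

/* Number of characters matched by the (hypothetical) rule:  [a-z_]+  */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Returns the index of the winning rule (0 = keyword, 1 = identifier),
 * or -1 if neither rule matches.  The longest match wins; on a tie the
 * rule listed first wins.  *len receives the length of the match. */
static int pick_rule(const char *s, size_t *len)
{
    size_t lens[2];
    int best = -1;
    int i;

    lens[0] = match_keyword(s);
    lens[1] = match_identifier(s);
    *len = 0;
    for (i = 0; i < 2; i++) {
        if (lens[i] > *len) {   /* strict '>' keeps the earlier rule on a tie */
            *len = lens[i];
            best = i;
        }
    }
    return best;
}
```

   On the input `data_block' the identifier rule wins because it
matches ten characters against the keyword's five, even though the
keyword rule is listed first. On the input `data_' alone both rules
match five characters, so the earlier rule is chosen.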

   If you want `flex' to choose a shorter match, then you can work
around this behavior by expanding your short rule to match more text,
then put back the extra:

     data_.*        yyless( 5 ); BEGIN BLOCKIDSTATE;

   Another fix would be to make the second rule active only during the
`<BLOCKIDSTATE>' start condition, and make that start condition
exclusive by declaring it with `%x' instead of `%s'.

   A final fix is to change the input language so that the ambiguity for
`data_' is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as `_') from the
identifier rule so it no longer matches `data_'. (Of course, you might
also not have the option of changing the input language.)


File: flex.info, Node: My actions are executing out of order or sometimes not at all., Next: How can I have multiple input sources feed into the same scanner at the same time?, Prev: Flex is not matching my patterns in the same order that I defined them., Up: FAQ

My actions are executing out of order or sometimes not at all.
==============================================================

Most likely, you have (in error) placed the opening `{' of the action
block on a different line than the rule, e.g.,

     ^(foo|bar)
     {  <<<--- WRONG!

     }

   `flex' requires that the opening `{' of an action associated with a
rule begin on the same line as does the rule. You need instead to
write your rules as follows:

     ^(foo|bar) {  // CORRECT!

     }


File: flex.info, Node: How can I have multiple input sources feed into the same scanner at the same time?, Next: Can I build nested parsers that work with the same input file?, Prev: My actions are executing out of order or sometimes not at all., Up: FAQ

How can I have multiple input sources feed into the same scanner at the same time?
==================================================================================

If ...
   * your scanner is free of backtracking (verified using `flex''s `-b'
     flag),

   * AND you run your scanner interactively (`-I' option; default
     unless using special table compression options),

   * AND you feed it one character at a time by redefining `YY_INPUT'
     to do so,

   then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use `select()' at that point and only call `yylex()' for
another token if `select()' indicates there's data available.

   That is, move the `select()' out from the input function to a point
where it determines whether `yylex()' gets called for the next token.

   With this approach, you will still have problems if your input can
arrive piecemeal; `select()' could inform you that the beginning of a
token is available, you call `yylex()' to get it, but it winds up
blocking waiting for the later characters in the token.

   Here's another way: Move your input multiplexing inside of
`YY_INPUT'. That is, whenever `YY_INPUT' is called, it `select()''s to
see where input is available. If input is available for the scanner,
it reads and returns the next byte. If input is available from another
source, it calls whatever function is responsible for reading from that
source. (If no input is available, it blocks until some input is
available.) I've used this technique in an interpreter I wrote that
both reads keyboard input using a `flex' scanner and IPC traffic from
sockets, and it works fine.
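
   The first approach, checking with `select()' before deciding
whether to call `yylex()', might be sketched as follows. This is a
minimal sketch assuming a POSIX system; `input_ready' is a hypothetical
helper name and is not part of `flex'.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if `fd' has data ready to read, 0 if the timeout expired
 * first, and -1 if select() failed.  A timeout of 0 polls without
 * blocking.  (Hypothetical helper, not part of flex.) */
static int input_ready(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return (rc > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

   A caller would then request the next token only when data is
waiting, for instance `if (input_ready(fileno(yyin), 0) == 1) token =
yylex();'. The caveat above still applies: readiness only means that
some bytes have arrived, not that a whole token has.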


File: flex.info, Node: Can I build nested parsers that work with the same input file?, Next: How can I match text only at the end of a file?, Prev: How can I have multiple input sources feed into the same scanner at the same time?, Up: FAQ

Can I build nested parsers that work with the same input file?
==============================================================

This is not going to work without some additional effort. The reason is
that `flex' block-buffers the input it reads from `yyin'. This means
that the "outermost" `yylex()', when called, will automatically slurp
up the first 8K of input available on `yyin', and subsequent calls to
other `yylex()''s won't see that input. You might be tempted to work
around this problem by redefining `YY_INPUT' to only return a small
amount of text, but it turns out that that approach is quite difficult.
Instead, the best solution is to combine all of your scanners into one
large scanner, using a different exclusive start condition for each.


File: flex.info, Node: How can I match text only at the end of a file?, Next: How can I make REJECT cascade across start condition boundaries?, Prev: Can I build nested parsers that work with the same input file?, Up: FAQ

How can I match text only at the end of a file?
===============================================

There is no way to write a rule which is "match this text, but only if
it comes at the end of the file". You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input. Then you redefine `YY_INPUT' to call your own routine which, if
it sees an `EOF', returns the magic character first (and remembers to
return a real `EOF' next time it's called).
Then you could write:

     <COMMENT>(.|\n)*{EOF_CHAR}    /* saw comment at EOF */


File: flex.info, Node: How can I make REJECT cascade across start condition boundaries?, Next: Why cant I use fast or full tables with interactive mode?, Prev: How can I match text only at the end of a file?, Up: FAQ

How can I make REJECT cascade across start condition boundaries?
================================================================

You can do this as follows. Suppose you have a start condition `A', and
after exhausting all of the possible matches in `<A>', you want to try
matches in `<INITIAL>'. Then you could use the following:

     %x A
     %%
     <A>rule_that_is_long    ...; REJECT;
     <A>rule                 ...; REJECT; /* shorter rule */
     <A>etc.
     ...
     <A>.|\n  {
         /* Shortest and last rule in <A>, so
          * cascaded REJECTs will eventually
          * wind up matching this rule. We want
          * to now switch to the initial state
          * and try matching from there instead.
          */
         yyless(0);    /* put back matched text */
         BEGIN(INITIAL);
     }


File: flex.info, Node: Why cant I use fast or full tables with interactive mode?, Next: How much faster is -F or -f than -C?, Prev: How can I make REJECT cascade across start condition boundaries?, Up: FAQ

Why can't I use fast or full tables with interactive mode?
==========================================================

One of the assumptions flex makes is that interactive applications are
inherently slow (they're waiting on a human after all). It has to do
with how the scanner detects that it must be finished scanning a token.
For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a
chance of another input character possibly extending the length of the
match. If not, the scanner halts.
For non-interactive scanners, the
end-of-token test is much simpler, basically a compare with 0, so no
memory bus cycles. Since the test occurs in the innermost scanning
loop, one would like to make it go as fast as possible.

   Still, it seems reasonable to allow the user to choose to trade off
a bit of performance in this area to gain the corresponding
flexibility. There might be another reason, though, why fast scanners
don't support the interactive option.


File: flex.info, Node: How much faster is -F or -f than -C?, Next: If I have a simple grammar cant I just parse it with flex?, Prev: Why cant I use fast or full tables with interactive mode?, Up: FAQ

How much faster is -F or -f than -C?
====================================

Much faster (factor of 2-3).


File: flex.info, Node: If I have a simple grammar cant I just parse it with flex?, Next: Why doesn't yyrestart() set the start state back to INITIAL?, Prev: How much faster is -F or -f than -C?, Up: FAQ

If I have a simple grammar can't I just parse it with flex?
===========================================================

Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a
scanner alone.


File: flex.info, Node: Why doesn't yyrestart() set the start state back to INITIAL?, Next: How can I match C-style comments?, Prev: If I have a simple grammar cant I just parse it with flex?, Up: FAQ

Why doesn't yyrestart() set the start state back to INITIAL?
============================================================

There are two reasons. The first is that there might be programs that
rely on the start state not changing across file changes.
The second
is that beginning with `flex' version 2.4, use of `yyrestart()' is no
longer required, so fixing the problem there doesn't solve the more
general problem.


File: flex.info, Node: How can I match C-style comments?, Next: The period isn't working the way I expected., Prev: Why doesn't yyrestart() set the start state back to INITIAL?, Up: FAQ

How can I match C-style comments?
=================================

You might be tempted to try something like this:

     "/*".*"*/"       // WRONG!

   or, worse, this:

     "/*"(.|\n)*"*/"  // WRONG!

   The above rules will eat too much input, and blow up on things like:

     /* a comment */ do_my_thing( "oops */" );

   Here is one way which allows you to track line information:

     <INITIAL>{
     "/*"      BEGIN(IN_COMMENT);
     }
     <IN_COMMENT>{
     "*/"      BEGIN(INITIAL);
     [^*\n]+   // eat comment in chunks
     "*"       // eat the lone star
     \n        yylineno++;
     }


File: flex.info, Node: The period isn't working the way I expected., Next: Can I get the flex manual in another format?, Prev: How can I match C-style comments?, Up: FAQ

The '.' isn't working the way I expected.
=========================================

Here are some tips for using `.':

   * A common mistake is to place the grouping parenthesis AFTER an
     operator, when you really meant to place the parenthesis BEFORE
     the operator, e.g., you probably want this `(foo|bar)+' and NOT
     this `(foo|bar+)'.

     The first pattern matches the words `foo' or `bar' any number of
     times, e.g., it matches the text `barfoofoobarfoo'. The second
     pattern matches a single instance of `foo' or a single instance of
     `bar' followed by one or more `r's, e.g., it matches the text
     `barrrr'.

   * A `.' inside `[]''s just means a literal `.' (period), and NOT "any
     character except newline".

   * Remember that `.' matches any character EXCEPT `\n' (and `EOF').
     If you really want to match ANY character, including newlines,
     then use `(.|\n)'. Beware that the regex `(.|\n)+' will match your
     entire input!

   * Finally, if you want to match a literal `.' (a period), then use
     `[.]' or `"."'.


File: flex.info, Node: Can I get the flex manual in another format?, Next: Does there exist a "faster" NDFA->DFA algorithm?, Prev: The period isn't working the way I expected., Up: FAQ

Can I get the flex manual in another format?
============================================

The `flex' source distribution includes a texinfo manual. You are free
to convert that texinfo into whatever format you desire. The `texinfo'
package includes tools for conversion to a number of formats.


File: flex.info, Node: Does there exist a "faster" NDFA->DFA algorithm?, Next: How does flex compile the DFA so quickly?, Prev: Can I get the flex manual in another format?, Up: FAQ

Does there exist a "faster" NDFA->DFA algorithm?
================================================

There's no way around the potential exponential running time - it can
take you exponential time just to enumerate all of the DFA states. In
practice, though, the running time is closer to linear, or sometimes
quadratic.


File: flex.info, Node: How does flex compile the DFA so quickly?, Next: How can I use more than 8192 rules?, Prev: Does there exist a "faster" NDFA->DFA algorithm?, Up: FAQ

How does flex compile the DFA so quickly?
=========================================

There are two big speed wins that `flex' uses:

  1. It analyzes the input rules to construct equivalence classes for
     those characters that always make the same transitions. It then
     rewrites the NFA using equivalence classes for transitions instead
     of characters.
     This cuts down the NFA->DFA computation time
     dramatically, to the point where, for uncompressed DFA tables, the
     DFA generation is often I/O bound in writing out the tables.

  2. It maintains hash values for previously computed DFA states, so
     testing whether a newly constructed DFA state is equivalent to a
     previously constructed state can be done very quickly, by first
     comparing hash values.


File: flex.info, Node: How can I use more than 8192 rules?, Next: How do I abandon a file in the middle of a scan and switch to a new file?, Prev: How does flex compile the DFA so quickly?, Up: FAQ

How can I use more than 8192 rules?
===================================

`Flex' is compiled with an upper limit of 8192 rules per scanner. If
you need more than 8192 rules in your scanner, you'll have to recompile
`flex' with the following changes in `flexdef.h':

     <    #define YY_TRAILING_MASK 0x2000
     <    #define YY_TRAILING_HEAD_MASK 0x4000
     --
     >    #define YY_TRAILING_MASK 0x20000000
     >    #define YY_TRAILING_HEAD_MASK 0x40000000

   This should work okay as long as your C compiler uses 32 bit
integers. But you might want to think about whether using such a huge
number of rules is the best way to solve your problem.

   The following may also be relevant:

   With luck, you should be able to increase the definitions in
flexdef.h for:

     #define JAMSTATE -32766 /* marks a reference to the state that always jams */
     #define MAXIMUM_MNS 31999
     #define BAD_SUBSCRIPT -32767

   recompile everything, and it'll all work. Flex only has these
16-bit-like values built into it because a long time ago it was
developed on a machine with 16-bit ints. I've given this advice to
others in the past but haven't heard back from them whether it worked
okay or not...
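
   The arithmetic behind the two mask values can be checked with a
short C fragment. It rests on an assumption about why the limit is
8192, namely that rule numbers share an integer with the two flag bits
and therefore must stay below the lower one; the macro and function
names below are made up for the illustration and are not taken from
`flexdef.h'.

```c
#include <limits.h>

/* Mask values copied from the diff above, under illustrative names. */
#define OLD_TRAILING_MASK       0x2000L
#define OLD_TRAILING_HEAD_MASK  0x4000L
#define NEW_TRAILING_MASK       0x20000000L
#define NEW_TRAILING_HEAD_MASK  0x40000000L

/* Largest rule number that still fits below a given flag bit
 * (assumption: rule numbers occupy the bits under the mask). */
static long max_rule_below(long mask)
{
    return mask - 1;
}

/* The enlarged masks need an int of at least 32 bits, which is the
 * condition stated in the text above. */
static int int_is_wide_enough(void)
{
    return sizeof(int) * CHAR_BIT >= 32;
}
```

   `0x2000' is 8192, which matches the documented limit, and
`0x20000000' is 536870912. Both enlarged flag values still fit in a
signed 32-bit integer, since `0x40000000' is below 2^31.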


File: flex.info, Node: How do I abandon a file in the middle of a scan and switch to a new file?, Next: How do I execute code only during initialization (only before the first scan)?, Prev: How can I use more than 8192 rules?, Up: FAQ

How do I abandon a file in the middle of a scan and switch to a new file?
=========================================================================

Just call `yyrestart(newfile)'. Be sure to reset the start state if you
want a "fresh start", since `yyrestart' does NOT reset the start state
back to `INITIAL'.


File: flex.info, Node: How do I execute code only during initialization (only before the first scan)?, Next: How do I execute code at termination?, Prev: How do I abandon a file in the middle of a scan and switch to a new file?, Up: FAQ

How do I execute code only during initialization (only before the first scan)?
==============================================================================

You can specify an initial action by defining the macro `YY_USER_INIT'
(though note that `yyout' may not be available at the time this macro
is executed). Or you can add to the beginning of your rules section:

     %%
         /* Must be indented! */
         static int did_init = 0;

         if ( ! did_init ){
             do_my_init();
             did_init = 1;
         }


File: flex.info, Node: How do I execute code at termination?, Next: Where else can I find help?, Prev: How do I execute code only during initialization (only before the first scan)?, Up: FAQ

How do I execute code at termination?
=====================================

You can specify an action for the `<<EOF>>' rule.


File: flex.info, Node: Where else can I find help?, Next: Can I include comments in the "rules" section of the file?, Prev: How do I execute code at termination?, Up: FAQ

Where else can I find help?
===========================

You can find the flex homepage on the web at
`http://flex.sourceforge.net/'. See that page for details about flex
mailing lists as well.


File: flex.info, Node: Can I include comments in the "rules" section of the file?, Next: I get an error about undefined yywrap()., Prev: Where else can I find help?, Up: FAQ

Can I include comments in the "rules" section of the file?
==========================================================

Yes, just about anywhere you want to. See the section `Comments in the
Input' of this manual for the specific syntax.


File: flex.info, Node: I get an error about undefined yywrap()., Next: How can I change the matching pattern at run time?, Prev: Can I include comments in the "rules" section of the file?, Up: FAQ

I get an error about undefined yywrap().
========================================

You must supply a `yywrap()' function of your own, or link to `libfl.a'
(which provides one), or use

     %option noyywrap

   in your source to say you don't want a `yywrap()' function.


File: flex.info, Node: How can I change the matching pattern at run time?, Next: How can I expand macros in the input?, Prev: I get an error about undefined yywrap()., Up: FAQ

How can I change the matching pattern at run time?
==================================================

You can't. The pattern is compiled into a static table when flex builds
the scanner.


File: flex.info, Node: How can I expand macros in the input?, Next: How can I build a two-pass scanner?, Prev: How can I change the matching pattern at run time?, Up: FAQ

How can I expand macros in the input?
=====================================

The best way to approach this problem is at a higher level, e.g., in
the parser.

   However, you can do this using multiple input buffers.

     %%
     macro/[a-z]+    {
         /* Saw the macro "macro" followed by extra stuff. */
         main_buffer = YY_CURRENT_BUFFER;
         expansion_buffer = yy_scan_string(expand(yytext));
         yy_switch_to_buffer(expansion_buffer);
         }

     <<EOF>>         {
         if ( expansion_buffer )
             {
             // We were doing an expansion, return to where
             // we were.
             yy_switch_to_buffer(main_buffer);
             yy_delete_buffer(expansion_buffer);
             expansion_buffer = 0;
             }
         else
             yyterminate();
         }

   You will probably want a stack of expansion buffers to allow nested
macros. From the above, though, the idea should be clear.


File: flex.info, Node: How can I build a two-pass scanner?, Next: How do I match any string not matched in the preceding rules?, Prev: How can I expand macros in the input?, Up: FAQ

How can I build a two-pass scanner?
===================================

One way to do it is to filter the first pass to a temporary file, then
process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.

   When you need to look ahead far forward like this, it almost always
means that the right solution is to build a parse tree of the entire
input, then walk it after the parse in order to generate the output.
In a sense, this is a two-pass approach, once through the text and once
through the parse tree, but the performance hit for the latter is
usually an order of magnitude smaller, since everything is already
classified, in binary format, and residing in memory.


File: flex.info, Node: How do I match any string not matched in the preceding rules?, Next: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Prev: How can I build a two-pass scanner?, Up: FAQ

How do I match any string not matched in the preceding rules?
=============================================================

One way to assign precedence is to place the more specific rules
first. If two rules would match the same input (same sequence of
characters) then the first rule listed in the `flex' input wins, e.g.,

     %%
     foo[a-zA-Z_]+    return FOO_ID;
     bar[a-zA-Z_]+    return BAR_ID;
     [a-zA-Z_]+       return GENERIC_ID;

   Note that the rule `[a-zA-Z_]+' must come *after* the others. It
will match the same amount of text as the more specific rules, and in
that case the `flex' scanner will pick the first rule listed in your
scanner as the one to match.


File: flex.info, Node: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Next: Is there a way to make flex treat NULL like a regular character?, Prev: How do I match any string not matched in the preceding rules?, Up: FAQ

I am trying to port code from AT&T lex that uses yysptr and yysbuf.
===================================================================

Those are internal variables pointing into the AT&T scanner's input
buffer. I imagine they're being manipulated in user versions of the
`input()' and `unput()' functions. If so, what you need to do is
analyze those functions to figure out what they're doing, and then
replace `input()' with an appropriate definition of `YY_INPUT'. You
shouldn't need to (and must not) replace `flex''s `unput()' function.


File: flex.info, Node: Is there a way to make flex treat NULL like a regular character?, Next: Whenever flex can not match the input it says "flex scanner jammed"., Prev: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Up: FAQ

Is there a way to make flex treat NULL like a regular character?
================================================================

Yes, `\0' and `\x00' should both do the trick.
Perhaps you have an
ancient version of `flex'. The latest release is version 2.5.39.


File: flex.info, Node: Whenever flex can not match the input it says "flex scanner jammed"., Next: Why doesn't flex have non-greedy operators like perl does?, Prev: Is there a way to make flex treat NULL like a regular character?, Up: FAQ

Whenever flex can not match the input it says "flex scanner jammed".
====================================================================

You need to add a rule that matches the otherwise-unmatched text, e.g.,

     %option yylineno
     %%
     [[a bunch of rules here]]

     .    printf("bad input character '%s' at line %d\n", yytext, yylineno);

   See `%option default' for more information.


File: flex.info, Node: Why doesn't flex have non-greedy operators like perl does?, Next: Memory leak - 16386 bytes allocated by malloc., Prev: Whenever flex can not match the input it says "flex scanner jammed"., Up: FAQ

Why doesn't flex have non-greedy operators like perl does?
==========================================================

A DFA can do a non-greedy match by stopping the first time it enters an
accepting state, instead of consuming input until it determines that no
further matching is possible (a "jam" state). This is actually easier
to implement than longest leftmost match (which flex does).

   But it's also much less useful than longest leftmost match. In
general, when you find yourself wishing for non-greedy matching, that's
usually a sign that you're trying to make the scanner do some parsing.
That's generally the wrong approach, since a scanner lacks the power to
do a decent job. It is better either to introduce a separate parser, or
to split the scanner into multiple scanners using (exclusive) start
conditions.

   You might have a separate start state once you've seen the `BEGIN'.
In that state, you might then have a regex that will match `END' (to
kick you out of the state), and perhaps `(.|\n)' to get a single
character within the chunk ...

   This approach also has much better error-reporting properties.


File: flex.info, Node: Memory leak - 16386 bytes allocated by malloc., Next: How do I track the byte offset for lseek()?, Prev: Why doesn't flex have non-greedy operators like perl does?, Up: FAQ

Memory leak - 16386 bytes allocated by malloc.
==============================================

UPDATED 2002-07-10: As of `flex' version 2.5.9, this leak means that
you did not call `yylex_destroy()'. If you are using an earlier version
of `flex', then read on.

   The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the
read-buffer, and about 40 for `struct yy_buffer_state' (depending upon
alignment). The leak is in the non-reentrant C scanner only (NOT in the
reentrant scanner, NOT in the C++ scanner). Since `flex' doesn't know
when you are done, the buffer is never freed.

   However, the leak won't multiply since the buffer is reused no
matter how many times you call `yylex()'.

   If you want to reclaim the memory when you are completely done
scanning, then you might try this:

     /* For non-reentrant C scanner only. */
     yy_delete_buffer(YY_CURRENT_BUFFER);
     yy_init = 1;

   Note: `yy_init' is an "internal variable", and hasn't been tested in
this situation. It is possible that some other globals may need
resetting as well.


File: flex.info, Node: How do I track the byte offset for lseek()?, Next: How do I use my own I/O classes in a C++ scanner?, Prev: Memory leak - 16386 bytes allocated by malloc., Up: FAQ

How do I track the byte offset for lseek()?
===========================================

     > We thought that it would be possible to have this number through the
     > evaluation of the following expression:
     >
     > seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf

   While this is the right idea, it has two problems. The first is that
it's possible that `flex' will request less than `YY_READ_BUF_SIZE'
during an invocation of `YY_INPUT' (or that your input source will
return less even though `YY_READ_BUF_SIZE' bytes were requested). The
second problem is that when refilling its internal buffer, `flex' keeps
some characters from the previous buffer (because usually it's in the
middle of a match, and needs those characters to construct `yytext' for
the match once it's done). Because of this, `yy_c_buf_p -
YY_CURRENT_BUFFER->yy_ch_buf' won't be exactly the number of characters
already read from the current buffer.

   An alternative solution is to count the number of characters you've
matched since starting to scan. This can be done by using
`YY_USER_ACTION'. For example,

     #define YY_USER_ACTION num_chars += yyleng;

   (You need to be careful to update your bookkeeping if you use
`yymore()', `yyless()', `unput()', or `input()'.)


File: flex.info, Node: How do I use my own I/O classes in a C++ scanner?, Next: How do I skip as many chars as possible?, Prev: How do I track the byte offset for lseek()?, Up: FAQ

How do I use my own I/O classes in a C++ scanner?
=================================================

When the flex C++ scanning class rewrite finally happens, then this
sort of thing should become much easier.

   You can do this by passing the various functions (such as
`LexerInput()' and `LexerOutput()') NULL `iostream*''s, and then
dealing with your own I/O classes surreptitiously (i.e., stashing them
in special member variables). This works because the only assumption
about the lexer regarding what's done with the iostream's is that
they're ultimately passed to `LexerInput()' and `LexerOutput()', which
then do whatever is necessary with them.


File: flex.info, Node: How do I skip as many chars as possible?, Next: deleteme00, Prev: How do I use my own I/O classes in a C++ scanner?, Up: FAQ

How do I skip as many chars as possible?
========================================

How do I skip as many chars as possible - without interfering with the
other patterns?

   In the example below, we want to skip over characters until we see
the phrase "endskip". The following will _NOT_ work correctly (do you
see why not?)

     /* INCORRECT SCANNER */
     %x SKIP
     %%
     <INITIAL>startskip   BEGIN(SKIP);
     ...
     <SKIP>"endskip"      BEGIN(INITIAL);
     <SKIP>.*             ;

   The problem is that the pattern .* will eat up the word "endskip."
The simplest (but slow) fix is:

     <SKIP>"endskip"      BEGIN(INITIAL);
     <SKIP>.              ;

   A faster fix involves making the second rule match more, without
making it match "endskip" plus something else. So for example:

     <SKIP>"endskip"      BEGIN(INITIAL);
     <SKIP>[^e]+          ;
     <SKIP>.              ;/* so you eat up e's, too */


File: flex.info, Node: deleteme00, Next: Are certain equivalent patterns faster than others?, Prev: How do I skip as many chars as possible?, Up: FAQ

deleteme00
==========

     QUESTION:
     When was flex born?

     Vern Paxson took over
     the Software Tools lex project from Jef Poskanzer in 1982. At that point it
     was written in Ratfor.
     Around 1987 or so, Paxson translated it into C, and
     a legend was born :-).


File: flex.info, Node: Are certain equivalent patterns faster than others?, Next: Is backing up a big deal?, Prev: deleteme00, Up: FAQ

Are certain equivalent patterns faster than others?
===================================================

     To: Adoram Rogel <adoram@orna.hybridge.com>
     Subject: Re: Flex 2.5.2 performance questions
     In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
     Date: Wed, 18 Sep 96 10:51:02 PDT
     From: Vern Paxson <vern>

     [Note, the most recent flex release is 2.5.4, which you can get from
     ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.]

     > 1. Using the pattern
     >    ([Ff](oot)?)?[Nn](ote)?(\.)?
     >    instead of
     >    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
     >    (in a very complicated flex program) caused the program to slow from
     >    300K+/min to 100K/min (no other changes were done).

     These two are not equivalent. For example, the first can match "footnote."
     but the second can only match "footnote". This is almost certainly the
     cause of the discrepancy - the slower scanner run is matching more tokens,
     and/or having to do more backing up.

     > 2. Which of these two are better: [Ff]oot or (F|f)oot ?

     From a performance point of view, they're equivalent (modulo presumably
     minor effects such as memory cache hit rates; and the presence of trailing
     context, see below). From a space point of view, the first is slightly
     preferable.

     > 3. I have a pattern that look like this:
     >    pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd)
     >
     >    running yet another complicated program that includes the following rule:
     >    <snext>{and}/{no4}{bb}{pats}
     >
     >    gets me to "too complicated - over 32,000 states"...

     I can't tell from this example whether the trailing context is variable-length
     or fixed-length (it could be the latter if {and} is fixed-length). If it's
     variable length, which flex -p will tell you, then this reflects a basic
     performance problem, and if you can eliminate it by restructuring your
     scanner, you will see significant improvement.

     >    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
     >    10 patterns and changed the rule to be 5 rules.
     >    This did compile, but what is the rule of thumb here ?

     The rule is to avoid trailing context other than fixed-length, in which for
     a/b, either the 'a' pattern or the 'b' pattern has a fixed length. Use
     of the '|' operator automatically makes the pattern variable length, so in
     this case '[Ff]oot' is preferred to '(F|f)oot'.

     > 4. I changed a rule that looked like this:
     >    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
     >
     >    to the next 2 rules:
     >    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
     >    <snext8>{and}{bb}/{ROMAN} { BEGIN...
     >
     >    Again, I understand the using [^...] will cause a great performance loss

     Actually, it doesn't cause any sort of performance loss. It's a surprising
     fact about regular expressions that they always match in linear time
     regardless of how complex they are.

     >    but are there any specific rules about it ?

     See the "Performance Considerations" section of the man page, and also
     the example in MISC/fastwc/.

     Vern


File: flex.info, Node: Is backing up a big deal?, Next: Can I fake multi-byte character support?, Prev: Are certain equivalent patterns faster than others?, Up: FAQ

Is backing up a big deal?
=========================

     To: Adoram Rogel <adoram@hybridge.com>
     Subject: Re: Flex 2.5.2 performance questions
     In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
     Date: Thu, 19 Sep 96 09:58:00 PDT
     From: Vern Paxson <vern>

     > a lot about the backing up problem.
     > I believe that there lies my biggest problem, and I'll try to improve
     > it.

     Since you have variable trailing context, this is a bigger performance
     problem. Fixing it is usually easier than fixing backing up, which in a
     complicated scanner (yours seems to fit the bill) can be extremely
     difficult to do correctly.

     You also don't mention what flags you are using for your scanner.
     -f makes a large speed difference, and -Cfe buys you nearly as much
     speed but the resulting scanner is considerably smaller.

     > I have an | operator in {and} and in {pats} so both of them are variable
     > length.

     -p should have reported this.

     > Is changing one of them to fixed-length is enough ?

     Yes.

     > Is it possible to change the 32,000 states limit ?

     Yes. I've appended instructions on how. Before you make this change,
     though, you should think about whether there are ways to fundamentally
     simplify your scanner - those are certainly preferable!

     Vern

     To increase the 32K limit (on a machine with 32 bit integers), you increase
     the magnitude of the following in flexdef.h:

     #define JAMSTATE -32766 /* marks a reference to the state that always jams */
     #define MAXIMUM_MNS 31999
     #define BAD_SUBSCRIPT -32767
     #define MAX_SHORT 32700

     Adding a 0 or two after each should do the trick.


File: flex.info, Node: Can I fake multi-byte character support?, Next: deleteme01, Prev: Is backing up a big deal?, Up: FAQ

Can I fake multi-byte character support?
========================================

     To: Heeman_Lee@hp.com
     Subject: Re: flex - multi-byte support?
     In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
     Date: Fri, 04 Oct 1996 11:42:18 PDT
     From: Vern Paxson <vern>

     > I assume as long as my *.l file defines the
     > range of expected character code values (in octal format), flex will
     > scan the file and read multi-byte characters correctly. But I have no
     > confidence in this assumption.

     Your lack of confidence is justified - this won't work.

     Flex has in it a widespread assumption that the input is processed
     one byte at a time. Fixing this is on the to-do list, but is involved,
     so it won't happen any time soon. In the interim, the best I can suggest
     (unless you want to try fixing it yourself) is to write your rules in
     terms of pairs of bytes, using definitions in the first section:

         X    \xfe\xc2
         ...
         %%
         foo{X}bar    found_foo_fe_c2_bar();

     etc. Definitely a pain - sorry about that.

     By the way, the email address you used for me is ancient, indicating you
     have a very old version of flex. You can get the most recent, 2.5.4, from
     ftp.ee.lbl.gov.

     Vern


File: flex.info, Node: deleteme01, Next: Can you discuss some flex internals?, Prev: Can I fake multi-byte character support?, Up: FAQ

deleteme01
==========

     To: moleary@primus.com
     Subject: Re: Flex / Unicode compatibility question
     In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT.
     Date: Tue, 22 Oct 1996 11:06:13 PDT
     From: Vern Paxson <vern>

     Unfortunately flex at the moment has a widespread assumption within it
     that characters are processed 8 bits at a time. I don't see any easy
     fix for this (other than writing your rules in terms of double characters -
     a pain). I also don't know of a wider lex, though you might try surfing
     the Plan 9 stuff because I know it's a Unicode system, and also the PCCT
     toolkit (try searching say Alta Vista for "Purdue Compiler Construction
     Toolkit").

     Fixing flex to handle wider characters is on the long-term to-do list.
     But since flex is a strictly spare-time project these days, this probably
     won't happen for quite a while, unless someone else does it first.

     Vern


File: flex.info, Node: Can you discuss some flex internals?, Next: unput() messes up yy_at_bol, Prev: deleteme01, Up: FAQ

Can you discuss some flex internals?
====================================

     To: Johan Linde <jl@theophys.kth.se>
     Subject: Re: translation of flex
     In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST.
     Date: Mon, 11 Nov 1996 10:33:50 PST
     From: Vern Paxson <vern>

     > I'm working for the Swedish team translating GNU program, and I'm currently
     > working with flex. I have a few questions about some of the messages which
     > I hope you can answer.

     All of the things you're wondering about, by the way, concerning flex
     internals - probably the only person who understands what they mean in
     English is me! So I wouldn't worry too much about getting them right.
     That said ...

     > #: main.c:545
     > msgid " %d protos created\n"
     >
     > Does proto mean prototype?

     Yes - prototypes of state compression tables.

     > #: main.c:539
     > msgid " %d/%d (peak %d) template nxt-chk entries created\n"
     >
     > Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
     > However, 'template next-check entries' doesn't make much sense to me. To be
     > able to find a good translation I need to know a little bit more about it.

     There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
     scanner tables. It involves creating two pairs of tables. The first has
     "base" and "default" entries, the second has "next" and "check" entries.
     The "base" entry is indexed by the current state and yields an index into
     the next/check table.
     The "default" entry gives what to do if the state
     transition isn't found in next/check. The "next" entry gives the next
     state to enter, but only if the "check" entry verifies that this entry is
     correct for the current state. Flex creates templates of series of
     next/check entries and then encodes differences from these templates as a
     way to compress the tables.

     > #: main.c:533
     > msgid " %d/%d base-def entries created\n"
     >
     > The same problem here for 'base-def'.

     See above.

     Vern


File: flex.info, Node: unput() messes up yy_at_bol, Next: The | operator is not doing what I want, Prev: Can you discuss some flex internals?, Up: FAQ

unput() messes up yy_at_bol
===========================

     To: Xinying Li <xli@npac.syr.edu>
     Subject: Re: FLEX ?
     In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
     Date: Wed, 13 Nov 1996 19:51:54 PST
     From: Vern Paxson <vern>

     > "unput()" them to input flow, question occurs. If I do this after I scan
     > a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
     > means the carriage flag has gone.

     You can control this by calling yy_set_bol(). It's described in the manual.

     > And if in pre-reading it goes to the end of file, is anything done
     > to control the end of curren buffer and end of file?

     No, there's no way to put back an end-of-file.

     > By the way I am using flex 2.5.2 and using the "-l".

     The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and
     2.5.3. You can get it from ftp.ee.lbl.gov.

     Vern


File: flex.info, Node: The | operator is not doing what I want, Next: Why can't flex understand this variable trailing context pattern?, Prev: unput() messes up yy_at_bol, Up: FAQ

The | operator is not doing what I want
=======================================

     To: Alain.ISSARD@st.com
     Subject: Re: Start condition with FLEX
     In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
     Date: Mon, 18 Nov 1996 10:41:34 PST
     From: Vern Paxson <vern>

     > I am not able to use the start condition scope and to use the | (OR) with
     > rules having start conditions.

     The problem is that if you use '|' as a regular expression operator, for
     example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
     any blanks around it. If you instead want the special '|' *action* (which
     from your scanner appears to be the case), which is a way of giving two
     different rules the same action:

         foo    |
         bar    matched_foo_or_bar();

     then '|' *must* be separated from the first rule by whitespace and *must*
     be followed by a new line. You *cannot* write it as:

         foo | bar    matched_foo_or_bar();

     even though you might think you could because yacc supports this syntax.
     The reason for this unfortunate incompatibility is historical, but it's
     unlikely to be changed.

     Your problems with start condition scope are simply due to syntax errors
     from your use of '|' later confusing flex.

     Let me know if you still have problems.

     Vern


File: flex.info, Node: Why can't flex understand this variable trailing context pattern?, Next: The ^ operator isn't working, Prev: The | operator is not doing what I want, Up: FAQ

Why can't flex understand this variable trailing context pattern?
=================================================================

     To: Gregory Margo <gmargo@newton.vip.best.com>
     Subject: Re: flex-2.5.3 bug report
     In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
     Date: Sat, 23 Nov 1996 17:07:32 PST
     From: Vern Paxson <vern>

     > Enclosed is a lex file that "real" lex will process, but I cannot get
     > flex to process it. Could you try it and maybe point me in the right direction?

     Your problem is that some of the definitions in the scanner use the '/'
     trailing context operator, and have it enclosed in ()'s. Flex does not
     allow this operator to be enclosed in ()'s because doing so allows undefined
     regular expressions such as "(a/b)+". So the solution is to remove the
     parentheses. Note that you must also be building the scanner with the -l
     option for AT&T lex compatibility. Without this option, flex automatically
     encloses the definitions in parentheses.

     Vern


File: flex.info, Node: The ^ operator isn't working, Next: Trailing context is getting confused with trailing optional patterns, Prev: Why can't flex understand this variable trailing context pattern?, Up: FAQ

The ^ operator isn't working
============================

     To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
     Subject: Re: Flex Bug ?
     In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
     Date: Tue, 26 Nov 1996 11:15:05 PST
     From: Vern Paxson <vern>

     > In my lexer code, i have the line :
     > ^\*.* { }
     >
     > Thus all lines starting with an astrix (*) are comment lines.
     > This does not work !

     I can't get this problem to reproduce - it works fine for me.
     Note
     though that if what you have is slightly different:

         COMMENT    ^\*.*
         %%
         {COMMENT}  { }

     then it won't work, because flex pushes back macro definitions enclosed
     in ()'s, so the rule becomes

         (^\*.*)    { }

     and now that the '^' operator is not at the immediate beginning of the
     line, it's interpreted as just a regular character. You can avoid this
     behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".

     Vern


File: flex.info, Node: Trailing context is getting confused with trailing optional patterns, Next: Is flex GNU or not?, Prev: The ^ operator isn't working, Up: FAQ

Trailing context is getting confused with trailing optional patterns
====================================================================

     To: Adoram Rogel <adoram@hybridge.com>
     Subject: Re: Flex 2.5.4 BOF ???
     In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST.
     Date: Wed, 27 Nov 1996 10:56:25 PST
     From: Vern Paxson <vern>

     > Organization(s)?/[a-z]
     >
     > This matched "Organizations" (looking in debug mode, the trailing s
     > was matched with trailing context instead of the optional (s) in the
     > end of the word.

     That should only happen with lex. Flex can properly match this pattern.
     (That might be what you're saying, I'm just not sure.)

     > Is there a way to avoid this dangerous trailing context problem ?

     Unfortunately, there's no easy way. On the other hand, I don't see why
     it should be a problem. Lex's matching is clearly wrong, and I'd hope
     that usually the intent remains the same as expressed with the pattern,
     so flex's matching will be correct.

     Vern


File: flex.info, Node: Is flex GNU or not?, Next: ERASEME53, Prev: Trailing context is getting confused with trailing optional patterns, Up: FAQ

Is flex GNU or not?
===================

     To: Cameron MacKinnon <mackin@interlog.com>
     Subject: Re: Flex documentation bug
     In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST.
     Date: Sun, 01 Dec 1996 22:29:39 PST
     From: Vern Paxson <vern>

     > I'm not sure how or where to submit bug reports (documentation or
     > otherwise) for the GNU project stuff ...

     Well, strictly speaking flex isn't part of the GNU project. They just
     distribute it because no one's written a decent GPL'd lex replacement.
     So you should send bugs directly to me. Those sent to the GNU folks
     sometimes find their way to me, but some may drop between the cracks.

     > In GNU Info, under the section 'Start Conditions', and also in the man
     > page (mine's dated April '95) is a nice little snippet showing how to
     > parse C quoted strings into a buffer, defined to be MAX_STR_CONST in
     > size. Unfortunately, no overflow checking is ever done ...

     This is already mentioned in the manual:

         Finally, here's an example of how to match C-style quoted
         strings using exclusive start conditions, including expanded
         escape sequences (but not including checking for a string
         that's too long):

     The reason for not doing the overflow checking is that it will needlessly
     clutter up an example whose main purpose is just to demonstrate how to
     use flex.

     The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov.

     Vern


File: flex.info, Node: ERASEME53, Next: I need to scan if-then-else blocks and while loops, Prev: Is flex GNU or not?, Up: FAQ

ERASEME53
=========

     To: tsv@cs.UManitoba.CA
     Subject: Re: Flex (reg)..
     In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST.
     Date: Thu, 06 Mar 1997 15:54:19 PST
     From: Vern Paxson <vern>

     > [:alpha:] ([:alnum:] | \\_)*

     If your rule really has embedded blanks as shown above, then it won't
     work, as the first blank delimits the rule from the action. (It wouldn't
     even compile ...) You need instead:

       [:alpha:]([:alnum:]|\\_)*

     and that should work fine - there's no restriction on what can go inside
     of ()'s except for the trailing context operator, '/'.

                     Vern


File: flex.info, Node: I need to scan if-then-else blocks and while loops, Next: ERASEME55, Prev: ERASEME53, Up: FAQ

I need to scan if-then-else blocks and while loops
==================================================

     To: "Mike Stolnicki" <mstolnic@ford.com>
     Subject: Re: FLEX help
     In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT.
     Date: Fri, 30 May 1997 10:46:35 PDT
     From: Vern Paxson <vern>

     > We'd like to add "if-then-else", "while", and "for" statements to our
     > language ...
     > We've investigated many possible solutions. The one solution that seems
     > the most reasonable involves knowing the position of a TOKEN in yyin.

     I strongly advise you to build a parse tree (abstract syntax tree)
     and loop over that instead. You'll find this has major benefits in keeping
     your interpreter simple and extensible.

     That said, the functionality you mention for get_position and set_position
     has been on the to-do list for a while. As flex is a purely spare-time
     project for me, no guarantees when this will be added (in particular, it
     for sure won't be for many months to come).
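
Editorial sketch, not part of the original message: the parse-tree approach
recommended above might look roughly like the following in C++. Every name
here (Env, Node, Action, While, run_demo) is hypothetical and chosen only for
illustration; a real interpreter would build the tree from its parser actions.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical interpreter state for a toy language; nothing here is flex API.
struct Env {
    int counter = 0;
};

// Base class for every statement node in the tree.
struct Node {
    virtual ~Node() = default;
    virtual void exec(Env &env) const = 0;
};

// Leaf statement: runs a callback against the environment.
struct Action : Node {
    std::function<void(Env &)> fn;
    explicit Action(std::function<void(Env &)> f) : fn(std::move(f)) {}
    void exec(Env &env) const override { fn(env); }
};

// "while" statement: each iteration re-walks the body subtree, so the
// scanner never has to seek backwards in yyin.
struct While : Node {
    std::function<bool(const Env &)> cond;
    std::vector<std::unique_ptr<Node>> body;
    explicit While(std::function<bool(const Env &)> c) : cond(std::move(c)) {}
    void exec(Env &env) const override {
        while (cond(env))
            for (const auto &stmt : body)
                stmt->exec(env);
    }
};

// Builds the tree a parser might produce for
//   while (counter < 3) counter = counter + 1;
// executes it, and returns the final counter value.
int run_demo() {
    While loop([](const Env &e) { return e.counter < 3; });
    loop.body.push_back(
        std::unique_ptr<Node>(new Action([](Env &e) { e.counter += 1; })));
    Env env;
    loop.exec(env);
    return env.counter;
}
```

Because the loop body is a subtree rather than a span of input text, the
input is scanned only once, and if-then-else can be added as one more node
type along the same lines.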

                     Vern


File: flex.info, Node: ERASEME55, Next: ERASEME56, Prev: I need to scan if-then-else blocks and while loops, Up: FAQ

ERASEME55
=========

     To: Colin Paul Adams <colin@colina.demon.co.uk>
     Subject: Re: Flex C++ classes and Bison
     In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT.
     Date: Fri, 15 Aug 1997 10:48:19 PDT
     From: Vern Paxson <vern>

     > #define YY_DECL int yylex (YYSTYPE *lvalp, struct parser_control
     > *parm)
     >
     > I have been trying to get this to work as a C++ scanner, but it does
     > not appear to be possible (warning that it matches no declarations in
     > yyFlexLexer, or something like that).
     >
     > Is this supposed to be possible, or is it being worked on (I DID
     > notice the comment that scanner classes are still experimental, so I'm
     > not too hopeful)?

     What you need to do is derive a subclass from yyFlexLexer that provides
     the above yylex() method, squirrels away lvalp and parm into member
     variables, and then invokes yyFlexLexer::yylex() to do the regular scanning.

                     Vern


File: flex.info, Node: ERASEME56, Next: ERASEME57, Prev: ERASEME55, Up: FAQ

ERASEME56
=========

     To: Mikael.Latvala@lmf.ericsson.se
     Subject: Re: Possible mistake in Flex v2.5 document
     In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
     Date: Fri, 05 Sep 1997 10:01:54 PDT
     From: Vern Paxson <vern>

     > In that example you show how to count comment lines when using
     > C style /* ... */ comments. My question is, shouldn't you take into
     > account a scenario where end of a comment marker occurs inside
     > character or string literals?

     The scanner certainly needs to also scan character and string literals.
     However it does that (there's an example in the man page for strings), the
     lexer will recognize the beginning of the literal before it runs across the
     embedded "/*". Consequently, it will finish scanning the literal before it
     even considers the possibility of matching "/*".

     Example:

       '([^']*|{ESCAPE_SEQUENCE})'

     will match all the text between the ''s (inclusive). So the lexer
     considers this as a token beginning at the first ', and doesn't even
     attempt to match other tokens inside it.

     I think this subtlety is not worth putting in the manual, as I suspect
     it would confuse more people than it would enlighten.

                     Vern


File: flex.info, Node: ERASEME57, Next: Is there a repository for flex scanners?, Prev: ERASEME56, Up: FAQ

ERASEME57
=========

     To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
     Subject: Re: flex limitations
     In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
     Date: Mon, 08 Sep 1997 11:38:08 PDT
     From: Vern Paxson <vern>

     > %%
     > [a-zA-Z]+       /* skip a line */
     >                 { printf("got %s\n", yytext); }
     > %%

     What version of flex are you using? If I feed this to 2.5.4, it complains:

       "bug.l", line 5: EOF encountered inside an action
       "bug.l", line 5: unrecognized rule
       "bug.l", line 5: fatal parse error

     Not the world's greatest error message, but it manages to flag the problem.

     (With the introduction of start condition scopes, flex can't accommodate
     an action on a separate line, since it's ambiguous with an indented rule.)

     You can get 2.5.4 from ftp.ee.lbl.gov.

                     Vern


File: flex.info, Node: Is there a repository for flex scanners?, Next: How can I conditionally compile or preprocess my flex input file?, Prev: ERASEME57, Up: FAQ

Is there a repository for flex scanners?
========================================

Not that we know of. You might try asking on comp.compilers.


File: flex.info, Node: How can I conditionally compile or preprocess my flex input file?, Next: Where can I find grammars for lex and yacc?, Prev: Is there a repository for flex scanners?, Up: FAQ

How can I conditionally compile or preprocess my flex input file?
=================================================================

Flex doesn't have a preprocessor like C does. You might try using m4,
or the C preprocessor plus a sed script to clean up the result.


File: flex.info, Node: Where can I find grammars for lex and yacc?, Next: I get an end-of-buffer message for each character scanned., Prev: How can I conditionally compile or preprocess my flex input file?, Up: FAQ

Where can I find grammars for lex and yacc?
===========================================

In the sources for flex and bison.


File: flex.info, Node: I get an end-of-buffer message for each character scanned., Next: unnamed-faq-62, Prev: Where can I find grammars for lex and yacc?, Up: FAQ

I get an end-of-buffer message for each character scanned.
==========================================================

This will happen if your LexerInput() function returns only one
character at a time, which can happen either if your scanner is
"interactive", or if the streams library on your platform always
returns 1 for yyin->gcount().

   Solution: override LexerInput() with a version that returns whole
buffers.


File: flex.info, Node: unnamed-faq-62, Next: unnamed-faq-63, Prev: I get an end-of-buffer message for each character scanned., Up: FAQ

unnamed-faq-62
==============

     To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
     Subject: Re: Flex maximums
     In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
     Date: Mon, 17 Nov 1997 17:16:15 PST
     From: Vern Paxson <vern>

     > I took a quick look into the flex-sources and altered some #defines in
     > flexdef.h:
     >
     >  #define INITIAL_MNS 64000
     >  #define MNS_INCREMENT 1024000
     >  #define MAXIMUM_MNS 64000

     The things to fix are to add a couple of zeroes to:

     #define JAMSTATE -32766 /* marks a reference to the state that always jams */
     #define MAXIMUM_MNS 31999
     #define BAD_SUBSCRIPT -32767
     #define MAX_SHORT 32700

     and, if you get complaints about too many rules, make the following change too:

     #define YY_TRAILING_MASK 0x200000
     #define YY_TRAILING_HEAD_MASK 0x400000

     - Vern


File: flex.info, Node: unnamed-faq-63, Next: unnamed-faq-64, Prev: unnamed-faq-62, Up: FAQ

unnamed-faq-63
==============

     To: jimmey@lexis-nexis.com (Jimmey Todd)
     Subject: Re: FLEX question regarding istream vs ifstream
     In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST.
     Date: Mon, 15 Dec 1997 13:21:35 PST
     From: Vern Paxson <vern>

     > stdin_handle = YY_CURRENT_BUFFER;
     > ifstream fin( "aFile" );
     > yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) );
     >
     > What I'm wanting to do, is pass the contents of a file thru one set
     > of rules and then pass stdin thru another set... It works great if, I
     > don't use the C++ classes. But since everything else that I'm doing is
     > in C++, I thought I'd be consistent.
     >
     > The problem is that 'yy_create_buffer' is expecting an istream* as its
     > first argument (as stated in the man page). However, fin is an ifstream
     > object. Any ideas on what I might be doing wrong? Any help would be
     > appreciated. Thanks!!

     You need to pass &fin, to turn it into an ifstream* instead of an ifstream.
     Then its type will be compatible with the expected istream*, because ifstream
     is derived from istream.

                     Vern


File: flex.info, Node: unnamed-faq-64, Next: unnamed-faq-65, Prev: unnamed-faq-63, Up: FAQ

unnamed-faq-64
==============

     To: Enda Fadian <fadiane@piercom.ie>
     Subject: Re: Question related to Flex man page?
     In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST.
     Date: Tue, 16 Dec 1997 14:17:09 PST
     From: Vern Paxson <vern>

     > Can you explain to me what is meant by a long-jump in relation to flex?

     Using the longjmp() function while inside yylex() or a routine called by it.

     > what is the flex activation frame.

     Just yylex()'s stack frame.

     > As far as I can see yyrestart will bring me back to the start of the input
     > file and using flex++ is not really an option!

     No, yyrestart() doesn't imply a rewind, even though its name might sound
     like it does. It tells the scanner to flush its internal buffers and
     start reading from the given file at its present location.

                     Vern


File: flex.info, Node: unnamed-faq-65, Next: unnamed-faq-66, Prev: unnamed-faq-64, Up: FAQ

unnamed-faq-65
==============

     To: hassan@larc.info.uqam.ca (Hassan Alaoui)
     Subject: Re: Need urgent Help
     In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST.
     Date: Sun, 21 Dec 1997 21:30:46 PST
     From: Vern Paxson <vern>

     > /usr/lib/yaccpar: In function `int yyparse()':
     > /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)'
     >
     > ld: Undefined symbol
     >    _yylex
     >    _yyparse
     >    _yyin

     This is a known problem with Solaris C++ (and/or Solaris yacc). I believe
     the fix is to explicitly insert some 'extern "C"' statements for the
     corresponding routines/symbols.
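
Editorial sketch, not part of the original message: the 'extern "C"'
declarations suggested above might look roughly like this in C++. It assumes
the scanner and parser were compiled as C. The stub definitions and the
parse_stream() helper are hypothetical and exist only so that the fragment
links on its own; in a real build, yylex, yyparse and yyin come from the
lex- and yacc-generated objects.

```cpp
#include <cstdio>

// Declarations a C++ translation unit needs in order to refer to symbols
// that were compiled with C linkage (no C++ name mangling).
extern "C" {
    int yylex(void);
    int yyparse(void);
    extern FILE *yyin;
}

// Stand-in definitions (assumption: illustration only). In a real build
// these are supplied by the generated scanner and parser, not by this file.
extern "C" {
    FILE *yyin = nullptr;
    int yylex(void) { return 0; }          // 0 signals end of input
    int yyparse(void) { return yylex(); }  // pretend parse: succeeds at EOF
}

// Hypothetical driver: point the scanner at a stream, then run the parser.
int parse_stream(FILE *in) {
    yyin = in;
    return yyparse();
}
```

With declarations like these in scope, the C++ compiler emits references to
the unmangled names, which is what the linker was failing to find.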

                     Vern


File: flex.info, Node: unnamed-faq-66, Next: unnamed-faq-67, Prev: unnamed-faq-65, Up: FAQ

unnamed-faq-66
==============

     To: mc0307@mclink.it
     Cc: gnu@prep.ai.mit.edu
     Subject: Re: [mc0307@mclink.it: Help request]
     In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST.
     Date: Sun, 21 Dec 1997 22:33:37 PST
     From: Vern Paxson <vern>

     > This is my definition for float and integer types:
     > . . .
     > NZD [1-9]
     > ...
     > I've tested my program on other lex version (on UNIX Sun Solaris and HP
     > UNIX) and it work well, so I think that my definitions are correct.
     > There are any differences between Lex and Flex?

     There are indeed differences, as discussed in the man page. The one
     you are probably running into is that when flex expands a name definition,
     it puts parentheses around the expansion, while lex does not. There's
     an example in the man page of how this can lead to different matching.
     Flex's behavior complies with the POSIX standard (or at least with the
     last POSIX draft I saw).

                     Vern


File: flex.info, Node: unnamed-faq-67, Next: unnamed-faq-68, Prev: unnamed-faq-66, Up: FAQ

unnamed-faq-67
==============

     To: hassan@larc.info.uqam.ca (Hassan Alaoui)
     Subject: Re: Thanks
     In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST.
     Date: Mon, 22 Dec 1997 14:35:05 PST
     From: Vern Paxson <vern>

     > Thank you very much for your help. I compile and link well with C++ while
     > declaring 'yylex ...' extern, But a little problem remains. I get a
     > segmentation fault when executing ( I linked with lfl library) while it
     > works well when using LEX instead of flex. Do you have some ideas about the
     > reason for this ?

     The one possible reason for this that comes to mind is if you've defined
     yytext as "extern char yytext[]" (which is what lex uses) instead of
     "extern char *yytext" (which is what flex uses). If it's not that, then
     I'm afraid I don't know what the problem might be.

                     Vern


File: flex.info, Node: unnamed-faq-68, Next: unnamed-faq-69, Prev: unnamed-faq-67, Up: FAQ

unnamed-faq-68
==============

     To: "Bart Niswonger" <NISWONGR@almaden.ibm.com>
     Subject: Re: flex 2.5: c++ scanners & start conditions
     In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST.
     Date: Tue, 06 Jan 1998 19:19:30 PST
     From: Vern Paxson <vern>

     > The problem is that when I do this (using %option c++) start
     > conditions seem to not apply.

     The BEGIN macro modifies the yy_start variable. For C scanners, this
     is a static with scope visible through the whole file. For C++ scanners,
     it's a member variable, so it only has visible scope within a member
     function. Your lexbegin() routine is not a member function when you
     build a C++ scanner, so it's not modifying the correct yy_start. The
     diagnostic that indicates this is that you found you needed to add
     a declaration of yy_start in order to get your scanner to compile when
     using C++; instead, the correct fix is to make lexbegin() a member
     function (by deriving from yyFlexLexer).

                     Vern


File: flex.info, Node: unnamed-faq-69, Next: unnamed-faq-70, Prev: unnamed-faq-68, Up: FAQ

unnamed-faq-69
==============

     To: "Boris Zinin" <boris@ippe.rssi.ru>
     Subject: Re: current position in flex buffer
     In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST.
     Date: Mon, 12 Jan 1998 12:03:15 PST
     From: Vern Paxson <vern>

     > The problem is how to determine the current position in flex active
     > buffer when a rule is matched....

     You will need to keep track of this explicitly, such as by redefining
     YY_USER_ACTION to count the number of characters matched.

     The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov.

                     Vern


File: flex.info, Node: unnamed-faq-70, Next: unnamed-faq-71, Prev: unnamed-faq-69, Up: FAQ

unnamed-faq-70
==============

     To: Bik.Dhaliwal@bis.org
     Subject: Re: Flex question
     In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST.
     Date: Tue, 27 Jan 1998 22:41:52 PST
     From: Vern Paxson <vern>

     > That requirement involves knowing
     > the character position at which a particular token was matched
     > in the lexer.

     The way you have to do this is by explicitly keeping track of where
     you are in the file, by counting the number of characters scanned
     for each token (available in yyleng). It may prove convenient to
     do this by redefining YY_USER_ACTION, as described in the manual.

                     Vern


File: flex.info, Node: unnamed-faq-71, Next: unnamed-faq-72, Prev: unnamed-faq-70, Up: FAQ

unnamed-faq-71
==============

     To: Vladimir Alexiev <vladimir@cs.ualberta.ca>
     Subject: Re: flex: how to control start condition from parser?
     In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST.
     Date: Tue, 27 Jan 1998 22:45:37 PST
     From: Vern Paxson <vern>

     > It seems useful for the parser to be able to tell the lexer about such
     > context dependencies, because then they don't have to be limited to
     > local or sequential context.

     One way to do this is to have the parser call a stub routine that's
     included in the scanner's .l file, and consequently that has access to
     BEGIN. The only ugliness is that the parser can't pass in the state
     it wants, because those aren't visible - but if you don't have many
     such states, then using a different set of names doesn't seem like
     too much of a burden.

     While generating a .h file like you suggest is certainly cleaner,
     flex development has come to a virtual stand-still :-(, so a workaround
     like the above is much more pragmatic than waiting for a new feature.

                     Vern


File: flex.info, Node: unnamed-faq-72, Next: unnamed-faq-73, Prev: unnamed-faq-71, Up: FAQ

unnamed-faq-72
==============

     To: Barbara Denny <denny@3com.com>
     Subject: Re: freebsd flex bug?
     In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST.
     Date: Fri, 30 Jan 1998 12:42:32 PST
     From: Vern Paxson <vern>

     > lex.yy.c:1996: parse error before `='

     This is the key, identifying this error. (It may help to pinpoint
     it by using flex -L, so it doesn't generate #line directives in its
     output.) I will bet you heavy money that you have a start condition
     name that is also a variable name, or something like that; flex spits
     out #define's for each start condition name, mapping them to a number,
     so you can wind up with:

       %x foo
       %%
       ...
       %%
       void bar()
       {
           int foo = 3;
       }

     and the penultimate line will turn into "int 1 = 3" after C preprocessing,
     since flex will put "#define foo 1" in the generated scanner.

                     Vern


File: flex.info, Node: unnamed-faq-73, Next: unnamed-faq-74, Prev: unnamed-faq-72, Up: FAQ

unnamed-faq-73
==============

     To: Maurice Petrie <mpetrie@infoscigroup.com>
     Subject: Re: Lost flex .l file
     In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST.
     Date: Mon, 02 Feb 1998 11:15:12 PST
     From: Vern Paxson <vern>

     > I am curious as to
     > whether there is a simple way to backtrack from the generated source to
     > reproduce the lost list of tokens we are searching on.

     In theory, it's straight-forward to go from the DFA representation
     back to a regular-expression representation - the two are isomorphic.
     In practice, a huge headache, because you have to unpack all the tables
     back into a single DFA representation, and then write a program to munch
     on that and translate it into an RE.

     Sorry for the less-than-happy news ...

                     Vern


File: flex.info, Node: unnamed-faq-74, Next: unnamed-faq-75, Prev: unnamed-faq-73, Up: FAQ

unnamed-faq-74
==============

     To: jimmey@lexis-nexis.com (Jimmey Todd)
     Subject: Re: Flex performance question
     In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
     Date: Thu, 19 Feb 1998 08:48:51 PST
     From: Vern Paxson <vern>

     > What I have found, is that the smaller the data chunk, the faster the
     > program executes. This is the opposite of what I expected. Should this be
     > happening this way?

     This is exactly what will happen if your input file has embedded NULs.
     From the man page:

       A final note: flex is slow when matching NUL's, particularly
       when a token contains multiple NUL's. It's best to write
       rules which match short amounts of text if it's anticipated
       that the text will often include NUL's.

     So that's the first thing to look for.

                     Vern


File: flex.info, Node: unnamed-faq-75, Next: unnamed-faq-76, Prev: unnamed-faq-74, Up: FAQ

unnamed-faq-75
==============

     To: jimmey@lexis-nexis.com (Jimmey Todd)
     Subject: Re: Flex performance question
     In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST.
     Date: Thu, 19 Feb 1998 15:42:25 PST
     From: Vern Paxson <vern>

     So there are several problems.

     First, to go fast, you want to match as much text as possible, which
     your scanners don't in the case that what they're scanning is *not*
     a <RN> tag. So you want a rule like:

       [^<]+

     Second, C++ scanners are particularly slow if they're interactive,
     which they are by default. Using -B speeds it up by a factor of 3-4
     on my workstation.

     Third, C++ scanners that use the istream interface are slow, because
     of how poorly implemented istream's are. I built two versions of
     the following scanner:

       %%
       .*\n
       .*
       %%

     and the C version inhales a 2.5MB file on my workstation in 0.8 seconds.
     The C++ istream version, using -B, takes 3.8 seconds.

                     Vern


File: flex.info, Node: unnamed-faq-76, Next: unnamed-faq-77, Prev: unnamed-faq-75, Up: FAQ

unnamed-faq-76
==============

     To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com>
     Subject: Re: FLEX 2.5 & THE YEAR 2000
     In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT.
     Date: Wed, 03 Jun 1998 10:22:26 PDT
     From: Vern Paxson <vern>

     > I am researching the Y2K problem with General Electric R&D
     > and need to know if there are any known issues concerning
     > the above mentioned software and Y2K regardless of version.

     There shouldn't be, all it ever does with the date is ask the system
     for it and then print it out.

                     Vern


File: flex.info, Node: unnamed-faq-77, Next: unnamed-faq-78, Prev: unnamed-faq-76, Up: FAQ

unnamed-faq-77
==============

     To: "Hans Dermot Doran" <htd@ibhdoran.com>
     Subject: Re: flex problem
     In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT.
     Date: Tue, 21 Jul 1998 14:23:34 PDT
     From: Vern Paxson <vern>

     > To overcome this, I gets() the stdin into a string and lex the string. The
     > string is lexed OK except that the end of string isn't lexed properly
     > (yy_scan_string()), that is the lexer doesn't recognise the end of string.

     Flex doesn't contain mechanisms for recognizing buffer endpoints. But if
     you use fgets instead (which you should anyway, to protect against buffer
     overflows), then the final \n will be preserved in the string, and you can
     scan that in order to find the end of the string.

                     Vern


File: flex.info, Node: unnamed-faq-78, Next: unnamed-faq-79, Prev: unnamed-faq-77, Up: FAQ

unnamed-faq-78
==============

     To: soumen@almaden.ibm.com
     Subject: Re: Flex++ 2.5.3 instance member vs. static member
     In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT.
     Date: Tue, 28 Jul 1998 01:10:34 PDT
     From: Vern Paxson <vern>

     > %{
     > int mylineno = 0;
     > %}
     > ws [ \t]+
     > alpha [A-Za-z]
     > dig [0-9]
     > %%
     >
     > Now you'd expect mylineno to be a member of each instance of class
     > yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to
     > indicate otherwise; unless I am missing something the declaration of
     > mylineno seems to be outside any class scope.
     >
     > How will this work if I want to run a multi-threaded application with each
     > thread creating a FlexLexer instance?

     Derive your own subclass and make mylineno a member variable of it.

                     Vern


File: flex.info, Node: unnamed-faq-79, Next: unnamed-faq-80, Prev: unnamed-faq-78, Up: FAQ

unnamed-faq-79
==============

     To: Adoram Rogel <adoram@hybridge.com>
     Subject: Re: More than 32K states change hangs
     In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT.
     Date: Tue, 04 Aug 1998 22:28:45 PDT
     From: Vern Paxson <vern>

     > Vern Paxson,
     >
     > I followed your advice, posted on Usenet by you, and emailed to me
     > personally by you, on how to overcome the 32K states limit. I'm running
     > on Linux machines.
     > I took the full source of version 2.5.4 and did the following changes in
     > flexdef.h:
     >  #define JAMSTATE -327660
     >  #define MAXIMUM_MNS 319990
     >  #define BAD_SUBSCRIPT -327670
     >  #define MAX_SHORT 327000
     >
     > and compiled.
     > All looked fine, including check and bigcheck, so I installed.

     Hmmm, you shouldn't increase MAX_SHORT, though looking through my email
     archives I see that I did indeed recommend doing so. Try setting it back
     to 32700; that should be enough that you no longer need -Ca. If it still
     hangs, then the interesting question is - where?

     > Compiling the same hanged program with a out-of-the-box (RedHat 4.2
     > distribution of Linux)
     > flex 2.5.4 binary works.

     Since Linux comes with source code, you should diff it against what
     you have to see what problems they missed.

     > Should I always compile with the -Ca option now ? even short and simple
     > filters ?

     No, definitely not. It's meant to be for those situations where you
     absolutely must squeeze every last cycle out of your scanner.

                     Vern


File: flex.info, Node: unnamed-faq-80, Next: unnamed-faq-81, Prev: unnamed-faq-79, Up: FAQ

unnamed-faq-80
==============

     To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com>
     Subject: Re: flex output for static code portion
     In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT.
     Date: Mon, 17 Aug 1998 23:57:42 PDT
     From: Vern Paxson <vern>

     > I would like to use flex under the hood to generate a binary file
     > containing the data structures that control the parse.

     This has been on the wish-list for a long time. In principle it's
     straight-forward - you redirect mkdata() et al's I/O to another file,
     and modify the skeleton to have a start-up function that slurps these
     into dynamic arrays. The concerns are (1) the scanner generation code
     is hairy and full of corner cases, so it's easy to get surprised when
     going down this path :-( ; and (2) being careful about buffering so
     that when the tables change you make sure the scanner starts in the
     correct state and reading at the right point in the input file.

     > I was wondering if you know of anyone who has used flex in this way.

     I don't - but it seems like a reasonable project to undertake (unlike
     numerous other flex tweaks :-).

                     Vern


File: flex.info, Node: unnamed-faq-81, Next: unnamed-faq-82, Prev: unnamed-faq-80, Up: FAQ

unnamed-faq-81
==============

     Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11])
         by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838
         for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT)
     Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2])
         by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694
         for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200
     Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200
     From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de>
     Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de>
     Subject: "flex scanner push-back overflow"
     To: vern@ee.lbl.gov
     Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST)
     Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE
     X-NoJunk: Do NOT send commercial mail, spam or ads to this address!
     X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/
     X-Mailer: ELM [version 2.4ME+ PL28 (25)]
     MIME-Version: 1.0
     Content-Type: text/plain; charset=US-ASCII
     Content-Transfer-Encoding: 7bit

     Hi Vern,

     Yesterday, I encountered a strange problem: I use the macro processor m4
     to include some lengthy lists into a .l file. Following is a flex macro
     definition that causes some serious pain in my neck:

     AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...])

     The complete list contains about 10kB. When I try to "flex" this file
     (on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased
     some of the predefined values in flexdef.h)) I get the error:

     myflex/flex -8 sentag.tmp.l
     flex scanner push-back overflow

     When I remove the slashes in the macro definition everything works fine.
     As I understand it, the double quotes escape the slash-character so it
     really means "/" and not "trailing context". Furthermore, I tried to
     escape the slashes with backslashes, but with no use, the same error message
     appeared when flexing the code.

     Do you have an idea what's going on here?
6487 6488 Greetings from Germany, 6489 Georg 6490 -- 6491 Georg Rehm georg@cl-ki.uni-osnabrueck.de 6492 Institute for Semantic Information Processing, University of Osnabrueck, FRG 6493 6494 6495File: flex.info, Node: unnamed-faq-82, Next: unnamed-faq-83, Prev: unnamed-faq-81, Up: FAQ 6496 6497unnamed-faq-82 6498============== 6499 6500 To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE 6501 Subject: Re: "flex scanner push-back overflow" 6502 In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT. 6503 Date: Thu, 20 Aug 1998 07:05:35 PDT 6504 From: Vern Paxson <vern> 6505 6506 > myflex/flex -8 sentag.tmp.l 6507 > flex scanner push-back overflow 6508 6509 Flex itself uses a flex scanner. That scanner is running out of buffer 6510 space when it tries to unput() the humongous macro you've defined. When 6511 you remove the '/'s, you make it small enough so that it fits in the buffer; 6512 removing spaces would do the same thing. 6513 6514 The fix is to either rethink how come you're using such a big macro and 6515 perhaps there's another/better way to do it; or to rebuild flex's own 6516 scan.c with a larger value for 6517 6518 #define YY_BUF_SIZE 16384 6519 6520 - Vern 6521 6522 6523File: flex.info, Node: unnamed-faq-83, Next: unnamed-faq-84, Prev: unnamed-faq-82, Up: FAQ 6524 6525unnamed-faq-83 6526============== 6527 6528 To: Jan Kort <jan@research.techforce.nl> 6529 Subject: Re: Flex 6530 In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200. 6531 Date: Sat, 05 Sep 1998 00:59:49 PDT 6532 From: Vern Paxson <vern> 6533 6534 > %% 6535 > 6536 > "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); } 6537 > ^\n { fprintf(stderr, "empty line\n"); } 6538 > . 
{ } 6539 > \n { fprintf(stderr, "new line\n"); } 6540 > 6541 > %% 6542 > -- input --------------------------------------- 6543 > TEST1 6544 > -- output -------------------------------------- 6545 > TEST1 6546 > empty line 6547 > ------------------------------------------------ 6548 6549 IMHO, it's not clear whether or not this is in fact a bug. It depends 6550 on whether you view yyless() as backing up in the input stream, or as 6551 pushing new characters onto the beginning of the input stream. Flex 6552 interprets it as the latter (for implementation convenience, I'll admit), 6553 and so considers the newline as in fact matching at the beginning of a 6554 line, as after all the last token scanned an entire line and so the 6555 scanner is now at the beginning of a new line. 6556 6557 I agree that this is counter-intuitive for yyless(), given its 6558 functional description (it's less so for unput(), depending on whether 6559 you're unput()'ing new text or scanned text). But I don't plan to 6560 change it any time soon, as it's a pain to do so. Consequently, 6561 you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak 6562 your scanner into the behavior you desire. 6563 6564 Sorry for the less-than-completely-satisfactory answer. 6565 6566 Vern 6567 6568 6569File: flex.info, Node: unnamed-faq-84, Next: unnamed-faq-85, Prev: unnamed-faq-83, Up: FAQ 6570 6571unnamed-faq-84 6572============== 6573 6574 To: Patrick Krusenotto <krusenot@mac-info-link.de> 6575 Subject: Re: Problems with restarting flex-2.5.2-generated scanner 6576 In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT. 6577 Date: Thu, 24 Sep 1998 23:28:43 PDT 6578 From: Vern Paxson <vern> 6579 6580 > I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately 6581 > trying to make my scanner restart with a new file after my parser stops 6582 > with a parse error. When my compiler restarts, the parser always 6583 > receives the token after the token (in the old file!) 
     > that caused the parser error.

     I suspect the problem is that your parser has read ahead in order
     to attempt to resolve an ambiguity, and when it's restarted it picks
     up with that token rather than reading a fresh one.  If you're using
     yacc, then the special "error" production can sometimes be used to
     consume tokens in an attempt to get the parser into a consistent state.

                     Vern


File: flex.info, Node: unnamed-faq-85, Next: unnamed-faq-86, Prev: unnamed-faq-84, Up: FAQ

unnamed-faq-85
==============

     To: Henric Jungheim <junghelh@pe-nelson.com>
     Subject: Re: flex 2.5.4a
     In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST.
     Date: Tue, 27 Oct 1998 16:50:14 PST
     From: Vern Paxson <vern>

     > This brings up a feature request:  How about a command line
     > option to specify the filename when reading from stdin?  That way one
     > doesn't need to create a temporary file in order to get the "#line"
     > directives to make sense.

     Use -o combined with -t (per the man page description of -o).

     > P.S., Is there any simple way to use non-blocking IO to parse multiple
     > streams?

     Simple, no.

     One approach might be to return a magic character on EWOULDBLOCK and
     have a rule

         .*<magic-character>     // put back .*, eat magic character

     This is off the top of my head, not sure it'll work.

                     Vern


File: flex.info, Node: unnamed-faq-86, Next: unnamed-faq-87, Prev: unnamed-faq-85, Up: FAQ

unnamed-faq-86
==============

     To: "Repko, Billy D" <billy.d.repko@intel.com>
     Subject: Re: Compiling scanners
     In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST.
     Date: Thu, 14 Jan 1999 00:25:30 PST
     From: Vern Paxson <vern>

     > It appears that maybe it cannot find the lfl library.

     The Makefile in the distribution builds it, so you should have it.
     It's exceedingly trivial, just a main() that calls yylex() and
     a yywrap() that always returns 1.

     > %%
     >       \n      ++num_lines; ++num_chars;
     >       .       ++num_chars;

     You can't indent your rules like this - that's where the errors are coming
     from.  Flex copies indented text to the output file; that's how you do
     things like

         int num_lines_seen = 0;

     to declare local variables.

                     Vern


File: flex.info, Node: unnamed-faq-87, Next: unnamed-faq-88, Prev: unnamed-faq-86, Up: FAQ

unnamed-faq-87
==============

     To: Erick Branderhorst <Erick.Branderhorst@asml.nl>
     Subject: Re: flex input buffer
     In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST.
     Date: Tue, 09 Feb 1999 21:03:37 PST
     From: Vern Paxson <vern>

     > In the flex.skl file the size of the default input buffers is set.  Can you
     > explain why this size is set and why it is such a high number.

     It's large to optimize performance when scanning large files.  You can
     safely make it a lot lower if needed.

                     Vern


File: flex.info, Node: unnamed-faq-88, Next: unnamed-faq-90, Prev: unnamed-faq-87, Up: FAQ

unnamed-faq-88
==============

     To: "Guido Minnen" <guidomi@cogs.susx.ac.uk>
     Subject: Re: Flex error message
     In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST.
     Date: Thu, 25 Feb 1999 00:11:31 PST
     From: Vern Paxson <vern>

     > I'm extending a larger scanner written in Flex and I keep running into
     > problems.
     > More specifically, I get the error message:
     > "flex: input rules are too complicated (>= 32000 NFA states)"

     Increase the definitions in flexdef.h for:

         #define JAMSTATE -32766 /* marks a reference to the state that always jams */
         #define MAXIMUM_MNS 31999
         #define BAD_SUBSCRIPT -32767

     recompile everything, and it should all work.

                     Vern


File: flex.info, Node: unnamed-faq-90, Next: unnamed-faq-91, Prev: unnamed-faq-88, Up: FAQ

unnamed-faq-90
==============

     To: "Dmitriy Goldobin" <gold@ems.chel.su>
     Subject: Re: FLEX trouble
     In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT.
     Date: Tue, 01 Jun 1999 00:15:07 PDT
     From: Vern Paxson <vern>

     > I have a trouble with FLEX. Why rule "/*".*"*/" work properly,
     > but rule "/*"(.|\n)*"*/" don't work ?

     The second of these will have to scan the entire input stream (because
     "(.|\n)*" matches an arbitrary amount of any text) in order to see if
     it ends with "*/", terminating the comment.  That potentially will overflow
     the input buffer.

     > More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error
     > 'unrecognized rule'.

     You can't use the '/' operator inside parentheses.  It's not clear
     what "(a/b)*" actually means.

     > I now use workaround with state <comment>, but single-rule is
     > better, i think.

     Single-rule is nice but will always have the problem of either setting
     restrictions on comments (like not allowing multi-line comments) and/or
     running the risk of consuming the entire input stream, as noted above.
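
As a sketch of the start-condition workaround mentioned above (an
illustration only, not part of the original message), the following C
function hand-codes the two states such a scanner moves between.  The
names `strip_comments', `ST_INITIAL' and `ST_COMMENT' are made up for
this example; a real flex scanner would express the same logic as rules
under a comment start condition.  The point it tries to show is that the
comment is consumed one small piece at a time, so nothing ever has to
hold the whole comment at once.

```c
#include <stddef.h>

/* Hypothetical scanner states, standing in for flex's INITIAL state
 * and a "comment" start condition. */
enum state { ST_INITIAL, ST_COMMENT };

/*
 * Copy `in` to `out`, dropping C-style block comments.
 * `out` must have room for strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL.
 */
size_t strip_comments(const char *in, char *out)
{
    enum state st = ST_INITIAL;
    size_t n = 0;

    while (*in != '\0') {
        if (st == ST_INITIAL) {
            if (in[0] == '/' && in[1] == '*') {
                st = ST_COMMENT;        /* comment opener: enter the comment state */
                in += 2;
            } else {
                out[n++] = *in++;       /* ordinary text is passed through */
            }
        } else {
            if (in[0] == '*' && in[1] == '/') {
                st = ST_INITIAL;        /* comment closer: back to the initial state */
                in += 2;
            } else {
                in++;                   /* discard one comment character */
            }
        }
    }
    out[n] = '\0';
    return n;
}
```

Called as `strip_comments("a /* x */ b", buf)', this sketch leaves
`a  b' in `buf'.  An unterminated comment simply swallows the rest of
the input, which is one of several policies a real scanner could choose.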

                     Vern


File: flex.info, Node: unnamed-faq-91, Next: unnamed-faq-92, Prev: unnamed-faq-90, Up: FAQ

unnamed-faq-91
==============

     Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18])
             by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100
             for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT)
     Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999
     To: vern@ee.lbl.gov
     Date: Tue, 15 Jun 1999 08:55:43 -0700
     From: "Aki Niimura" <neko@my-deja.com>
     Message-ID: <KNONDOHDOBGAEAAA@my-deja.com>
     Mime-Version: 1.0
     Cc:
     X-Sent-Mail: on
     Reply-To:
     X-Mailer: MailCity Service
     Subject: A question on flex C++ scanner
     X-Sender-Ip: 12.72.207.61
     Organization: My Deja Email  (http://www.my-deja.com:80)
     Content-Type: text/plain; charset=us-ascii
     Content-Transfer-Encoding: 7bit

     Dear Dr. Paxon,

     I have been using flex for years.
     It works very well on many projects.
     Most case, I used it to generate a scanner on C language.
     However, one project I needed to generate  a scanner
     on C++ lanuage. Thanks to your enhancement, flex did
     the job.

     Currently, I'm working on enhancing my previous project.
     I need to deal with multiple input streams (recursive
     inclusion) in this scanner (C++).
     I did similar thing for another scanner (C) as you
     explained in your documentation.

     The generated scanner (C++) has necessary methods:
     - switch_to_buffer(struct yy_buffer_state *b)
     - yy_create_buffer(istream *is, int sz)
     - yy_delete_buffer(struct yy_buffer_state *b)

     However, I couldn't figure out how to access current
     buffer (yy_current_buffer).

     yy_current_buffer is a protected member of yyFlexLexer.
     I can't access it directly.
     Then, I thought yy_create_buffer() with is = 0 might
     return current stream buffer.
     But it seems not as far
     as I checked the source. (flex 2.5.4)

     I went through the Web in addition to Flex documentation.
     However, it hasn't been successful, so far.

     It is not my intention to bother you, but, can you
     comment about how to obtain the current stream buffer?

     Your response would be highly appreciated.

     Best regards,
     Aki Niimura

     --== Sent via Deja.com http://www.deja.com/ ==--
     Share what you know. Learn what you don't.


File: flex.info, Node: unnamed-faq-92, Next: unnamed-faq-93, Prev: unnamed-faq-91, Up: FAQ

unnamed-faq-92
==============

     To: neko@my-deja.com
     Subject: Re: A question on flex C++ scanner
     In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT.
     Date: Tue, 15 Jun 1999 09:04:24 PDT
     From: Vern Paxson <vern>

     > However, I couldn't figure out how to access current
     > buffer (yy_current_buffer).

     Derive your own subclass from yyFlexLexer.

                     Vern


File: flex.info, Node: unnamed-faq-93, Next: unnamed-faq-94, Prev: unnamed-faq-92, Up: FAQ

unnamed-faq-93
==============

     To: "Stones, Darren" <Darren.Stones@nectech.co.uk>
     Subject: Re: You're the man to see?
     In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT.
     Date: Wed, 23 Jun 1999 09:01:40 PDT
     From: Vern Paxson <vern>

     > I hope you can help me.  I am using Flex and Bison to produce an interpreted
     > language.  However all goes well until I try to implement an IF statement or
     > a WHILE.  I cannot get this to work as the parser parses all the conditions
     > eg. the TRUE and FALSE conditons to check for a rule match.  So I cannot
     > make a decision!!

     You need to use the parser to build a parse tree (= abstract syntax tree),
     and when that's all done you recursively evaluate the tree, binding variables
     to values at that time.
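
As a rough illustration of the advice above (not part of the original
message), here is a minimal C sketch of a tree that is built first and
evaluated afterwards.  Everything in it is hypothetical: the node kinds,
the `mk' constructor, the `eval' function and the 26-slot `vars' array
are invented for this example, and a real interpreter would have the
parser actions call something like `mk' instead of computing values
directly.  Because evaluation happens after parsing, only the branch
selected by a condition is ever executed, and a loop body can run as
many times as its condition requires.

```c
#include <stdlib.h>

/* Hypothetical node kinds for a tiny expression/statement language. */
enum kind { N_NUM, N_VAR, N_ADD, N_LESS, N_ASSIGN, N_IF, N_WHILE, N_SEQ };

struct node {
    enum kind kind;
    int value;                /* N_NUM: literal; N_VAR, N_ASSIGN: variable index 0..25 */
    struct node *a, *b, *c;   /* children; their meaning depends on kind */
};

int vars[26];                 /* variables 'a'..'z', bound only at evaluation time */

/* Allocate one tree node; aborts if memory runs out. */
struct node *mk(enum kind kind, int value,
                struct node *a, struct node *b, struct node *c)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->kind = kind;
    n->value = value;
    n->a = a;
    n->b = b;
    n->c = c;
    return n;
}

/* Recursively evaluate a tree and return the value of the last
 * expression evaluated (0 for an empty tree or a loop that never ran). */
int eval(const struct node *n)
{
    int r = 0;

    if (n == NULL)
        return 0;
    switch (n->kind) {
    case N_NUM:    return n->value;
    case N_VAR:    return vars[n->value];
    case N_ADD:    return eval(n->a) + eval(n->b);
    case N_LESS:   return eval(n->a) < eval(n->b);
    case N_ASSIGN: return vars[n->value] = eval(n->a);
    case N_SEQ:    eval(n->a); return eval(n->b);
    case N_IF:     return eval(n->a) ? eval(n->b) : eval(n->c);
    case N_WHILE:
        while (eval(n->a))    /* the condition is re-evaluated on every pass */
            r = eval(n->b);
        return r;
    }
    return 0;
}
```

For instance, a tree for "x = 0; while (x < 5) x = x + 1;" would be an
`N_SEQ' node whose first child is an `N_ASSIGN' and whose second child
is an `N_WHILE'; evaluating it leaves 5 in the slot used for `x'.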

                     Vern


File: flex.info, Node: unnamed-faq-94, Next: unnamed-faq-95, Prev: unnamed-faq-93, Up: FAQ

unnamed-faq-94
==============

     To: Petr Danecek <petr@ics.cas.cz>
     Subject: Re: flex - question
     In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT.
     Date: Fri, 02 Jul 1999 16:52:13 PDT
     From: Vern Paxson <vern>

     > file, it takes an enormous amount of time. It is funny, because the
     > source code has only 12 rules!!! I think it looks like an exponencial
     > growth.

     Right, that's the problem - some patterns (those with a lot of
     ambiguity, which yours have because at any given time the scanner can
     be in the middle of all sorts of combinations of the different
     rules) blow up exponentially.

     For your rules, there is an easy fix.  Change the ".*" that comes after
     the directory name to "[^ ]*".  With that in place, the rules are no
     longer nearly so ambiguous, because then once one of the directories
     has been matched, no other can be matched (since they all require a
     leading blank).

     If that's not an acceptable solution, then you can enter a start state
     to pick up the .*\n after each directory is matched.

     Also note that for speed, you'll want to add a ".*" rule at the end,
     otherwise input that doesn't match any of the patterns will be matched
     very slowly, a character at a time.

                     Vern


File: flex.info, Node: unnamed-faq-95, Next: unnamed-faq-96, Prev: unnamed-faq-94, Up: FAQ

unnamed-faq-95
==============

     To: Tielman Koekemoer <tielman@spi.co.za>
     Subject: Re: Please help.
     In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT.
     Date: Thu, 08 Jul 1999 08:20:39 PDT
     From: Vern Paxson <vern>

     > I was hoping you could help me with my problem.
     >
     > I tried compiling (gnu)flex on a Solaris 2.4 machine
     > but when I ran make (after configure) I got an error.
     >
     > --------------------------------------------------------------
     > gcc -c -I. -I. -g -O parse.c
     > ./flex -t -p  ./scan.l >scan.c
     > sh: ./flex: not found
     > *** Error code 1
     > make: Fatal error: Command failed for target `scan.c'
     > -------------------------------------------------------------
     >
     > What's strange to me is that I'm only
     > trying to install flex now. I then edited the Makefile to
     > and changed where it says "FLEX = flex" to "FLEX = lex"
     > ( lex: the native Solaris one ) but then it complains about
     > the "-p" option. Is there any way I can compile flex without
     > using flex or lex?
     >
     > Thanks so much for your time.

     You managed to step on the bootstrap sequence, which first copies
     initscan.c to scan.c in order to build flex.  Try fetching a fresh
     distribution from ftp.ee.lbl.gov.  (Or you can first try removing
     ".bootstrap" and doing a make again.)

                     Vern


File: flex.info, Node: unnamed-faq-96, Next: unnamed-faq-97, Prev: unnamed-faq-95, Up: FAQ

unnamed-faq-96
==============

     To: Tielman Koekemoer <tielman@spi.co.za>
     Subject: Re: Please help.
     In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT.
     Date: Fri, 09 Jul 1999 00:27:20 PDT
     From: Vern Paxson <vern>

     > First I removed .bootstrap (and ran make) - no luck. I downloaded the
     > software but I still have the same problem. Is there anything else I
     > could try.

     Try:

         cp initscan.c scan.c
         touch scan.c
         make scan.o

     If this last tries to first build scan.c from scan.l using ./flex, then
     your "make" is broken, in which case compile scan.c to scan.o by hand.

                     Vern


File: flex.info, Node: unnamed-faq-97, Next: unnamed-faq-98, Prev: unnamed-faq-96, Up: FAQ

unnamed-faq-97
==============

     To: Sumanth Kamenani <skamenan@crl.nmsu.edu>
     Subject: Re: Error
     In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT.
     Date: Tue, 20 Jul 1999 00:18:26 PDT
     From: Vern Paxson <vern>

     > I am getting a compilation error. The error is given as "unknown symbol- yylex".

     The parser relies on calling yylex(), but you're instead using the C++ scanning
     class, so you need to supply a yylex() "glue" function that calls the yylex()
     method of an instance of the scanner (e.g., "scanner->yylex()").

                     Vern


File: flex.info, Node: unnamed-faq-98, Next: unnamed-faq-99, Prev: unnamed-faq-97, Up: FAQ

unnamed-faq-98
==============

     To: daniel@synchrods.synchrods.COM (Daniel Senderowicz)
     Subject: Re: lex
     In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST.
     Date: Tue, 23 Nov 1999 15:54:30 PST
     From: Vern Paxson <vern>

     Well, your problem is the

         switch (yybgin-yysvec-1) {      /* witchcraft */

     at the beginning of lex rules.  "witchcraft" == "non-portable".  It's
     assuming knowledge of the AT&T lex's internal variables.

     For flex, you can probably do the equivalent using a switch on YYSTATE.

                     Vern


File: flex.info, Node: unnamed-faq-99, Next: unnamed-faq-100, Prev: unnamed-faq-98, Up: FAQ

unnamed-faq-99
==============

     To: archow@hss.hns.com
     Subject: Re: Regarding distribution of flex and yacc based grammars
     In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530.
     Date: Wed, 22 Dec 1999 01:56:24 PST
     From: Vern Paxson <vern>

     > When we provide the customer with an object code distribution, is it
     > necessary for us to provide source
     > for the generated C files from flex and bison since they are generated by
     > flex and bison ?

     For flex, no.  I don't know what the current state of this is for bison.

     > Also, is there any requrirement for us to neccessarily  provide source for
     > the grammar files which are fed into flex and bison ?

     Again, for flex, no.

     See the file "COPYING" in the flex distribution for the legalese.

                     Vern


File: flex.info, Node: unnamed-faq-100, Next: unnamed-faq-101, Prev: unnamed-faq-99, Up: FAQ

unnamed-faq-100
===============

     To: Martin Gallwey <gallweym@hyperion.moe.ul.ie>
     Subject: Re: Flex, and self referencing rules
     In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST.
     Date: Sat, 19 Feb 2000 18:33:16 PST
     From: Vern Paxson <vern>

     > However, I do not use unput anywhere. I do use self-referencing
     > rules like this:
     >
     > UnaryExpr               ({UnionExpr})|("-"{UnaryExpr})

     You can't do this - flex is *not* a parser like yacc (which does indeed
     allow recursion), it is a scanner that's confined to regular expressions.

                     Vern


File: flex.info, Node: unnamed-faq-101, Next: What is the difference between YYLEX_PARAM and YY_DECL?, Prev: unnamed-faq-100, Up: FAQ

unnamed-faq-101
===============

     To: slg3@lehigh.edu (SAMUEL L. GULDEN)
     Subject: Re: Flex problem
     In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST.
     Date: Thu, 02 Mar 2000 23:00:46 PST
     From: Vern Paxson <vern>

     If this is exactly your program:

     > digit [0-9]
     > digits {digit}+
     > whitespace [ \t\n]+
     >
     > %%
     > "[" { printf("open_brac\n");}
     > "]" { printf("close_brac\n");}
     > "+" { printf("addop\n");}
     > "*" { printf("multop\n");}
     > {digits} { printf("NUMBER = %s\n", yytext);}
     > whitespace ;

     then the problem is that the last rule needs to be "{whitespace}" !

                     Vern


File: flex.info, Node: What is the difference between YYLEX_PARAM and YY_DECL?, Next: Why do I get "conflicting types for yylex" error?, Prev: unnamed-faq-101, Up: FAQ

What is the difference between YYLEX_PARAM and YY_DECL?
=======================================================

YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to
pass extra parameters when it calls yylex() from the parser.

   YY_DECL is the Flex declaration of yylex. The default is similar to
this:

     #define YY_DECL int yylex ()


File: flex.info, Node: Why do I get "conflicting types for yylex" error?, Next: How do I access the values set in a Flex action from within a Bison action?, Prev: What is the difference between YYLEX_PARAM and YY_DECL?, Up: FAQ

Why do I get "conflicting types for yylex" error?
=================================================

This is a compiler error regarding a generated Bison parser, not a Flex
scanner.  It means you need a prototype of yylex() at the top of the
Bison file.  Be sure the prototype matches YY_DECL.


File: flex.info, Node: How do I access the values set in a Flex action from within a Bison action?, Prev: Why do I get "conflicting types for yylex" error?, Up: FAQ

How do I access the values set in a Flex action from within a Bison action?
===========================================================================

With $1, $2, $3, etc. These are called "Semantic Values" in the Bison
manual.  See *note Top: (bison)Top.


File: flex.info, Node: Appendices, Next: Indices, Prev: FAQ, Up: Top

Appendix A Appendices
*********************

* Menu:

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::


File: flex.info, Node: Makefiles and Flex, Next: Bison Bridge, Prev: Appendices, Up: Appendices

A.1 Makefiles and Flex
======================

In this appendix, we provide tips for writing Makefiles to build your
scanners.

   In a traditional build environment, we say that the `.c' files are
the sources, and the `.o' files are the intermediate files. When using
`flex', however, the `.l' files are the sources, and the generated `.c'
files (along with the `.o' files) are the intermediate files.  This
requires you to carefully plan your Makefile.

   Modern `make' programs understand that `foo.l' is intended to
generate `lex.yy.c' or `foo.c', and will behave accordingly(1)(2).  The
following Makefile does not explicitly instruct `make' how to build
`scan.c' from `scan.l'. Instead, it relies on the implicit rules of the
`make' program to build the intermediate file, `scan.c':

         # Basic Makefile -- relies on implicit rules
         # Creates "myprogram" from "scan.l" and "myprogram.c"
         #
         LEX=flex
         myprogram: scan.o myprogram.o
         scan.o: scan.l

   For simple cases, the above may be sufficient. For other cases, you
may have to explicitly instruct `make' how to build your scanner.
The 7155following is an example of a Makefile containing explicit rules: 7156 7157 # Basic Makefile -- provides explicit rules 7158 # Creates "myprogram" from "scan.l" and "myprogram.c" 7159 # 7160 LEX=flex 7161 myprogram: scan.o myprogram.o 7162 $(CC) -o $@ $(LDFLAGS) $^ 7163 7164 myprogram.o: myprogram.c 7165 $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ 7166 7167 scan.o: scan.c 7168 $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ 7169 7170 scan.c: scan.l 7171 $(LEX) $(LFLAGS) -o $@ $^ 7172 7173 clean: 7174 $(RM) *.o scan.c 7175 7176 Notice in the above example that `scan.c' is in the `clean' target. 7177This is because we consider the file `scan.c' to be an intermediate 7178file. 7179 7180 Finally, we provide a realistic example of a `flex' scanner used 7181with a `bison' parser(3). There is a tricky problem we have to deal 7182with. Since a `flex' scanner will typically include a header file 7183(e.g., `y.tab.h') generated by the parser, we need to be sure that the 7184header file is generated BEFORE the scanner is compiled. We handle this 7185case in the following example: 7186 7187 # Makefile example -- scanner and parser. 7188 # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c" 7189 # 7190 LEX = flex 7191 YACC = bison -y 7192 YFLAGS = -d 7193 objects = scan.o parse.o myprogram.o 7194 7195 myprogram: $(objects) 7196 scan.o: scan.l parse.c 7197 parse.o: parse.y 7198 myprogram.o: myprogram.c 7199 7200 In the above example, notice the line, 7201 7202 scan.o: scan.l parse.c 7203 7204 , which lists the file `parse.c' (the generated parser) as a 7205dependency of `scan.o'. We want to ensure that the parser is created 7206before the scanner is compiled, and the above line seems to do the 7207trick. Feel free to experiment with your specific implementation of 7208`make'. 7209 7210 For more details on writing Makefiles, see *note Top: (make)Top. 

   ---------- Footnotes ----------

   (1) GNU `make' and GNU `automake' are two such programs that provide
implicit rules for flex-generated scanners.

   (2) GNU `automake' may generate code to execute flex in
lex-compatible mode, or to stdout. If this is not what you want, then
you should provide an explicit rule in your Makefile.am.

   (3) This example also applies to yacc parsers.


File: flex.info, Node: Bison Bridge, Next: M4 Dependency, Prev: Makefiles and Flex, Up: Appendices

A.2 C Scanners with Bison Parsers
=================================

This section describes the `flex' features useful when integrating
`flex' with `GNU bison'(1).  Skip this section if you are not using
`bison' with your scanner.  Here we discuss only the `flex' half of the
`flex' and `bison' pair.  We do not discuss `bison' in any detail.  For
more information about generating `bison' parsers, see *note Top:
(bison)Top.

   A compatible `bison' scanner is generated by declaring `%option
bison-bridge' or by supplying `--bison-bridge' when invoking `flex'
from the command line.  This instructs `flex' that the macro `yylval'
may be used. The data type for `yylval', `YYSTYPE', is typically
defined in a header file, included in section 1 of the `flex' input
file.  For a list of functions and macros available, *Note
bison-functions::.

   The declaration of yylex becomes,

           int yylex ( YYSTYPE * lvalp, yyscan_t scanner );

   If `%option bison-locations' is specified, then the declaration
becomes,

           int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner );

   Note that the macros `yylval' and `yylloc' evaluate to pointers.
Support for `yylloc' is optional in `bison', so it is optional in
`flex' as well. The following is an example of a `flex' scanner that is
compatible with `bison'.

         /* Scanner for "C" assignment statements... sort of. */
         %{
         #include <stdlib.h>   /* atoi() */
         #include <string.h>   /* strdup() */
         #include "y.tab.h"    /* Generated by bison. */
         %}

         %option bison-bridge bison-locations
         %%

         [[:digit:]]+  { yylval->num = atoi(yytext);   return NUMBER;}
         [[:alnum:]]+  { yylval->str = strdup(yytext); return STRING;}
         "="|";"       { return yytext[0];}
         .  {}
         %%

   As you can see, there really is no magic here. We just use `yylval'
as we would any other variable. The data type of `yylval' is generated
by `bison', and included in the file `y.tab.h'. Here is the
corresponding `bison' parser:

         /* Parser to convert "C" assignments to lisp. */
         %{
         /* Pass the argument to yyparse through to yylex. */
         #define YYPARSE_PARAM scanner
         #define YYLEX_PARAM   scanner
         %}
         %locations
         %pure_parser
         %union {
             int num;
             char* str;
         }
         %token <str> STRING
         %token <num> NUMBER
         %%
         assignment:
             STRING '=' NUMBER ';' {
                 printf( "(setf %s %d)", $1, $3 );
            }
         ;

   ---------- Footnotes ----------

   (1) The features described here are purely optional, and are by no
means the only way to use flex with bison.  We merely provide some glue
to ease development of your parser-scanner pair.


File: flex.info, Node: M4 Dependency, Next: Common Patterns, Prev: Bison Bridge, Up: Appendices

A.3 M4 Dependency
=================

The macro processor `m4'(1) must be installed wherever flex is
installed.  `flex' invokes `m4', found by searching the directories in
the `PATH' environment variable. Any code you place in section 1 or in
the actions will be sent through m4. Please follow these rules to
protect your code from unwanted `m4' processing.

   * Do not use symbols that begin with `m4_', such as `m4_define'
     or `m4_include', since those are reserved for `m4' macro names.
     If for some reason you need m4_ as a prefix, use a preprocessor
     #define to get your symbol past m4 unmangled.

   * Do not use the strings `[[' or `]]' anywhere in your code. The
     former is not valid in C, except within comments and strings, but
     the latter is valid in code such as `x[y[z]]'.  The solution is
     simple.  To get the literal string `"]]"', use `"]""]"'.  To get the
     array notation `x[y[z]]', use `x[y[z] ]'.  Flex will attempt to
     detect these sequences in user code, and escape them.  However,
     it's best to avoid this complexity where possible, by removing
     such sequences from your code.


   `m4' is only required at the time you run `flex'. The generated
scanner is ordinary C or C++, and does _not_ require `m4'.

   ---------- Footnotes ----------

   (1) The use of m4 is subject to change in future revisions of flex.
It is not part of the public API of flex. Do not depend on it.


File: flex.info, Node: Common Patterns, Prev: M4 Dependency, Up: Appendices

A.4 Common Patterns
===================

This appendix provides examples of common regular expressions you might
use in your scanner.

* Menu:

* Numbers::
* Identifiers::
* Quoted Constructs::
* Addresses::


File: flex.info, Node: Numbers, Next: Identifiers, Up: Common Patterns

A.4.1 Numbers
-------------

C99 decimal constant
     `([[:digit:]]{-}[0])[[:digit:]]*'

C99 hexadecimal constant
     `0[xX][[:xdigit:]]+'

C99 octal constant
     `0[01234567]*'

C99 floating point constant
           {dseq}      ([[:digit:]]+)
           {dseq_opt}  ([[:digit:]]*)
           {frac}      (({dseq_opt}"."{dseq})|{dseq}".")
           {exp}       ([eE][+-]?{dseq})
           {exp_opt}   ({exp}?)
           {fsuff}     [flFL]
           {fsuff_opt} ({fsuff}?)
           {hpref}     (0[xX])
           {hdseq}     ([[:xdigit:]]+)
           {hdseq_opt} ([[:xdigit:]]*)
           {hfrac}     (({hdseq_opt}"."{hdseq})|({hdseq}"."))
           {bexp}      ([pP][+-]?{dseq})
           {dfc}       (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt}))
           {hfc}       (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt}))

           {c99_floating_point_constant}  ({dfc}|{hfc})

     See C99 section 6.4.4.2 for the gory details.



File: flex.info, Node: Identifiers, Next: Quoted Constructs, Prev: Numbers, Up: Common Patterns

A.4.2 Identifiers
-----------------

C99 Identifier
          ucn        ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))
          nondigit    [_[:alpha:]]
          c99_id     ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})*

     Technically, the above pattern does not encompass all possible C99
     identifiers, since C99 allows for "implementation-defined"
     characters. In practice, C compilers follow the above pattern,
     with the addition of the `$' character.

UTF-8 Encoded Unicode Code Point
          [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2})



File: flex.info, Node: Quoted Constructs, Next: Addresses, Prev: Identifiers, Up: Common Patterns

A.4.3 Quoted Constructs
-----------------------

C99 String Literal
     `L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([0-7]{1,3}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8})))*\"'

C99 Comment
     `("/*"([^*]|"*"[^/])*"*/")|("/"(\\\n)*"/"[^\n]*)'

     Note that in C99, a `//'-style comment may be split across lines,
     and, contrary to popular belief, does not include the trailing
     `\n' character.

     A better way to scan `/* */' comments is by line, rather than
     matching possibly huge comments all at once.
     This will allow you
     to scan comments of unlimited length, as long as line breaks
     appear at sane intervals. This is also more efficient when used
     with automatic line number processing. *Note option-yylineno::.

          <INITIAL>{
              "/*"      BEGIN(COMMENT);
          }
          <COMMENT>{
              "*/"      BEGIN(0);
              [^*\n]+   ;
              "*"       ;
              \n        ;
          }



File: flex.info, Node: Addresses, Prev: Quoted Constructs, Up: Common Patterns

A.4.4 Addresses
---------------

IPv4 Address
          dec-octet     [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]
          IPv4address   {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet}

IPv6 Address
          h16           [0-9A-Fa-f]{1,4}
          ls32          {h16}:{h16}|{IPv4address}
          IPv6address   ({h16}:){6}{ls32}|
                        ::({h16}:){5}{ls32}|
                        ({h16})?::({h16}:){4}{ls32}|
                        (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}|
                        (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}|
                        (({h16}:){0,3}{h16})?::{h16}:{ls32}|
                        (({h16}:){0,4}{h16})?::{ls32}|
                        (({h16}:){0,5}{h16})?::{h16}|
                        (({h16}:){0,6}{h16})?::

     See RFC 2373 (http://www.ietf.org/rfc/rfc2373.txt) for details.
     Note that you have to fold the definition of `IPv6address' into one
     line and that it also matches the "unspecified address" "::".

URI
     `(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'

     This pattern is nearly useless, since it allows just about any
     character to appear in a URI, including spaces and control
     characters.  See RFC 2396 (http://www.ietf.org/rfc/rfc2396.txt)
     for details.



File: flex.info, Node: Indices, Prev: Appendices, Up: Top

Indices
*******

* Menu:

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
