LEX(1)                     386BSD Reference Manual                     LEX(1)

NAME
     lex - fast lexical analyzer generator

SYNOPSIS
     lex [-bcdfinpstvFILT8 -C[efmF] -Sskeleton] [file ...]

DESCRIPTION
     Lex is a tool for generating scanners: programs which recognize lexical
     patterns in text. Lex reads the given input files, or its standard input
     if no file names are given, for a description of a scanner to generate.
     The description is in the form of pairs of regular expressions and C
     code, called rules. Lex generates as output a C source file, lex.yy.c,
     which defines a routine yylex(). This file is compiled and linked with
     the -lfl library to produce an executable. When the executable is run,
     it analyzes its input for occurrences of the regular expressions.
     Whenever it finds one, it executes the corresponding C code.

     For full documentation, see Lexdoc. This manual entry is intended for
     use as a quick reference.

OPTIONS
     Lex has the following options:

     -b      Generate backtracking information to lex.backtrack. This is a
             list of scanner states which require backtracking and the input
             characters on which they do so. By adding rules one can remove
             backtracking states. If all backtracking states are eliminated
             and -f or -F is used, the generated scanner will run faster.

     -c      Is a do-nothing, deprecated option included for POSIX compliance.

             NOTE: in previous releases of Lex, -c specified table-compression
             options. This functionality is now given by the -C flag. To
             ease the impact of this change, when lex encounters -c, it
             currently issues a warning message and assumes that -C was
             desired instead. In the future this "promotion" of -c to -C will
             go away in the name of full POSIX compliance (unless the POSIX
             meaning is removed first).

     -d      Makes the generated scanner run in debug mode. Whenever a
             pattern is recognized and the global yy_Lex_debug is non-zero
             (which is the default), the scanner will write to stderr a line
             of the form:

                   --accepting rule at line 53 ("the matched text")

             The line number refers to the location of the rule in the file
             defining the scanner (i.e., the file that was fed to lex).
             Messages are also generated when the scanner backtracks, accepts
             the default rule, reaches the end of its input buffer (or
             encounters a NUL; the two look the same as far as the scanner's
             concerned), or reaches an end-of-file.

     -f      Specifies (take your pick) full table or fast scanner. No table
             compression is done. The result is large but fast. This option
             is equivalent to -Cf (see below).

     -i      Instructs lex to generate a case-insensitive scanner. The case
             of letters given in the lex input patterns will be ignored, and
             tokens in the input will be matched regardless of case. The
             matched text given in yytext will have the preserved case (i.e.,
             it will not be folded).

     -n      Is another do-nothing, deprecated option included only for POSIX
             compliance.

     -p      Generates a performance report to stderr. The report consists of
             comments regarding features of the lex input file which will
             cause a loss of performance in the resulting scanner.

     -s      Causes the default rule (that unmatched scanner input is echoed
             to stdout) to be suppressed. If the scanner encounters input
             that does not match any of its rules, it aborts with an error.

     -t      Instructs lex to write the scanner it generates to standard
             output instead of lex.yy.c.

     -v      Specifies that lex should write to stderr a summary of
             statistics regarding the scanner it generates.

     -F      Specifies that the fast scanner table representation should be
             used. This representation is about as fast as the full table
             representation (-f), and for some sets of patterns will be
             considerably smaller (and for others, larger). See Lexdoc for
             details.

             This option is equivalent to -CF (see below).

     -I      Instructs lex to generate an interactive scanner, that is, a
             scanner which stops immediately rather than looking ahead if it
             knows that the currently scanned text cannot be part of a longer
             rule's match. Again, see Lexdoc for details.

             Note, -I cannot be used in conjunction with full or fast tables,
             i.e., the -f, -F, -Cf, or -CF flags.

     -L      Instructs lex not to generate #line directives in lex.yy.c. The
             default is to generate such directives so error messages in the
             actions will be correctly located with respect to the original
             lex input file, and not to the fairly meaningless line numbers
             of lex.yy.c.

     -T      Makes lex run in trace mode. It will generate a lot of messages
             to stdout concerning the form of the input and the resultant
             non-deterministic and deterministic finite automata. This option
             is mostly for use in maintaining lex.

     -8      Instructs lex to generate an 8-bit scanner. On some sites, this
             is the default. On others, the default is 7-bit characters. To
             see which is the case, check the verbose (-v) output for
             "equivalence classes created". If the denominator of the number
             shown is 128, then by default lex is generating 7-bit
             characters. If it is 256, then the default is 8-bit characters.

     -C[efmF]
             Controls the degree of table compression. The default setting is
             -Cem.

             -C      A lone -C specifies that the scanner tables should be
                     compressed but neither equivalence classes nor
                     meta-equivalence classes should be used.

             -Ce     Directs lex to construct equivalence classes, i.e., sets
                     of characters which have identical lexical properties.
                     Equivalence classes usually give dramatic reductions in
                     the final table/object file sizes (typically a factor of
                     2-5) and are pretty cheap performance-wise (one array
                     look-up per character scanned).

             -Cf     Specifies that the full scanner tables should be
                     generated - lex should not compress the tables by taking
                     advantage of similar transition functions for different
                     states.

             -CF     Specifies that the alternate fast scanner representation
                     (described in Lexdoc) should be used.

             -Cm     Directs lex to construct meta-equivalence classes, which
                     are sets of equivalence classes (or characters, if
                     equivalence classes are not being used) that are commonly
                     used together. Meta-equivalence classes are often a big
                     win when using compressed tables, but they have a
                     moderate performance impact (one or two "if" tests and
                     one array look-up per character scanned).

             -Cem    (Default) Generate both equivalence classes and
                     meta-equivalence classes. This setting provides the
                     highest degree of table compression.

             Faster-executing scanners can be traded off at the cost of larger
             tables with the following generally being true:

                   slowest & smallest
                         -Cem
                         -Cm
                         -Ce
                         -C
                         -C{f,F}e
                         -C{f,F}
                   fastest & largest

             -C options are not cumulative; whenever the flag is encountered,
             the previous -C settings are forgotten.

             The options -Cf or -CF and -Cm do not make sense together - there
             is no opportunity for meta-equivalence classes if the table is
             not being compressed.
             Otherwise the options may be freely mixed.

     -Sskeleton_file
             Overrides the default skeleton file from which lex constructs
             its scanners. Useful for lex maintenance or development.

SUMMARY OF LEX REGULAR EXPRESSIONS
     The patterns in the input are written using an extended set of regular
     expressions. These are:

     x              Match the character 'x'.
     .              Any character except newline.
     [xyz]          A "character class"; in this case, the pattern matches
                    either an 'x', a 'y', or a 'z'.
     [abj-oZ]       A "character class" with a range in it; matches an 'a', a
                    'b', any letter from 'j' through 'o', or a 'Z'.
     [^A-Z]         A "negated character class", i.e., any character but those
                    in the class. In this case, any character except an
                    uppercase letter.
     [^A-Z\n]       Any character except an uppercase letter or a newline.
     r*             Zero or more r's, where r is any regular expression.
     r+             One or more r's.
     r?             Zero or one r's (that is, "an optional r").
     r{2,5}         Anywhere from two to five r's.
     r{2,}          Two or more r's.
     r{4}           Exactly 4 r's.
     {name}         The expansion of the "name" definition (see above).
     "[xyz]\"foo"   The literal string: [xyz]"foo.
     \X             If X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the
                    ANSI-C interpretation of \X. Otherwise, a literal 'X'
                    (used to escape operators such as '*').
     \123           The character with octal value 123.
     \x2a           The character with hexadecimal value 2a.
     (r)            Match an r; parentheses are used to override precedence
                    (see below).

     rs             The regular expression r followed by the regular
                    expression s; called "concatenation".

     r|s            Either an r or an s.

     r/s            An r but only if it is followed by an s. The s is not part
                    of the matched text. This type of pattern is called
                    "trailing context".
     ^r             An r, but only at the beginning of a line.
     r$             An r, but only at the end of a line. Equivalent to "r/\n".

     <s>r           An r, but only in start condition s (see below for
                    discussion of start conditions).
     <s1,s2,s3>r
                    Same, but in any of start conditions s1, s2, or s3.

     <<EOF>>        An end-of-file.
     <s1,s2><<EOF>>
                    An end-of-file when in start condition s1 or s2.

     The regular expressions listed above are grouped according to precedence,
     from highest precedence at the top to lowest at the bottom. Those
     grouped together have equal precedence.

     Some notes on patterns:

     Negated character classes match newlines unless "\n" (or an equivalent
     escape sequence) is one of the characters explicitly present in the
     negated character class (e.g., "[^A-Z\n]").

     A rule can have at most one instance of trailing context (the '/'
     operator or the '$' operator). The start condition, '^', and "<<EOF>>"
     patterns can only occur at the beginning of a pattern and, like '/' and
     '$', cannot be grouped inside parentheses. The following are all
     illegal:

           foo/bar$
           foo(bar$)
           foo^bar
           <sc1>foo<sc2>bar

SUMMARY OF SPECIAL ACTIONS
     In addition to arbitrary C code, the following can appear in actions:

     ECHO        Copies yytext to the scanner's output.

     BEGIN       Followed by the name of a start condition, places the scanner
                 in the corresponding start condition.

     REJECT      Directs the scanner to proceed on to the "second best" rule
                 which matched the input (or a prefix of the input). yytext
                 and yyleng are set up appropriately. Note that REJECT is a
                 particularly expensive feature in terms of scanner
                 performance; if it is used in any of the scanner's actions it
                 will slow down all of the scanner's matching. Furthermore,
                 REJECT cannot be used with the -f or -F options.

                 Note also that unlike the other special actions, REJECT is a
                 branch; code immediately following it in the action will not
                 be executed.

     yymore()    Tells the scanner that the next time it matches a rule, the
                 corresponding token should be appended onto the current value
                 of yytext rather than replacing it.

     yyless(n)   Returns all but the first n characters of the current token
                 back to the input stream, where they will be rescanned when
                 the scanner looks for the next match. yytext and yyleng are
                 adjusted appropriately (e.g., yyleng will now be equal to n).

     unput(c)    Puts the character c back onto the input stream. It will be
                 the next character scanned.

     input()     Reads the next character from the input stream (this routine
                 is called yyinput() if the scanner is compiled using C++).

     yyterminate()
                 Can be used in lieu of a return statement in an action. It
                 terminates the scanner and returns a 0 to the scanner's
                 caller, indicating "all done".

                 By default, yyterminate() is also called when an end-of-file
                 is encountered. It is a macro and may be redefined.

     YY_NEW_FILE
                 Is an action available only in <<EOF>> rules. It means
                 "Okay, I've set up a new input file, continue scanning".

     yy_create_buffer(file, size)
                 Takes a FILE pointer and an integer size. It returns a
                 YY_BUFFER_STATE handle to a new input buffer large enough to
                 accommodate size characters and associated with the given
                 file. When in doubt, use YY_BUF_SIZE for the size.

     yy_switch_to_buffer(new_buffer)
                 Switches the scanner's processing to scan for tokens from the
                 given buffer, which must be a YY_BUFFER_STATE.

     yy_delete_buffer(buffer)
                 Deletes the given buffer.

VALUES AVAILABLE TO THE USER
     char *yytext
                 Holds the text of the current token. It may not be modified.

     int yyleng  Holds the length of the current token. It may not be
                 modified.

     FILE *yyin  Is the file which by default lex reads from. It may be
                 redefined but doing so only makes sense before scanning
                 begins. Changing it in the middle of scanning will have
                 unexpected results since lex buffers its input. Once
                 scanning terminates because an end-of-file has been seen,
                 void yyrestart(FILE *new_file) may be called to point yyin at
                 the new input file.

     FILE *yyout
                 Is the file to which ECHO actions are done. It can be
                 reassigned by the user.

     YY_CURRENT_BUFFER
                 Returns a YY_BUFFER_STATE handle to the current buffer.

MACROS THE USER CAN REDEFINE
     YY_DECL     Controls how the scanning routine is declared. By default,
                 it is "int yylex()", or, if prototypes are being used, "int
                 yylex(void)". This definition may be changed by redefining
                 the "YY_DECL" macro. Note that if you give arguments to the
                 scanning routine using a K&R-style/non-prototyped function
                 declaration, you must terminate the definition with a
                 semicolon (;).

     YY_INPUT    The nature of how the scanner gets its input can be
                 controlled by redefining the YY_INPUT macro. YY_INPUT's
                 calling sequence is "YY_INPUT(buf,result,max_size)". Its
                 action is to place up to max_size characters in the
                 character array buf and return in the integer variable result
                 either the number of characters read or the constant YY_NULL
                 (0 on Unix systems) to indicate EOF. The default YY_INPUT
                 reads from the global file-pointer "yyin".
                 A sample redefinition of YY_INPUT (in the definitions section
                 of the input file):

                       %{
                       /* Keep getchar()'s int result so EOF is detected
                          even where plain char is unsigned. */
                       #undef YY_INPUT
                       #define YY_INPUT(buf,result,max_size) \
                           { \
                           int c = getchar(); \
                           result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
                           }
                       %}

     YY_INPUT    When the scanner receives an end-of-file indication from
                 YY_INPUT, it then checks the yywrap() function. If yywrap()
                 returns false (zero), then it is assumed that the function
                 has gone ahead and set up yyin to point to another input
                 file, and scanning continues. If it returns true (non-zero),
                 then the scanner terminates, returning 0 to its caller.

     yywrap      The default yywrap() always returns 1. Presently, to
                 redefine it you must first "#undef yywrap", as it is
                 currently implemented as a macro. It is likely that yywrap()
                 will soon be defined to be a function rather than a macro.

     YY_USER_ACTION
                 Can be redefined to provide an action which is always
                 executed prior to the matched rule's action.

     YY_USER_INIT
                 The macro YY_USER_INIT may be redefined to provide an action
                 which is always executed before the first scan.

     YY_BREAK    In the generated scanner, the actions are all gathered in one
                 large switch statement and separated using YY_BREAK, which
                 may be redefined. By default, it is simply a "break", to
                 separate each rule's action from the following rule's.

FILES
     lex.skel       skeleton scanner.
     lex.yy.c       generated scanner (called lexyy.c on some systems).
     lex.backtrack  backtracking information for the -b flag (called lex.bck
                    on some systems).

SEE ALSO
     lex(1), yacc(1), sed(1), awk(1).

     lexdoc.

     M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator.
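
     The YY_INPUT calling sequence described under MACROS THE USER CAN
     REDEFINE can also be used to scan text held in memory rather than a
     file. The following is only a sketch under stated assumptions:
     mem_src, mem_pos, mem_set_source() and mem_read() are hypothetical
     names that are not part of lex, and YY_NULL is defined locally only so
     that the fragment compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* YY_NULL normally comes from the generated scanner; it is defined here
   only so that this sketch compiles outside of lex.yy.c. */
#ifndef YY_NULL
#define YY_NULL 0
#endif

/* Hypothetical state (not part of lex): the text to scan and a cursor. */
static const char *mem_src = "";
static size_t mem_pos = 0;

/* Hypothetical helper: point the reader at a new NUL-terminated string. */
static void mem_set_source(const char *text)
{
    assert(text != NULL);
    mem_src = text;
    mem_pos = 0;
}

/* Hypothetical helper: copy up to max_size bytes into buf and return the
   number copied, or YY_NULL once the string is exhausted. */
static int mem_read(char *buf, int max_size)
{
    size_t left, n;

    assert(buf != NULL && max_size >= 0);
    left = strlen(mem_src) - mem_pos;
    n = left < (size_t)max_size ? left : (size_t)max_size;
    memcpy(buf, mem_src + mem_pos, n);
    mem_pos += n;
    return n > 0 ? (int)n : YY_NULL;
}

/* Same calling sequence as the default macro: fill buf, set result. */
#undef YY_INPUT
#define YY_INPUT(buf, result, max_size) \
    ((result) = mem_read((buf), (max_size)))
```

     In a lex input file these definitions would sit between %{ and %} in
     the definitions section, with mem_set_source() called before the first
     call to yylex(). Handing back up to max_size characters per call,
     rather than one character at a time, is a design choice of this sketch.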

DIAGNOSTICS
     reject_used_but_not_detected undefined
     or
     yymore_used_but_not_detected undefined
                 These errors can occur at compile time. They indicate that
                 the scanner uses REJECT or yymore() but that lex failed to
                 notice the fact, meaning that lex scanned the first two
                 sections looking for occurrences of these actions and failed
                 to find any, but somehow you snuck some in (via a #include
                 file, for example). Make an explicit reference to the action
                 in your lex input file. Note that previously lex supported a
                 %used/%unused mechanism for dealing with this problem; this
                 feature is still supported but now deprecated, and will go
                 away soon unless the author hears from people who can argue
                 compellingly that they need it.

     lex scanner jammed
                 A scanner compiled with -s has encountered an input string
                 which wasn't matched by any of its rules.

     lex input buffer overflowed
                 A scanner rule matched a string long enough to overflow the
                 scanner's internal input buffer (16K bytes, controlled by
                 YY_BUF_MAX in lex.skel).

     scanner requires -8 flag
                 Your scanner specification includes recognizing 8-bit
                 characters, you did not specify the -8 flag, and your site
                 has not installed lex with -8 as the default.

     too many %t classes!
                 You managed to put every single character into its own %t
                 class. Lex requires that at least one of the classes share
                 characters.

HISTORY
     A lex appeared in Version 6 AT&T UNIX. The version this man page
     describes is derived from code contributed by Vern Paxson.

AUTHOR
     Vern Paxson, with the help of many ideas and much inspiration from Van
     Jacobson. Original version by Jef Poskanzer.

     See Lexdoc for additional credits and the address to send comments to.

BUGS
     Some trailing context patterns cannot be properly matched and generate
     warning messages ("Dangerous trailing context"). These are patterns
     where the ending of the first part of the rule matches the beginning of
     the second part, such as "zx*/xy*", where the 'x*' matches the 'x' at the
     beginning of the trailing context. (Note that the POSIX draft states
     that the text matched by such patterns is undefined.)

     For some trailing context rules, parts which are actually fixed-length
     are not recognized as such, leading to the abovementioned performance
     loss. In particular, parts using '|' or {n} (such as "foo{3}") are
     always considered variable-length.

     Combining trailing context with the special '|' action can result in
     fixed trailing context being turned into the more expensive variable
     trailing context. This happens in the following example:

           %%
           abc      |
           xyz/def

     Use of unput() invalidates yytext and yyleng.

     Use of unput() to push back more text than was matched can result in the
     pushed-back text matching a beginning-of-line ('^') rule even though it
     didn't come at the beginning of the line (though this is rare!).

     Pattern-matching of NUL's is substantially slower than matching other
     characters.

     Lex does not generate correct #line directives for code internal to the
     scanner; thus, bugs in lex.skel yield bogus line numbers.

     Due to both buffering of input and read-ahead, you cannot intermix calls
     to <stdio.h> routines, such as getchar(), with lex rules and expect it
     to work. Call input() instead.

     The total number of table entries listed by the -v flag excludes the
     table entries needed to determine what rule has been matched.
     The number of entries is equal to the number of DFA states if the scanner
     does not use REJECT, and somewhat greater than the number of states if
     it does.

     REJECT cannot be used with the -f or -F options.

     Some of the macros, such as yywrap(), may in the future become functions
     which live in the -lfl library. This will doubtless break a lot of code,
     but may be required for POSIX-compliance.

     The lex internal algorithms need documentation.

BSD Experimental                 July 24, 1991