\documentstyle[12pt]{article}
%\documentstyle[12pt,my]{article}
\addtolength{\oddsidemargin}{0in}
\addtolength{\textwidth}{+0.5in}
\addtolength{\topmargin}{-0.5in}   % slightly longer page
\addtolength{\textheight}{0.5in}
\renewcommand{\topfraction}{0.95}
\renewcommand{\textfraction}{0.05} % should make [h] work as desired
\parindent 0pt

%%%\renewcommand{\baselinestretch}{1.2}

\newcommand{\mysk}{\vspace{0.5cm}}

\title{\vspace{2cm}\goodbreak
\bf {\sl Aflex} \rm -- An Ada Lexical Analyzer Generator
\\ \vspace{1cm} Version 1.1 \vspace{1cm}
}
\author{\large \rm John Self \\
\ \\
Arcadia Environment Research Project \\
Department of Information and Computer Science\\
University of California, Irvine \\
\\UCI-90-18\\
\medskip\\
Adapted for the GNU Ada Compiler GNAT\\
by the ``Ada for Linux Team (ALT)'', Feb 1999\\
\\
\thanks{This work was supported in
part by the National Science Foundation under grants CCR--8704311
and CCR--8451421 with cooperation from the Defense Advanced Research
Projects Agency, and by the National Science Foundation under Award
No.\ CCR--8521398.}
}

\date{May 1990}

\begin{document}

\maketitle

\begin{titlepage}
\tableofcontents
\end{titlepage}

\section{Introduction}
\label{intro}
{\sl Aflex} is a lexical analyzer generating tool written in Ada,
designed for lexical processing of character input streams.
It is a successor to the {\sl Alex}\cite{alex} tool from UCI. {\sl Aflex}
is upwardly compatible with {\sl alex 1.0}, but is significantly
faster at generating scanners, and produces smaller scanners for
equivalent specifications. Internally, {\sl aflex} is patterned after the
{\sl flex} tool from the GNU project.
{\sl Aflex} accepts high-level rules written as regular expressions
for character string matching, and generates Ada source code comprising a
lexical analyzer along with two auxiliary Ada packages.
The main file
includes a routine that partitions the input text stream into strings
matching the expressions. Associated with each rule is an
action block composed of program fragments. Whenever a rule is recognized
in the input stream, the corresponding program fragment is executed.
This feature, combined with the powerful string pattern matching capability,
allows the user to implement a lexical analyzer for any type of application
quickly and efficiently.
For instance, {\sl aflex} can be used alone for simple lexical analysis and
statistics, or with {\sl ayacc}\cite{ayacc} to generate a parser front end.
{\sl Ayacc} is an Ada parser generator that accepts context-free grammars.

\mysk
{\sl Aflex} is a successor to the Arcadia tool {\sl Alex}\cite{alex}, which
was inspired by the popular Unix operating system tool {\it lex}
\cite{lex}. Consequently, most of {\it lex}'s features and conventions are
retained in {\sl aflex}; however, a few important differences are discussed
in section \ref{lexdiff}. There are also a few minor differences
between {\sl aflex} and {\sl alex}, which will be discussed in
section \ref{alexdiff}.

\mysk
This paper is intended to serve as both the reference manual and the
user manual for {\sl aflex}. Some knowledge of {\it lex}, while not
required, is very useful in understanding the use of {\sl aflex}.
A good introduction to {\it lex}, as well as to lexical and syntactic analysis,
can be found in \cite{dragon}, frequently referred to as ``the Dragon Book.''
Topics covered in this paper include the usage of
{\sl aflex}, a description of the operators, the source file format,
the generated output, the necessary interfaces with {\sl ayacc},
and ambiguity among rules.
The appendices provide a simple example, {\sl aflex} dependencies,
the differences between {\sl aflex}, {\sl alex}, and {\it lex}, known bugs and
limitations, and references.

\newpage
\section{Command Line Options}
Command line options are given in a different format than in the
old UCI alex. The aflex options are as follows:
\begin{description}
\item[-t]
Write the scanner output to the standard output rather than to a file.
The default names of the scanner files for base.l are base.ads and base.adb.
Note that this option is not as useful with aflex because, in addition
to the scanner file, there are files for the externally visible DFA functions
(base-dfa.ad[sb]) and the external IO functions (base-io.ad[sb]).
\item[-b]
Generate backtracking information to
{\it aflex.backtrack}.
This is a list of scanner states which require backtracking
and the input characters on which they do so. By adding rules one
can remove backtracking states. If all backtracking states
are eliminated and
{\bf -f}
is used, the generated scanner will run faster (see the
{\bf -p}
flag). Only users who wish to squeeze every last cycle out of their
scanners need worry about this option.
\item[-d]
makes the generated scanner run in
{\it debug}
mode. Whenever a pattern is recognized the scanner will
write to
{\it stderr}
a line of the form:
\begin{verbatim}

   --accepting rule #n

\end{verbatim}
Rules are numbered sequentially with the first one being 1. Rule \#0
is executed when the scanner backtracks; rule \#(n+1) (where
{\it n}
is the number of rules) indicates the default action; rule \#(n+2) indicates
that the input buffer is empty and needs to be refilled and the scan
restarted. Rules beyond (n+2) are end-of-file actions.
\item[-f]
has the same effect as lex's -f flag (do not compress the scanner
tables); the mnemonic changes from
{\it fast compilation}
to (take your pick)
{\it full table}
or
{\it fast scanner}.
The actual compilation takes
{\it longer},
since aflex is I/O bound writing out the big table.
The compilation of the Ada file containing the scanner is also likely
to take a long time because of the large arrays generated.
\item[-i]
instructs aflex to generate a
{\it case-insensitive}
scanner. The case of letters given in the aflex input patterns will
be ignored, and the rules will be matched regardless of case. The
matched text given in
{\it yytext}
will have its case preserved (i.e., it will not be folded).
\item[-p]
generates a performance report to stderr. The report
consists of comments regarding features of the aflex input file
which will cause a loss of performance in the resulting scanner.
Note that the use of
the
{\bf \verb|^|}
operator
and the
{\bf -I}
flag entail minor performance penalties.
\item[-s]
causes the
{\it default rule}
(that unmatched scanner input is echoed to
{\it stdout})
to be suppressed. If the scanner encounters input that does not
match any of its rules, it aborts with an error. This option is
useful for finding holes in a scanner's rule set.
\item[-v]
has the same meaning as for lex (print to
{\it stderr}
a summary of statistics of the generated scanner). Many more statistics
are printed, though, and the summary spans several lines. Most
of the statistics are meaningless to the casual aflex user, but the
first line identifies the version of aflex, which is useful for figuring
out where you stand with respect to patches and new releases.
\item[-E]
instructs aflex to generate additional information about each token,
including line and column numbers. This is needed for the advanced
automatic error correction option in ayacc.
\item[-I]
instructs aflex to generate an
{\it interactive}
scanner. Normally, scanners generated by aflex always look ahead one
character before deciding that a rule has been matched.
At the cost of
some scanning overhead, aflex will generate a scanner which only looks ahead
when needed. Such scanners are called
{\it interactive}
because if you want to write a scanner for an interactive system such as a
command shell, you will probably want the user's input to be terminated
with a newline, and without
{\bf -I}
the user will have to type a character in addition to the newline in order
to have the newline recognized. This leads to dreadful interactive
performance.

If all this seems too confusing, here's the general rule: if a human will
be typing input to your scanner, use
{\bf -I};
otherwise don't. If you don't care about how fast your scanners run and
don't want to make any assumptions about the input to your scanner,
always use
{\bf -I}.

Note that
{\bf -I}
cannot be used in conjunction with
{\it full tables},
i.e., the
{\bf -f}
flag.
\item[-L]
instructs aflex not to generate
{\bf \#line}
directives (see below).
\item[-T]
makes aflex run in
{\it trace}
mode. It will generate a lot of messages to stdout concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata. This option is mostly for use in maintaining aflex.
\item[-Sskeleton\_file]
overrides the default internal skeleton from which aflex constructs
its scanners. You'll probably never need this option unless you are doing
aflex maintenance or development.
\end{description}
\section{{\sl Aflex} Output}
{\sl Aflex} generates a file containing a lexical analyzer function along
with two auxiliary packages, all of which are written in Ada.
The context in which the lexical analyzer function is defined is flexible
and may be specified by the user.
For instance, the file may
contain only the lexical analyzer function as a single compilation unit which
may be called by {\sl ayacc},
or the function may be placed within a package body or embedded within a driver
routine. This scanner function, when invoked, partitions the character stream
into tokens as specified by the regular expressions defined in the rules
section of the source file. The name of the lexical analyzer
function is {\sl yylex}. Note that it returns values of type {\it token}.
Type {\it token} must be defined as an enumeration type which contains,
at a minimum, the literals {\it End\_of\_Input} and {\it Error}. It is up
to the user to make
sure that this type is visible (see Section \ref{alexayacc}). The general
format of the output file which contains this function is shown in
Figure 3.

\mysk
The auxiliary packages include a DFA and an IO package. The DFA
package contains externally visible functions and variables from the
scanner. Many of the variables in this package should not be modified
by normal user programs, but they are provided here to allow the user to
modify the internal behavior of aflex to match specific needs. Only
the functions YYText and YYLength will be needed by most programs. The
{\sl GNAT} port of {\sl aflex} generates the DFA and IO packages as child
packages of the base package. For portability and convenience, the previously
used flat package names {\sl base}\_DFA and {\sl base}\_IO are generated
as renamings of these child packages.

\mysk
The IO package contains
routines which allow {\sl yylex} to scan the input source file.
These include the unput, input, output, and yywrap functions
from {\it lex},
plus Open\_Input, Create\_Output, Close\_Input and Close\_Output,
provided for compatibility with {\sl alex}.
\mysk
It is also possible to write your own IO and DFA packages. Redefining
input is possible by changing the YY\_INPUT procedure.
For example,
you might wish to take input from an array instead of from a file. By
changing the calls to the TEXT\_IO routines to access elements of the
array, you can change the input strategy. If you change the IO or DFA
packages you should make a copy of the generated files under a
different name and change that copy, because {\sl aflex} will overwrite
the generated files whenever you rerun {\sl aflex}.

\newpage
\small
\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill

 \>\> \>{\bf with} $<$rootname$>$.DFA; \\
 \>\> \>{\bf with} $<$rootname$>$.IO; \\
 \>\> \>{\bf with} TEXT\_IO; \\
\\
 \>\> \>\verb|--| User Specified Context\\
\\
 \>\> \> \>{\bf function} yylex {\bf return} Token {\bf is} \\
 \>\> \> \>{\bf begin} \\
 \>\> \> \> \>\verb|--| Analysis of expressions \\
 \>\> \> \> \>\verb|--| Execution of user-defined actions \\
 \>\> \> \>{\bf end} yylex; \\
\\
 \>\> \>\verb|--| User Specified Context\\
\end{tabbing}
\centerline{Figure 3: Example of File Containing Lexical Analyzer}

\mysk
Before showing the general layout of the specification file, we will
describe the specification language of {\sl aflex}, namely, regular
expressions.


\section{Regular Expressions}
{\sl Aflex} distinguishes two sets of characters used to
define regular expressions: text characters and operator characters.
A regular expression specifies how a set of strings from the input
stream can be recognized. It contains text characters (which match the
corresponding characters in the strings being compared) and
operator characters (which specify repetitions, choices, and
other features). The letters of the alphabet and the digits are
always text characters.

\mysk
A rule specifies a sequence of characters to be matched. It
{\bf must} begin in column one.
The set of {\sl aflex} operators consists of
the following:

\begin{verbatim}
   " \ { } [ ] ^ $ < > ? . * + | ( ) /
\end{verbatim}

The meaning of each operator is summarized below:

\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill
 \>\verb|x| \>\>\verb|--| the character ``x" \\
 \>\verb|"x"| \>\>\verb|--| an ``x", even if x is an operator. \\
 \>\verb|\x| \>\>\verb|--| an ``x", even if x is an operator. \\
 \>\verb|^x| \>\>\verb|--| an x at the beginning of a line. \\
 \>\verb|x$| \>\>\verb|--| an x at the end of a line. \\
 \>\verb|x+| \>\>\verb|--| 1 or more instances of x. \\
 \>\verb|x*| \>\>\verb|--| 0 or more instances of x. \\
 \>\verb|x?| \>\>\verb|--| an optional x. \\
 \>\verb|(x)| \>\>\verb|--| an x. \\
 \>\verb|.| \>\>\verb|--| any character but newline. \\
 \>\verb"x|y" \>\>\verb|--| an x or a y. \\
 \>\verb|[xy]| \>\>\verb|--| the character x or the character y. \\
 \>\verb|[x-z]| \>\>\verb|--| the character x, y, or z. \\
 \>\verb|[^x]| \>\>\verb|--| any character but x. \\
 \>\verb|<y>x| \>\>\verb|--| an x when {\sl aflex} is in start condition y. \\
 \>\verb|{xx}| \>\>\verb|--| the translation of xx from the definitions section. \\
\end{tabbing}

If any of these operators is used in a regular expression as a character
literal, it must be either preceded by an escape character or surrounded by
double quotes. For example, to recognize a dollar sign \verb|$|, the correct
expression is either \verb|\$| or \verb|"$"|.
Note that a double quote cannot itself be quoted and must therefore be escaped.

\mysk
A regular expression may {\bf not} contain any spaces
unless they are within a quoted string or a character class,
or they are preceded by the \verb|"\"| operator.

\mysk
When in doubt, use parentheses.
When an {\sl aflex} operator needs to be
embedded in a string, it is often neater to quote the entire string rather
than just the operator; e.g., the string \verb|"what?"| is more readable
than either \verb|What"?"| or \verb|What\?|.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
a or "a"            The character a
Begin or "Begin"    The string Begin
\"Begin\"           The string "Begin"
^\t or ^"\t"        The tab character \t at the beginning of a line.
\n$                 The newline character \n at the end of a line.
\end{verbatim}
\normalsize

There are a few special characters which can be specified in a regular
expression:
\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill
 \>\verb|\n| \>\>\verb|--| newline \\
 \>\verb|\b| \>\>\verb|--| backspace \\
 \>\verb|\t| \>\>\verb|--| tab \\
 \>\verb|\r| \>\>\verb|--| carriage return \\
 \>\verb|\f| \>\>\verb|--| form feed \\
 \>\verb|\ddd| \>\>\verb|--| octal ASCII code \\
\end{tabbing}
The operators listed below have the following precedence:
\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill
 \>\verb|" [] ()| \>\>\>\>Highest \\
 \>\verb|+ * ?| \>\>\>\>\hspace{0.5cm}$\vdots$ \\
 \>\verb|concatenation| \>\>\>\>\hspace{0.5cm}$\vdots$ \\
 \>\verb"|" \>\>\>\>Lowest \\
\end{tabbing}

\begin{description}
 \item[Character Classes:] Classes of characters can be specified using
 the operator pair {\bf []}. Within these square brackets, the operator
 meanings are ignored except for three special characters: \verb|\|,
 $-$, and \verb|^|.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
[^abc]              Any character except a, b, or c.
[abc]               The single character a, b, or c.
[-+0-9]             The - or + sign or any digit from 0 to 9.
[\t\n\b]            The tab, newline, or backspace character.
\end{verbatim}
\normalsize

 \item[Arbitrary and Optional Characters:] The dot (``.") operator
 matches any character except newline. The ? operator indicates an
 optional character of an expression.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
ab?c                Matches either abc or ac.
ab.c                Matches all strings of length 4 having a, b and
                    c as the first, second and fourth letters, where
                    the third character is not a newline.
\end{verbatim}
\normalsize


 \item[Repeated Expressions:] Repetitions of classes are indicated by
 the operators $*$ and $+$.

\small
\begin{verbatim}
Rules                 Interpretations
-----                 ---------------
[a-z]+                Matches all strings of lowercase letters.
[A-Za-z][A-Za-z0-9]*  Matches all alphanumeric strings with a
                      leading alphabetic character.
\end{verbatim}
\normalsize


 \item[Alternation and Grouping:] The operator \verb"|" indicates alternation,
 and parentheses are used for grouping complex expressions.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
ab|cd               Matches either ab or cd.
(ab|cd+)?(ef)*      Matches such strings as abefef, efefef, cdef,
                    or cddd; but not abc, abcd, or abcdef.
\end{verbatim}
\normalsize


 \item[Context Sensitivity:] {\sl aflex} will recognize a small amount of
 surrounding context. The two simple operators for this are \verb|^| and
 \$. If the first character of an expression is \verb|^|, the expression
 will only be matched at the beginning of a line. If the very
 last character is \$, the expression will only be matched at the
 end of a line.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
^ab                 Matches ab at the beginning of a line.
ab$                 Matches ab at the end of a line.
\end{verbatim}
\normalsize


 \item[Definitions:] The operators \{ \} enclosing a name
 specify a macro definition expansion.

\small
\begin{verbatim}
Rules               Interpretations
-----               ---------------
{INTEGER}           If INTEGER is defined in the macro definition
                    section, then it will be expanded here.
\end{verbatim}
\normalsize
\end{description}


\subsection{Predefined Variables \& Routines}
\label{routines}
Once a token is matched, the textual string representation of the token
may be obtained by a call to the function {\sl yytext}, which is located
in the {\sl dfa package}. This function returns type string.

\mysk
The IO package contains
routines which allow {\sl yylex} to scan the input source file.
These include the input, output, unput and yywrap functions
from lex, plus Open\_Input, Create\_Output, Close\_Input and Close\_Output,
provided for compatibility with {\sl alex}. Note that in {\sl alex
1.0} it was mandatory to call the {\it Open\_Input} and {\it
Create\_Output} routines before calling {\it YYLex}. This is not
required in {\sl aflex}. The default input and output are attached to
the files that Ada considers to be the {\sc standard\_input} and
{\sc standard\_output}.

The following routines must be used in lieu of the normal {\sc
text\_io} routines because of the internal buffering and read-ahead done by
{\sl aflex}.

\begin{description}
\item[input] function input return character -- inputs a character from the
current {\sl aflex} input stream.
\item[unput] procedure unput(c : character) -- returns a character
already read by input to the input stream. Note that attempting to
push back more than one character at a time can cause {\sl aflex} to
raise the exception {\sc pushback\_overflow}.
\item[output] procedure output(c : character) -- outputs a character to the
current {\sl aflex} output stream.
\item[yywrap] function yywrap return boolean -- this function is
called when {\sl aflex} reaches the end of file. If {\it yywrap}
returns true, {\sl aflex} continues with the normal wrapup at the end of
input. If you wish to arrange for more input to arrive from a new
source, provide a yywrap which returns false. The default
yywrap returns true.
\item[Open\_Input] Open\_Input(fname : in String) -- uses the file named
fname as the source for input to {\it YYLex}. If this function is not
called, the default input is the Ada {\sc standard\_input}.
\item[Create\_Output] Create\_Output(fname : in String) -- uses the file named
fname as the output for {\it YYLex}. If this function is not
called, the default output is the Ada {\sc standard\_output}.
\item[Close\_Input and Close\_Output] These functions have null
bodies in {\sl aflex} and are provided only for compatibility with
{\sl alex}.
\end{description}

\mysk
There are a few predefined subroutines that may be used once a token
is matched. In many lexical processing applications, printing
the string returned by {\sl yytext}, i.e., {\tt put(yytext)}, is desired;
this action is so common that it may be written simply as {\tt ECHO}.
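For instance, a rule that recognizes Ada-style comments and simply copies
them to the output could be written as follows. This is an illustrative
sketch only; the pattern is not taken from the distribution.

\small
\begin{verbatim}
"--".*     { ECHO; }   -- echo the matched comment text unchanged
\end{verbatim}
\normalsize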

\newpage
\section{{\sl Aflex} Source Specification}
\label{specformat}

The general format of the source file is

\small
\begin{verbatim}
   definitions section
   %%
   rules section
   %%
   user defined section copied before package statement of SPEC file
   ##
   user defined section copied after package statement of SPEC file
   ##
   user defined section copied before package statement of BODY file
   ##
   user defined section copied after package statement of BODY file
   but before YYLex
   ##   -- here goes YYLex
   section copied after YYLex and before end of BODY package
\end{verbatim}
\normalsize

where \verb|%%| is used as a delimiter between sections and \verb|##|
indicates where the user supplied code and the function {\sl yylex}
will be placed. Both \verb|%%|
and \verb|##| {\it must} occur in column one.

\mysk
The definitions section is used to define macros which appear in the rules
section and also to define start conditions. The rules section defines the
regular expressions with their corresponding actions. These regular
expressions, in turn, define the tokens to be identified by the scanner.
The user defined sections allow the user to define the context in which the
{\sl yylex} function will be located. The user can include routines which
may be executed when a certain token or condition is recognized.


\subsection{Definitions Section}

The definitions section may contain both macro definitions and
start condition definitions. Macro and start condition definitions
must begin in column one and may be interspersed.
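As a sketch, a definitions section mixing the two kinds of declarations
might look like the following (the names are illustrative only; macros and
start conditions are described in detail below):

\small
\begin{verbatim}
DIGIT     [0-9]
%Start    COMMENT
LETTER    [a-zA-Z]
\end{verbatim}
\normalsize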

\subsubsection{Macros}
Macro definitions take the form:

\small
\begin{verbatim}
   name   expression
\end{verbatim}
\normalsize

where {\tt name} must begin with a letter and contain only letters,
digits and underscores, and {\tt expression} is
any string of characters that will be textually substituted for {\tt name}
wherever it is found in the rules section. At least one space must separate
{\tt name} from {\tt expression} in the definition. No syntax checking is
done on the expression; instead, the whole rule is parsed after expansion.
The macro facility is very useful in writing regular expressions which
have common substrings, and in defining often-used ranges like {\it digit}
and {\it letter}.
Perhaps its best advantage is that it gives a mnemonic name to a rather
strange regular expression, making it easier for the programmer to debug the
expressions. These macros, once defined, can be used in a
regular expression by surrounding them with \{ and \}, e.g., \verb|{DIGIT}|.
For example, the rules

\small
\begin{verbatim}
[a-zA-Z]([0-9a-zA-Z])*    {put_line ("Found an identifier");}
[0-9]+                    {put_line ("Found a number");}
\end{verbatim}
\normalsize

define identifiers and integer numbers. With macros, the source file is

\small
\begin{verbatim}
LETTER    [a-zA-Z]
DIGIT     [0-9]
%%
{LETTER}({DIGIT}|{LETTER})*    {put_line ("Found an identifier");}
{DIGIT}+                       {put_line ("Found a number");}
\end{verbatim}
\normalsize

\mysk
It is customary, although not necessary, to use all capital letters
for macro names. This allows macros to be easily identified in complex rules.
Macro names are case sensitive; e.g., \verb|{DIGIT}| and \verb|{Digit}| are
two different macro names.

\subsubsection{Start Conditions}
Left context is handled in {\sl aflex} by start conditions, which are defined
in the macro definition section.
Start conditions are declared as follows:

\begin{verbatim}
   %Start cond1 cond2 ...
\end{verbatim}

where cond1 and cond2 indicate start conditions.
Note that \%Start may be abbreviated as \%S or \%s.

\mysk
A condition is set only when the {\sl aflex} command {\tt ENTER} in the
action part is executed, e.g., {\tt ENTER(cond1);}. Thus an expression
of the form \verb|<condition>rule| will only be matched
when {\tt condition} is set. Note that {\sl aflex} uses {\tt ENTER}
instead of the {\tt BEGIN} used in {\it lex}. This is done
because {\tt BEGIN} is a keyword in Ada. The {\tt ENTER} command must
have parentheses surrounding its argument:
\begin{verbatim}
   ENTER(cond1);
\end{verbatim}

{\sl Aflex} also provides {\it exclusive start conditions}. These are
similar to normal start conditions except that they have the property that
when they are active no other rules are active. Exclusive start
conditions are declared and used like normal start conditions except
that the declaration is done with \%x instead of \%s.

\subsection{Rules Section}

The rules section contains regular expressions which define the
format of each token to be recognized by the scanner.
Each rule has the following format:

\begin{verbatim}
pattern   {action}
\end{verbatim}

where {\tt pattern} is a regular expression and {\tt action} is an Ada
code fragment enclosed between \{ and \}. A {\tt pattern} must
always begin in column one.

\mysk
While a pattern defines the format of the token, the action portion
defines
the operation to be performed by the scanner each time the corresponding
token is recognized. Therefore, the user must provide a syntactically
correct Ada code fragment.
{\sl aflex} does not check the validity of the
program portion, but rather copies it to the output package and leaves it to
the Ada compiler to detect syntax and semantic errors. There can be more
than one Ada statement in the code fragment. For example, the rule

\small
\begin{verbatim}
%%
begin|BEGIN    {copy (yytext, buffer);
                Install (yytext, symbol_table);
                return RESERVED;}
\end{verbatim}
\normalsize

recognizes the reserved word ``begin" or ``BEGIN", copies the
token string into the buffer, inserts it in the symbol table, and returns
the value RESERVED.

Note that the user must provide the procedures
{\tt copy} and {\tt install}, along with all necessary types and variables,
in the user defined sections.

\subsection{User Defined Sections}
The user defined sections allow the user to specify the context surrounding
the {\sl yylex} function. \verb|##| is used to separate the various parts
of user defined code and to indicate where the {\sl yylex} function should
be placed.
It must be present in this section and must occur in the first column.
Any text following \verb|##| on the same line is ignored. This method of
using multiple user defined sections that go to specific places in the
generated .ads and .adb files is specific to the {\sl GNAT} port of
{\sl aflex}.

\section{Ambiguous Source Rules}
When a set of regular expressions is ambiguous, {\sl aflex} uses the
following rules to choose among the regular expressions that match
the input.
\begin{enumerate}
 \item The longest string is matched.
 \item If the matched strings are of the same length, the rule given
 {\bf first} is chosen.
\end{enumerate}

For example, if the input \verb|"aabb"| matches both \verb|"a*"| and
\verb|"aab*"|, the action associated with \verb|"aab*"| is executed
because it matches four as opposed to two characters.
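The second rule is what makes the familiar keyword-before-identifier idiom
work. A hypothetical fragment (the token names are illustrative only):

\small
\begin{verbatim}
%%
begin|BEGIN              {return RESERVED;}
[a-zA-Z][a-zA-Z0-9_]*    {return IDENTIFIER;}
\end{verbatim}
\normalsize

The input ``begin" matches both patterns at the same length, so the keyword
rule wins because it is listed first; for longer input such as
``beginning", the identifier rule wins by the longest-match rule.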

\section{{\sl Aflex} and {\sl Ayacc}}
\label{alexayacc}
As briefly mentioned in Section \ref{intro}, {\sl aflex} can be integrated
with {\sl ayacc} to produce a parser.

\mysk
Since the parser generated by {\sl ayacc} expects a value of type
{\it token}, each {\sl aflex} rule should end with

\begin{verbatim}
   return (token_val);
\end{verbatim}

to return the appropriate token value. {\sl Ayacc} creates a package
defining this token type from its specification file, which in turn
should be {\it with}'ed at the beginning of the user defined section.
Thus, this token package must be compiled before the lexical analyzer.
The user is encouraged to read the Ayacc User Manual \cite{ayacc} for
more information on the interaction between {\sl aflex} and {\sl ayacc}.


\newpage
\section{Appendix A: A Detailed Example}

This section shows a complete {\sl aflex} specification file for translating
all characters to uppercase. The following file,
{\it example.l}, defines rules for recognizing lowercase and uppercase words.
If a word is in lowercase, the scanner converts it to uppercase.
In addition, the frequencies of lowercase and uppercase words
are kept in two variables defined in the user defined section.
All other characters (spaces, tabs, punctuation) remain the same.

\small
\begin{verbatim}
LOWER     [a-z]
UPPER     [A-Z]

%%

{LOWER}+  { Lower_Case := Lower_Case + 1;
            TEXT_IO.PUT(To_Upper_Case(Example_DFA.YYText)); }

          -- convert all alphabetic words in lower case
          -- to upper case

{UPPER}+  { Upper_Case := Upper_Case + 1;
            TEXT_IO.PUT(Example_DFA.YYText); }

          -- write uppercase word as is

\n        { TEXT_IO.NEW_LINE;}

.
          { TEXT_IO.PUT(Example_DFA.YYText); }
          -- write anything else as is

%%  -- The next section will go to example.ads before the package statement
with Ada.Command_Line;
##  -- The next section will go to example.ads after the package statement
procedure Sample;
##  -- The next section will go to example.adb before the package statement

##  -- The next section will go to example.adb after the package statement
procedure Sample is

   type Token is (End_of_Input, Error);

   Tok        : Token;
   Lower_Case : NATURAL := 0;   -- frequency of lower case words
   Upper_Case : NATURAL := 0;   -- frequency of upper case words

   function To_Upper_Case (Word : STRING) return STRING is
      Temp : STRING(1..Word'LENGTH);
   begin
      for i in 1..Word'LENGTH loop
         Temp(i) := CHARACTER'VAL(CHARACTER'POS(Word(i)) - 32);
      end loop;
      return Temp;
   end To_Upper_Case;
##  -- function YYLex will go here, the following lines after YYLex
begin  -- Sample

   Example_IO.Open_Input (Ada.Command_Line.Argument (1));

   Read_Input :
   loop
      Tok := YYLex;
      exit Read_Input
        when Tok = End_of_Input;
   end loop Read_Input;

   TEXT_IO.NEW_LINE;
   TEXT_IO.PUT_LINE("Number of lowercase words is => " &
                    INTEGER'IMAGE(Lower_Case));
   TEXT_IO.PUT_LINE("Number of uppercase words is => " &
                    INTEGER'IMAGE(Upper_Case));
end Sample;
\end{verbatim}
\normalsize

This source file is run through {\sl aflex} using the command

\small
\begin{verbatim}
% aflex example.l
\end{verbatim}
\normalsize

{\sl aflex} produces output files called {\it example.ads} and
{\it example.adb}, along with two packages in the files
{\it example-dfa.ads}, {\it example-dfa.adb},
{\it example-io.ads} and {\it example-io.adb}.
Assuming that the main procedure, {\sl Sample}, is used to build
an executable called {\it sample}, the Unix command

\small
\begin{verbatim}
% sample example.l
\end{verbatim}
\normalsize

prints the file {\it example.l} to the screen with all lowercase letters
converted to uppercase; that is, the output is

\newpage
\small
\begin{verbatim}
LOWER [A-Z]
UPPER [A-Z]

%%

{LOWER}+ { LOWER_CASE := LOWER_CASE + 1;
           TEXT_IO.PUT(TO_UPPER_CASE(EXAMPLE_DFA.YYTEXT)); }

           -- CONVERT ALL ALPHABETIC WORDS IN LOWER CASE
           -- TO UPPER CASE

{UPPER}+ { UPPER_CASE := UPPER_CASE + 1;
           TEXT_IO.PUT(EXAMPLE_DFA.YYTEXT); }

           -- WRITE UPPERCASE WORD AS IS

\N       { TEXT_IO.NEW_LINE; }

.        { TEXT_IO.PUT(EXAMPLE_DFA.YYTEXT); }
           -- WRITE ANYTHING ELSE AS IS

%% -- THE NEXT SECTION WILL GO TO EXAMPLE.ADS BEFORE THE PACKAGE STATEMENT
WITH ADA.COMMAND_LINE;
## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADS AFTER THE PACKAGE STATEMENT
PROCEDURE SAMPLE;
## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADB BEFORE THE PACKAGE STATEMENT

## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADB AFTER THE PACKAGE STATEMENT
PROCEDURE SAMPLE IS

   TYPE TOKEN IS (END_OF_INPUT, ERROR);

   TOK        : TOKEN;
   LOWER_CASE : NATURAL := 0;   -- FREQUENCY OF LOWER CASE WORDS
   UPPER_CASE : NATURAL := 0;   -- FREQUENCY OF UPPER CASE WORDS

   FUNCTION TO_UPPER_CASE (WORD : STRING) RETURN STRING IS
      TEMP : STRING(1..WORD'LENGTH);
   BEGIN
      FOR I IN 1..WORD'LENGTH LOOP
         TEMP(I) := CHARACTER'VAL(CHARACTER'POS(WORD(I)) - 32);
      END LOOP;
      RETURN TEMP;
   END TO_UPPER_CASE;
## -- FUNCTION YYLEX WILL GO HERE, THE FOLLOWING LINES AFTER YYLEX
BEGIN -- SAMPLE

   EXAMPLE_IO.OPEN_INPUT (ADA.COMMAND_LINE.ARGUMENT (1));

   READ_INPUT :
   LOOP
      TOK := YYLEX;
      EXIT READ_INPUT WHEN TOK = END_OF_INPUT;
   END LOOP READ_INPUT;

   TEXT_IO.NEW_LINE;
   TEXT_IO.PUT_LINE("NUMBER OF LOWERCASE WORDS IS => " &
                    INTEGER'IMAGE(LOWER_CASE));
   TEXT_IO.PUT_LINE("NUMBER OF UPPERCASE WORDS IS => " &
                    INTEGER'IMAGE(UPPER_CASE));
END SAMPLE;

Number of lowercase words is => 199
Number of uppercase words is => 127
\end{verbatim}
\normalsize


\newpage
\section{Appendix B: {\sl Aflex} Dependencies}

This release of {\sl aflex} was successfully compiled by the Ada for Linux
Team (ALT) with GNAT-3.11p running under Linux 2.2.x and glibc-2.0.

\subsection{Command Line Interface}
The following files are host dependent:
\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill
 \> \>{\sl command\_lineS.a}\\
 \> \>{\sl command\_lineB.a}\\
 \> \>{\sl file\_managerS.a}\\
 \> \>{\sl file\_managerB.a}\\
\end{tabbing}
The command\_line package function {\sc initialize\_command\_line}
breaks the command line into a vector containing
the arguments passed to the program. Note that this file may need to be
modified if the host system does not distinguish
upper and lower case on the command line.
\mysk
The external\_file\_manager package is host dependent in that it chooses
the names and suffixes of the generated files. It also sets up the
file\_type {\sc standard\_error} so that error output appears on the
screen.

\mysk
If {\sl aflex} is to be rehosted, only these files should need modification.
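\mysk
For comparison, under GNAT the standard {\tt Ada.Command\_Line} package
(already used by the example in Appendix A) provides portable access to the
same argument vector. A minimal sketch ({\tt Show\_Args} is a hypothetical
procedure name used only for illustration):

\small
\begin{verbatim}
with Ada.Command_Line;
with Ada.Text_IO;
procedure Show_Args is
begin
   -- Print each argument passed on the command line, one per line.
   for I in 1 .. Ada.Command_Line.Argument_Count loop
      Ada.Text_IO.Put_Line (Ada.Command_Line.Argument (I));
   end loop;
end Show_Args;
\end{verbatim}
\normalsize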
For more detailed information see the file PORTING in the {\sl aflex}
distribution.
\newpage
\section{Appendix C: Differences between {\sl Aflex} and {\sl Lex}}
\label{lexdiff}

Although {\sl aflex} supports most of the
conventions and features of {\it lex}, there are some differences
that the user should be aware of when porting a {\it lex} specification
to an {\sl aflex} specification.

\begin{itemize}
 \item Source file format:
 \small
 \begin{verbatim}
 definitions section
 %%
 rules section
 %%
 user defined section
 ##
 user defined section
 ##
 user defined section
 ##
 user defined section
 ##
 user defined section
 \end{verbatim}
 \normalsize


 \item Although {\sl aflex} supports most of {\it lex}'s constructs, it does
 not implement the following {\it lex} features:
\begin{tabbing}
1234\=1234\=1234\=1234\=1234\=1234 \kill
 \>-- REJECT \\
 \>-- \%x \>\>\>--- changes to the internal array sizes, but see below.
 \end{tabbing}

 \item Ada style comments are supported instead of C style comments.

 \item All template files are internalized.

 \item The input source file name must end with a ``.l'' extension.

 \item In start conditions, ENTER is used instead of BEGIN, because
 BEGIN is a keyword in Ada.
\end{itemize}

\section{Appendix D: Differences between {\sl Aflex} and {\sl Alex}}
\label{alexdiff}
While {\sl aflex} is intended to be upwardly compatible with {\sl
alex}, there are a few minor differences. Any major inconsistencies
with {\sl alex} should be considered bugs and reported.
\begin{itemize}
 \item The {\tt ENTER} calls must have parentheses around their
arguments. Parentheses were optional in {\sl alex}.

 \item It is no longer mandatory to call Open\_Input and Create\_Output
before calling YYLex.
Previously, if output was to be directed to
Standard\_Output, it was recommended that a call of
\begin{verbatim}
Create_Output("/dev/tty");
\end{verbatim}
be made. This still works, but because of differences in
implementation it may cause difficulties when redirecting output with
{\sc unix} shell pipes and redirection. Instead, simply do not call
Create\_Output, and output will go to the default {\sc standard\_output}.

 \item Compilation order: with GNAT the compilation order of the
generated modules does not matter.
\end{itemize}

\newpage
\section{Appendix E: Known Bugs and Limitations}
\begin{itemize}

\item Some trailing context
patterns cannot be properly matched and generate
warning messages (``dangerous trailing context''). These are
patterns where the ending of the
first part of the rule matches the beginning of the second
part, such as ``zx*/xy*'', where the `x*' matches the `x' at
the beginning of the trailing context. ({\it Lex} does not get these
patterns right either.)

\item {\it Variable}
trailing context (where both the leading and trailing parts do not have
a fixed length) entails a substantial performance loss.

\item For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the above-mentioned
performance loss. In particular, parts using `|' or `\{n\}' are always
considered variable-length.

\item Nulls are not allowed in {\sl aflex} inputs or in the inputs to
scanners generated by {\sl aflex}. Their presence generates fatal
errors.

\item Pushing back definitions enclosed in ()'s can result in nasty,
difficult-to-understand problems like:
\begin{verbatim}

 {DIG} [0-9] -- a digit

\end{verbatim}
in which the pushed-back text is ``([0-9] -- a digit)''.
\item Due to both buffering of input and read-ahead, you cannot intermix
calls to text\_io routines, such as
{\bf text\_io.get},
with {\sl aflex} rules and expect them to work. Call
{\bf input}
instead.

\item There are still more features that could be
implemented (especially REJECT). Also, the speed of the compressed
scanners could be improved.

\item The utility needs more complete documentation, especially more
information on modifying the internals.
\end{itemize}

\newpage
\bibliographystyle{alpha}
\bibliography{aflex_user_man}
\end{document}