1\documentstyle[12pt]{article}
2%\documentstyle[12pt,my]{article}
3\addtolength{\oddsidemargin}{0in}
4\addtolength{\textwidth}{+0.5in}
5\addtolength{\topmargin}{-0.5in}       % slightly longer page
6\addtolength{\textheight}{0.5in}
7\renewcommand{\topfraction}{0.95}
8\renewcommand{\textfraction}{0.05}  %should make [h] work as desired
9\parindent 0pt
10
11%%%\renewcommand{baselinestretch}{1.2}
12
13\newcommand{\mysk}{\vspace{0.5cm}}
14
15\title{\vspace{2cm}\goodbreak
16\bf {\sl Aflex} \rm -- An Ada Lexical Analyzer Generator
17\\ \vspace{1cm} Version 1.1 \vspace{1cm}
18}
19\author{\large \rm John Self \\
20\ \\
21Arcadia Environment Research Project \\
22Department of Information and Computer Science\\
23University of California, Irvine \\
24\\UCI-90-18\\
25\medskip\\
26Adapted for the GNU Ada Compiler GNAT\\
27by the ``Ada for Linux Team (ALT)'', Feb 1999\\
28\\
29\thanks{This work was supported in
30part by the National Science Foundation under grants CCR--8704311
31and CCR--8451421 with cooperation from the Defense Advanced Research
32Projects Agency, and by the National Science Foundation under Award
33No. CCR-8521398.}
34}
35
36\date{May 1990}
37
38\begin{document}
39
40\maketitle
41
42\begin{titlepage}
43\tableofcontents
44\end{titlepage}
45
46\section{Introduction}
47\label{intro}
48{\sl Aflex} is a lexical analyzer generating tool written in Ada
49designed for lexical
50processing of character input streams.
51It is a successor to the {\sl Alex}\cite{alex} tool from UCI.  {\sl Aflex}
52is upwardly compatible with {\sl alex 1.0}, but is significantly
53faster at generating scanners, and produces smaller scanners for
54equivalent specifications.  Internally {\sl aflex} is patterned after the
55{\sl flex} tool from the GNU project.
56{\sl Aflex} accepts high level rules  written in regular expressions
57for character string matching, and generates Ada source code comprising a
58lexical analyzer along with two auxiliary Ada packages.  The main file
59includes a routine that partitions the input text stream into strings
60matching the expressions.  Associated with each rule is an
61action block composed of program fragments.  Whenever a rule is recognized
62in the input stream, the corresponding program fragment is executed.
63This feature, combined with the powerful string pattern matching capability,
64allows the user to implement a lexical analyzer for any type of application
65efficiently and quickly.
66For instance, {\sl aflex} can be used alone for simple lexical analysis and
67statistics, or with {\sl ayacc} \cite{ayacc} to generate a parser front-end.
68{\sl Ayacc} is an Ada parser generator that accepts context-free grammars.
69
70\mysk
71{\sl Aflex} is a successor to the Arcadia tool {\sl Alex}\cite{alex} which
72was inspired by the popular Unix operating system tool, {\it lex}
\cite{lex}.  Consequently, most of {\it lex}'s features and conventions are
74retained in {\sl aflex}; however, a few important differences are discussed
75in section \ref{lexdiff}.  There are also a few minor differences
76between {\sl aflex} and {\sl alex} which will be discussed in
77section \ref{alexdiff}.
78
79\mysk
80This paper is intended to serve as both the reference manual and the
81user manual for {\sl aflex}.  Some knowledge of {\it lex}, while not
required, would be very useful in understanding the use of {\sl aflex}.
83A good introduction to {\it lex}, as well as lexical and syntactic analysis,
84can be found in \cite{dragon}, frequently referred to as ``the Dragon Book.''
Topics covered in this paper include the usage of
{\sl aflex}, a description of the operators, the source file format,
the generated output, the necessary interfaces with {\sl ayacc},
and the handling of ambiguity among rules.
89The appendices provide a simple example, {\sl aflex} dependencies,
the differences between {\sl aflex}, {\sl alex}, and {\it lex}, known bugs and
91limitations, and references.
92
93\newpage
94\section{Command Line Options}
Command line options are given in a different format than in the
old UCI {\sl alex}.  The {\sl aflex} options are as follows:
97\begin{description}
98\item[-t]
99Write the scanner output to the standard output rather than to a file.
The default names of the scanner files for base.l are base.ads and base.adb.
Note that this option is less useful with aflex because, in addition
to the scanner file, there are files for the externally visible DFA functions
(base-dfa.ad[sb]) and the external IO functions (base-io.ad[sb]).
104\item[-b]
Generate backtracking information to
{\it aflex.backtrack}.
107This is a list of scanner states which require backtracking
108and the input characters on which they do so.  By adding rules one
109can remove backtracking states.  If all backtracking states
110are eliminated and
111{\bf -f}
112is used, the generated scanner will run faster (see the
113{\bf -p}
114flag).  Only users who wish to squeeze every last cycle out of their
115scanners need worry about this option.
116\item[-d]
117makes the generated scanner run in
118{\it debug}
119mode.  Whenever a pattern is recognized the scanner will
120write to
121{\it stderr}
122a line of the form:
123\begin{verbatim}
124
125    --accepting rule #n
126
127\end{verbatim}
128Rules are numbered sequentially with the first one being 1.  Rule \#0
129is executed when the scanner backtracks; Rule \#(n+1) (where
130{\it n}
131is the number of rules) indicates the default action; Rule \#(n+2) indicates
132that the input buffer is empty and needs to be refilled and then the scan
133restarted.  Rules beyond (n+2) are end-of-file actions.
134\item[-f]
135has the same effect as lex's -f flag (do not compress the scanner
136tables); the mnemonic changes from
137{\it fast compilation}
138to (take your pick)
139{\it full table}
140or
141{\it fast scanner.}
142The actual compilation takes
143{\it longer,}
144since aflex is I/O bound writing out the big table.
145The compilation of the Ada file containing the scanner is also likely
146to take a long time because of the large arrays generated.
147\item[-i]
148instructs aflex to generate a
149{\it case-insensitive}
150scanner.  The case of letters given in the aflex input patterns will
151be ignored, and the rules will be matched regardless of case.  The
152matched text given in
153{\it yytext}
154will have the preserved case (i.e., it will not be folded).
155\item[-p]
156generates a performance report to stderr.  The report
157consists of comments regarding features of the aflex input file
158which will cause a loss of performance in the resulting scanner.
159Note that the use of
160the
161{\bf \verb|^|}
162operator
163and the
164{\bf -I}
165flag entail minor performance penalties.
166\item[-s]
167causes the
168{\it default rule}
169(that unmatched scanner input is echoed to
{\it stdout})
171to be suppressed.  If the scanner encounters input that does not
172match any of its rules, it aborts with an error.  This option is
173useful for finding holes in a scanner's rule set.
174\item[-v]
175has the same meaning as for lex (print to
176{\it stderr}
177a summary of statistics of the generated scanner).  Many more statistics
178are printed, though, and the summary spans several lines.  Most
179of the statistics are meaningless to the casual aflex user, but the
180first line identifies the version of aflex, which is useful for figuring
181out where you stand with respect to patches and new releases.
182\item[-E]
183instructs aflex to generate additional information about each token,
including line and column numbers.  This is needed for the advanced
automatic error correction option in ayacc.
186\item[-I]
187instructs aflex to generate an
188{\it interactive}
189scanner.  Normally, scanners generated by aflex always look ahead one
190character before deciding that a rule has been matched.  At the cost of
191some scanning overhead, aflex will generate a scanner which only looks ahead
192when needed.  Such scanners are called
193{\it interactive}
194because if you want to write a scanner for an interactive system such as a
195command shell, you will probably want the user's input to be terminated
196with a newline, and without
197{\bf -I}
198the user will have to type a character in addition to the newline in order
199to have the newline recognized.  This leads to dreadful interactive
200performance.
201
If all this seems too confusing, here's the general rule: if a human will
203be typing in input to your scanner, use
204{\bf -I,}
205otherwise don't; if you don't care about how fast your scanners run and
206don't want to make any assumptions about the input to your scanner,
207always use
208{\bf -I.}
209
210Note,
211{\bf -I}
212cannot be used in conjunction with
{\it full tables},
i.e., the
215{\bf -f}
216flag.
217\item[-L]
instructs aflex not to generate
{\bf \#line}
directives.
221\item[-T]
222makes aflex run in
223{\it trace}
224mode.  It will generate a lot of messages to stdout concerning
225the form of the input and the resultant non-deterministic and deterministic
226finite automatons.  This option is mostly for use in maintaining aflex.
227\item[-Sskeleton\_file]
228overrides the default internal skeleton from which aflex constructs
229its scanners.  You'll probably never need this option unless you are doing
230aflex maintenance or development.
231\end{description}
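
For example, a typical invocation that asks for a case-insensitive
scanner and a statistics summary might look like the following (the
file name {\it scanner.l} is made up):

\small
\begin{verbatim}
% aflex -i -v scanner.l
\end{verbatim}
\normalsize

Most of the options are independent of one another; the one documented
restriction is that {\bf -I} cannot be used together with {\bf -f}.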
232\section{{\sl Aflex} Output}
233{\sl Aflex} generates a file containing a lexical analyzer function along
234with two auxiliary packages, all of which are written in Ada.
235The context in which the lexical analyzer function is defined is flexible
236and may be specified by the user.  For instance,  the file may only
237contain the lexical analyzer function as a single compilation unit which
238may be called by {\sl ayacc},
239or it may be placed within a package body or embedded within a driver
240routine.  This scanner function, when invoked, partitions the character stream
241into tokens as specified by the regular expressions defined in the rules
242section of the source file.  The name of the lexical analyzer
243function is {\sl yylex}. Note that it returns values of type {\it token}.
Type {\it token} must be defined as an enumeration type which contains,
at a minimum, the literals {\it End\_of\_Input} and {\it Error}. It is up
to the user to make sure that this type is visible (see Section
\ref{alexayacc}). The general format of the output file which contains
this function is shown in Figure 3.
249
250\mysk
251The auxiliary packages include a DFA and an IO package.  The DFA
252package contains externally visible functions and variables from the
253scanner.  Many of the variables in this package should not be modified
254by normal user programs, but they are provided here to allow the user to
255modify the internal behavior of aflex to match specific needs.  Only
the functions YYText and YYLength will be needed by most programs. The
{\sl GNAT} port of {\sl aflex} generates the DFA and IO packages as child
packages of the base package. For portability and convenience the previously
used flat package names {\sl base}\_DFA and {\sl base}\_IO are generated
as renames of these child packages.
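
\mysk
As a rough illustration (the rule and the message are made up, and a
specification file named {\it base.l} is assumed so that the flat rename
{\sl base}\_DFA is available; the exact result types can be checked in the
generated DFA package), a rule action might use these functions as follows:

\small
\begin{verbatim}
[a-zA-Z_]+  { TEXT_IO.PUT_LINE ("matched " & Base_DFA.YYText
                                & " of length"
                                & INTEGER'IMAGE (Base_DFA.YYLength)); }
\end{verbatim}
\normalsize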
261
262\mysk
263The IO package contains
264routines which allow {\sl yylex} to scan the input source file.
265These include the unput, input, output, and yywrap functions
from {\it lex},
267plus Open\_Input, Create\_Output, Close\_Input and Close\_Output
268provided for compatibility with {\sl alex.}
269\mysk
270It is also possible to write your own IO and DFA packages.  Redefining
271input is possible by changing the YY\_INPUT procedure.  As an example
272you might wish to take input from an array instead of from a file.  By
273changing the calls to the TEXT\_IO routines to access elements of the
274array you can change the input strategy.  If you change the IO or DFA
275packages you should make a copy of the generated files under a
276different name and change that, because {\sl aflex} will overwrite
277them whenever you rerun {\sl aflex}.
278
279\newpage
280\small
281\begin{tabbing}
2821234\=1234\=1234\=1234\=1234\=1234 \kill
283
    \>\>    \>{\bf with} $<$rootname$>$.DFA; \\
    \>\>    \>{\bf with} $<$rootname$>$.IO; \\
286    \>\>    \>{\bf with} TEXT\_IO; \\
287\\
288    \>\>    \>\verb|--| User Specified Context\\
289\\
290    \>\>    \>    \>{\bf function} yylex {\bf return} Token {\bf is} \\
291    \>\>    \>    \>{\bf begin} \\
292    \>\>    \>    \>    \>\verb|--| Analysis of expressions \\
293    \>\>    \>    \>    \>\verb|--| Execution of user-defined actions \\
294    \>\>    \>    \>{\bf end} yylex; \\
295\\
296    \>\>  \>\verb|--| User Specified Context\\
297\end{tabbing}
298\centerline{Figure 3: Example of File Containing Lexical Analyzer}
299
300\mysk
301Before showing the general layout of the specification file, we will
302describe the specification language of {\sl aflex}, namely, regular expressions.
303
304
305\section{Regular Expressions}
306{\sl Aflex} distinguishes two types  of  character  sets  used  to
307define regular expressions: text characters and operator characters.
A regular expression specifies how a set of strings from the input
stream can be recognized.  It contains text characters  (which  match  the
310corresponding characters in  the  strings  being  compared)  and
311operator  characters  (which  specify  repetitions, choices, and
312other features).  The letters of the alphabet  and  the  digits  are
313always text characters.
314
315\mysk
316A rule specifies a  sequence of  characters  to  be  matched. It
317{\bf must} begin in column one.
318The set of {\sl aflex} operators consists of
319the following:
320
321\begin{verbatim}
322      " \ { } [ ] ^ $ < > ? . * + | ( ) /
323\end{verbatim}
324
325The meaning of each operator is summarized below:
326
327\begin{tabbing}
3281234\=1234\=1234\=1234\=1234\=1234 \kill
329  \>\verb|x|     \>\>\verb|--| the character ``x" \\
330  \>\verb|"x"|   \>\>\verb|--| an ``x", even if x is an operator. \\
331  \>\verb|\x|    \>\>\verb|--| an ``x", even if x is an operator. \\
332  \>\verb|^x|    \>\>\verb|--| an x at the beginning of a line. \\
333  \>\verb|x$|    \>\>\verb|--| an x at the end of line. \\
334  \>\verb|x+|    \>\>\verb|--| 1 or more instances of x. \\
335  \>\verb|x*|    \>\>\verb|--| 0 or more instances of x. \\
336  \>\verb|x?|    \>\>\verb|--| an optional x. \\
337  \>\verb|(x)|   \>\>\verb|--| an x. \\
338  \>\verb|.|     \>\>\verb|--| any character but newline. \\
339  \>\verb"x|y"   \>\>\verb|--| an x or y. \\
340  \>\verb|[xy]|  \>\>\verb|--| the character x or the character y. \\
341  \>\verb|[x-z]| \>\>\verb|--| the character x, y or z. \\
342  \>\verb|[^x]|  \>\>\verb|--| any character but x. \\
343  \>\verb|<y>x|  \>\>\verb|--| an x when {\sl aflex} is in start condition y. \\
344  \>\verb|{xx}|  \>\>\verb|--| the translation of xx from the definitions section. \\
345\end{tabbing}
346
347If any of these operators is used in a regular expression as a character
348literal, it must be either preceded by an escape character or surrounded by
349double quotes.  For example, to recognize a dollar sign \verb|$|, the correct
350expression is either \verb|\$| or \verb|"$"|.
Note that a double quote cannot itself be quoted and must therefore be escaped (\verb|\"|).
352
353\mysk
354A regular expression may {\bf not} contain any spaces
unless they are within a quoted string or character class
356or they are preceded by the \verb|"\"| operator.
357
358\mysk
359When in doubt, use parentheses.  When an {\sl aflex} operator needs to be
360embedded in a string, it is often neater to quote the entire string rather
than just the operator, e.g. the string \verb|"what?"| is more readable
than either \verb|what"?"| or \verb|what\?|.
363
364\small
365\begin{verbatim}
366Rules              Interpretations
367-----              ---------------
368a or "a"           The character a
369Begin or "Begin"   The string Begin
370\"Begin\"          The string "Begin"
371^\t or ^"\t"       The tab character \t at the beginning of line.
372\n$                The newline character \n at the end of line.
373\end{verbatim}
374\normalsize
375
376There are a few special characters which can be specified in a regular
377expression:
378\begin{tabbing}
3791234\=1234\=1234\=1234\=1234\=1234 \kill
380  \>\verb|\n|     \>\>\verb|--| newline \\
381  \>\verb|\b|     \>\>\verb|--| backspace \\
382  \>\verb|\t|     \>\>\verb|--| tab \\
383  \>\verb|\r|     \>\>\verb|--| carriage return \\
384  \>\verb|\f|     \>\>\verb|--| form feed \\
385  \>\verb|\ddd|   \>\>\verb|--| octal ASCII code \\
386\end{tabbing}
Here is the precedence of those operators above that have one.
388\begin{tabbing}
3891234\=1234\=1234\=1234\=1234\=1234 \kill
390  \>\verb|"  []  ()|            \>\>\>\>Highest \\
391  \>\verb|+  *  ?|		\>\>\>\>\hspace{0.5cm}$\vdots$ \\
392  \>\verb|concatenation|	\>\>\>\>\hspace{0.5cm}$\vdots$ \\
393  \>\verb"|"                    \>\>\>\>Lowest \\
394\end{tabbing}
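
For instance, since repetition binds more tightly than concatenation,
which in turn binds more tightly than \verb"|":

\small
\begin{verbatim}
Rules       Interpretations
-----       ---------------
ab|cd*      Equivalent to (ab)|(c(d*)): the string ab, or a c
            followed by zero or more d's -- not (ab|cd)*.
\end{verbatim}
\normalsize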
395
396\begin{description}
397  \item[Character Classes:]  Classes of characters can be specified using
398      the  operator pair {\bf []}.  Within these square brackets, the operator
399      meanings are ignored except for three special characters:  \verb|\|
400      and $-$ and \verb|^|.
401
402\small
403\begin{verbatim}
404Rules       Interpretations
405-----       ---------------
406[^abc]      Any character except a, b, or c.
407[abc]       The single character a, b, or c.
408[-+0-9]     The - or + sign or any digit from 0 to 9.
409[\t\n\b]    The tab, newline, or backspace character.
410\end{verbatim}
411\normalsize
412
  \item[Arbitrary and Optional Characters:]  The dot operator, ``$.$",
  matches all characters except newline.  The operator ? indicates an
  optional character in an expression.
416
417\small
418\begin{verbatim}
419Rules     Interpretations
420-----     ---------------
421ab?c      Matches either abc or ac.
422ab.c      Matches all strings of length 4 having a, b and
423          c as the first, second and fourth letter where the
424          third character is not a newline.
425\end{verbatim}
426\normalsize
427
428
429  \item[Repeated Expressions:]  Repetitions of classes are  indicated  by
430      the operators $*$ and $+$.
431
432\small
433\begin{verbatim}
434Rules                  Interpretations
435-----                  ---------------
436[a-z]+                 Matches all strings of lower case letters.
437[A-Za-z][A-Za-z0-9]*   Indicates all alphanumeric strings with a
438                       leading alphabetic character.
439\end{verbatim}
440\normalsize
441
442
443  \item[Alternation and Grouping:]  The operator \verb"|" indicates  alternation
444      and parentheses are used for grouping complex expressions.
445
446\small
447\begin{verbatim}
448Rules           Interpretations
449-----           ---------------
450ab|cd           Matches either ab or cd.
451(ab|cd+)?(ef)*  Matches such strings as abefef, efefef, cdef,
452                or cddd; but not abc, abcd, or abcdef.
453\end{verbatim}
454\normalsize
455
456
457  \item[Context Sensitivity:]  {\sl aflex} will recognize a small amount of
458      surrounding context.  Two simple operators for this are \verb|^| and
459      \$.  If the first character of an expression is \verb|^|, the expression
460      will  only  be  matched at the beginning of a line.  If the very
461      last character is \$, the expression will only be matched at  the
462      end  of  a line.
463
464\small
465\begin{verbatim}
466Rules          Interpretations
467-----          ---------------
468^ab            Matches ab at the beginning of line.
469ab$            Matches ab at the end of line.
470\end{verbatim}
471\normalsize
472
473
474  \item[Definitions:]  The operators  \{ \}  enclosing a name
475      specify a macro definition expansion.
476
477\small
478\begin{verbatim}
479Rules          Interpretations
480-----          ---------------
481{INTEGER}      If INTEGER is defined in the macro definition
482               section, then it will be expanded here.
483\end{verbatim}
484\normalsize
485\end{description}
486
487
488\subsection{Predefined Variables \& Routines}
489\label {routines}
490Once a token is matched, the textual string representation of the token
may be obtained by a call to the function {\sl yytext}, which is located
in the {\sl DFA} package.  This function returns a value of type String.
493
494\mysk
495The IO package contains
496routines which allow {\sl yylex} to scan the input source file.
497These include the input, output, unput and yywrap functions
from {\it lex}, plus Open\_Input, Create\_Output, Close\_Input and Close\_Output
499provided for compatibility with {\sl alex.}  Note that in {\sl alex
5001.0} it was mandatory to call the {\it Open\_Input} and {\it
501Create\_Output} routines before calling {\it YYLex.}  This is not
502required in {\sl Aflex.}  The default input and output are attached to
503the files that Ada considers to be the {\sc standard\_input} and
{\sc standard\_output}.
505
506The following routines must be used in lieu of the normal {\sc
507text\_io} routines because of internal buffering and read-ahead done by
508{\sl aflex.}
509
510\begin{description}
511\item[input] function input return character -- inputs a character from the
512current {\sl aflex} input stream.
513\item[unput] procedure unput(c : character) -- returns a character
514already read by input to the input stream.  Note that attempting to
push back more than one character at a time can cause {\sl aflex} to
raise the exception {\sc pushback\_overflow}.
517\item[output] procedure output(c : character) -- outputs a character to the
518current {\sl aflex} output stream.
519\item[yywrap] function yywrap return boolean -- This function is
520called when {\sl aflex} reaches the end of file.  If {\it yywrap}
521returns true, {\sl aflex} continues with normal wrapup at end of
input.  If you wish to arrange for more input to arrive from a new
source, then you provide a yywrap which returns false.  The default
yywrap returns true.
525\item[Open\_Input] Open\_Input(fname : in String) -- Uses the file named
526fname as the source for input to {\it YYLex.}  If this function is not
527called then the default input is the Ada {\sc standard\_input.}
\item[Create\_Output] Create\_Output(fname : in String) -- Uses the file named
529fname as output for {\it YYLex.}  If this function is not
530called then the default output is the Ada {\sc standard\_output}.
531\item[Close\_Input and Close\_Output]  These functions have null
532bodies in {\sl aflex} and are provided only for compatibility with
533{\sl alex.}
534\end{description}
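
\mysk
For example, to continue scanning from a second source once the first is
exhausted, one supplies a yywrap that opens the new source and returns
false.  The following is only a sketch (the file name and the flag
variable are made up, and, as noted earlier, supplying your own yywrap
means editing a copy of the generated IO package):

\small
\begin{verbatim}
Second_File_Read : BOOLEAN := FALSE;  -- flag kept in the copied IO package

function YYWrap return BOOLEAN is
begin
  if not Second_File_Read then
    Second_File_Read := TRUE;
    Open_Input ("more_input.txt");    -- switch to the next input source
    return FALSE;                     -- tell the scanner to keep going
  else
    return TRUE;                      -- normal wrapup at end of input
  end if;
end YYWrap;
\end{verbatim}
\normalsize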
535
536\mysk
537There are a few predefined subroutines that may be used once a token
538is matched.  In many lexical processing applications, the printing of
the string returned by {\sl yytext}, i.e. {\tt put(yytext)}, is desired,
and this action is so common that it may be written as {\tt ECHO}.
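
For instance, a catch-all rule that copies every otherwise unmatched
character, including newlines, straight through to the output might be
written as follows (this assumes {\tt ECHO} is invoked as a parameterless
procedure call):

\small
\begin{verbatim}
.|\n        { ECHO; }
\end{verbatim}
\normalsize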
541
542\newpage
543\section{{\sl Aflex} Source Specification}
544\label {specformat}
545
546The general format of the source file is
547
548\small
549\begin{verbatim}
550    definitions section
551    %%
552    rules section
553    %%
554    user defined section copied before package statement of SPEC file
555    ##
556    user defined section copied after package statement of SPEC file
557    ##
558    user defined section copied before package statement of BODY file
559    ##
560    user defined section copied after  package statement of BODY file
561    but before YYLex
562    ##  --  here goes YYLex
563    section copied after YYLex and before end of BODY package
564\end{verbatim}
565\normalsize
566
where \verb|%%| is used as a delimiter between sections, and \verb|##|
separates the user supplied sections and marks where the function {\sl yylex}
will be placed.  Both \verb|%%|
and \verb|##| {\it must} occur in column one.
571
572\mysk
573The definitions section is used to define macros which appear in the rules
574section and also to define start conditions.  The rules section defines the
575regular expressions with their corresponding actions.  These regular
576expressions, in turn, define the tokens to be identified by the scanner.
The user defined sections allow the user to define the context in which the
578{\sl yylex} function will be located.  The user can include routines which
579may be executed when a certain token or condition is recognized.
580
581
582\subsection{Definitions Section}
583
584The definitions section may contain both macro definitions and
585start condition definitions.  Macro and start condition definitions
586must begin in column one and may be interspersed.
587
588\subsubsection{Macros}
589Macro definitions take the form:
590
591\small
592\begin{verbatim}
593    name   expression
594\end{verbatim}
595\normalsize
596
where {\tt name} must begin with a letter and contain only letters,
digits and underscores, and {\tt expression} is the string of characters
that will be textually substituted for {\tt name} wherever it appears in
the rules section.  At least one space must separate {\tt name}
from {\tt expression} in the definition.  No syntax checking is done on
the expression; instead, the whole rule is parsed after expansion.
603The macro facility is very useful in writing regular expressions which
604have common substrings, and in defining often-used ranges like {\it digit}
605and {\it letter}.
606Perhaps its best advantage is to give a mnemonic name to a rather strange
607regular expression -- making it easier for the programmer to debug the
608expressions.  These macros, once defined, can be used in the
609regular expression by surrounding them with \{ and \}, e.g., \verb|{DIGIT}|.
For example, the rules
611
612\small
613\begin{verbatim}
614[a-zA-Z]([0-9a-zA-Z])*   {put_line ("Found an identifier");}
615[0-9]+                   {put_line ("Found a number");}
616\end{verbatim}
617\normalsize
618
define identifiers and integer numbers.  With macros, the source file becomes
620
621\small
622\begin{verbatim}
623LETTER [a-zA-Z]
624DIGIT  [0-9]
625%%
626{LETTER}({DIGIT}|{LETTER})*   {put_line ("Found an identifier");}
627{DIGIT}+                      {put_line ("Found a number");}
628\end{verbatim}
629\normalsize
630
631\mysk
632It is customary, although not necessary, to use all capital letters
633for macro names. This allows macros to be easily identified in complex rules.
634Macro names are case sensitive, e.g., \verb|{DIGIT}| and \verb|{Digit}| are
635two different macro names.
636
637\subsubsection{Start Conditions}
638Left context is handled in {\sl aflex} by start conditions that are defined
639in the macro definition section.  Start conditions are declared as follows,
640
641\begin{verbatim}
642      %Start cond1 cond2 ...
643\end{verbatim}
644
645where cond1 and cond2 indicate start conditions.
646Note that \%Start may be abbreviated as \%S or \%s.
647
648\mysk
A condition is set only when the {\sl aflex} command {\tt ENTER} in the
action part is executed, e.g. {\tt ENTER(cond1);}. Thus an expression
of the form \verb|<condition>rule| will only be matched
when {\tt condition} is set.  Note that {\sl aflex} uses {\tt ENTER}
instead of {\tt BEGIN}, which is used in {\it lex.}  This is done
because {\tt BEGIN} is a keyword in Ada.  The {\tt ENTER} command must
have parentheses surrounding its argument:
656\begin{verbatim}
657      ENTER(cond1);
658\end{verbatim}
659
660{\sl Aflex} also provides {\it exclusive start conditions.}  These are
661similar to normal start conditions except they have the property that
662when they are active no other rules are active.  Exclusive start
663conditions are declared and used like normal start conditions except
664that the declaration is done with \%x instead of \%s.
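
\mysk
As a small, purely illustrative sketch (the condition name, the keyword
and the messages are all made up), a scanner that treats numbers
differently after seeing a mode-switching keyword could be written as:

\small
\begin{verbatim}
%Start HEXMODE
%%
"hex"               { ENTER(HEXMODE); }
<HEXMODE>[0-9a-f]+  { TEXT_IO.PUT_LINE("hexadecimal number"); }
[0-9]+              { TEXT_IO.PUT_LINE("decimal number"); }
\end{verbatim}
\normalsize

Because HEXMODE is a normal (inclusive) start condition, the unprefixed
rules remain active while it is set; rule order then decides which action
runs for input that matches both (see the section on ambiguous source
rules).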
665
666\subsection{Rules Section}
667
Contained in the rules section are regular expressions which define the
669format of each token to be recognized by the scanner.
670Each rule has the following format:
671
672\begin{verbatim}
673pattern  {action}
674\end{verbatim}
675
676where {\tt pattern} is a regular expression and {\tt action} is an Ada
677code fragment enclosed between \{ and \}.  A {\tt pattern} must
678always begin in column one.
679
680\mysk
681While a pattern defines the format of the token, the action portion
682defines
683the operation to be performed by the scanner each time the corresponding
684token is recognized.  Therefore, the user must provide a syntactically
685correct Ada code fragment.  {\sl aflex} does not check for the validity of the
686program portion, but rather copies it to the output package and leaves it to
687the Ada compiler to detect syntax and semantics errors.  There can be more
688than one Ada statement in the code fragment.  For example, the rule
689
690\small
691\begin{verbatim}
692%%
693begin|BEGIN     {copy (yytext, buffer);
694                 Install (yytext,symbol_table);
695                 return RESERVED;}
696\end{verbatim}
697\normalsize
698
699recognizes the reserved word ``begin" or ``BEGIN", copies the
token string into the buffer, inserts it in the symbol table, and returns
the value RESERVED.
702
703Note that the user must provide the procedures
704{\tt copy} and {\tt install} along with all necessary types and variables
705in the user defined sections.
706
707\subsection{User Defined Sections}
The user defined sections allow the user to specify the context surrounding
the {\sl yylex} function.  \verb|##| is used to separate the various parts
of user defined code and to mark where the {\sl yylex} function should be placed.
Each \verb|##| must occur in column one.
Any text following \verb|##| on the same line is ignored. This method of
using multiple user defined sections that go to specific places in the
generated .ads and .adb files is specific to the {\sl GNAT} port of
{\sl aflex}.
716
717\section{Ambiguous Source Rules}
718When a set of regular expressions is ambiguous, {\sl aflex} uses the
719following rules to choose among the regular expressions that match
720the input.
721\begin{enumerate}
722    \item The longest string is matched.
723    \item If the strings are of the same length, the rule given
724	  {\bf first} is matched.
725\end{enumerate}
726
For example, if the input \verb|"aabb"| matches both \verb|"a*"| and
\verb|"aab*"|, the action associated with \verb|"aab*"| is executed
because it matches four characters as opposed to two.
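
\mysk
The second rule matters when two patterns match strings of the same
length.  In the following hypothetical fragment (the token names are
illustrative), the input ``begin" matches both patterns with length five,
so the first rule wins and BEGIN\_T is returned rather than IDENTIFIER:

\small
\begin{verbatim}
begin       { return BEGIN_T; }
[a-z]+      { return IDENTIFIER; }
\end{verbatim}
\normalsize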
730
731\section{{\sl Aflex} and {\sl Ayacc}}
732\label{alexayacc}
733As briefly mentioned in Section \ref{intro}, {\sl aflex} can be integrated with
734{\sl ayacc} to produce a parser.
735
736\mysk
737Since the parser generated by {\sl ayacc} expects a value of type {\it token},
738each {\sl aflex} rule should end with
739
740\begin{verbatim}
741     return (token_val);
742\end{verbatim}
743
744to return the appropriate token value.  {\sl Ayacc} creates a package
defining this token type from its specification file; this package in turn
should be {\it with}'ed at the beginning of the user defined section.
747Thus, this token package must be compiled before the lexical analyzer.
748The user is encouraged to read the Ayacc User Manual \cite{ayacc} for
749more information on the interaction between {\sl aflex} and {\sl ayacc}.
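
\mysk
A hedged sketch of such a rules section follows.  The token literals and
the name of the ayacc-generated package ({\tt Calc\_Tokens}) are
illustrative only; the actual package name depends on the grammar file
(see the Ayacc User Manual):

\small
\begin{verbatim}
%%
            -- NUMBER, PLUS and SEMICOLON are literals of the
            -- ayacc-defined enumeration type Token
[0-9]+      { return NUMBER; }
"+"         { return PLUS; }
";"         { return SEMICOLON; }
[ \t]+      { null; }
%%
-- in one of the user defined sections, make the token type visible:
with Calc_Tokens;  use Calc_Tokens;
\end{verbatim}
\normalsize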
750
751
752\newpage
753\section{Appendix A: A Detailed Example}
754
This section shows a complete {\sl aflex} specification file for translating
lowercase words to uppercase.  The following file,
757{\it example.l}, defines rules for recognizing lowercase and uppercase words.
758If a word is in lowercase, the scanner converts it to uppercase.
759In addition, the frequencies of lower and uppercase words
760are retained in the two variables defined in the global section.
761All other characters (spaces, tabs, punctuation) remain the same.
762
763\small
764\begin{verbatim}
765LOWER    [a-z]
766UPPER    [A-Z]
767
768%%
769
770{LOWER}+       { Lower_Case := Lower_Case + 1;
771                 TEXT_IO.PUT(To_Upper_Case(Example_DFA.YYText)); }
772
773		  -- convert all alphabetic words in lower case
774		  -- to upper case
775
776{UPPER}+       { Upper_Case := Upper_Case + 1;
777                 TEXT_IO.PUT(Example_DFA.YYText); }
778
779		  -- write uppercase word as is
780
781\n             { TEXT_IO.NEW_LINE;}
782
783.              { TEXT_IO.PUT(Example_DFA.YYText); }
784                 -- write anything else as is
785
786%% -- The next section will go to example.ads before the package statement
787with Ada.Command_Line;
788## -- The next section will go to example.ads after  the package statement
789procedure Sample;
790## -- The next section will go to example.adb before the package statement
791
792## -- The next section will go to example.adb after  the package statement
793procedure Sample is
794
795  type Token is (End_of_Input, Error);
796
797  Tok        : Token;
798  Lower_Case : NATURAL := 0;   -- frequency of lower case words
799  Upper_Case : NATURAL := 0;   -- frequency of upper case words
800
801  function To_Upper_Case (Word : STRING) return STRING is
802  Temp : STRING(1..Word'LENGTH);
803  begin
804    for i in 1.. Word'LENGTH loop
805      Temp(i) := CHARACTER'VAL(CHARACTER'POS(Word(i)) - 32);
806    end loop;
807    return Temp;
808  end To_Upper_Case;
##  -- function YYLex will go here, the following lines after YYLex
810begin  -- Sample
811
812  Example_IO.Open_Input     (Ada.Command_Line.Argument (1));
813
814  Read_Input :
815  loop
816    Tok := YYLex;
817    exit Read_Input
818      when Tok = End_of_Input;
819  end loop Read_Input;
820
821  TEXT_IO.NEW_LINE;
822  TEXT_IO.PUT_LINE("Number of lowercase words is => " &
823		   INTEGER'IMAGE(Lower_Case));
824  TEXT_IO.PUT_LINE("Number of uppercase words is => " &
825		   INTEGER'IMAGE(Upper_Case));
826end Sample;
827\end{verbatim}
828\normalsize
829
830This source file is run through {\sl aflex} using the command
831
832\small
833\begin{verbatim}
834% aflex example.l
835\end{verbatim}
836\normalsize
837
{\sl aflex} produces output files called {\it example.ads} and {\it example.adb}
along with the DFA and IO packages in {\it example-dfa.ads}, {\it example-dfa.adb},
{\it example-io.ads} and {\it example-io.adb}.
Assuming that the main procedure, {\sl Sample}, is used to construct
an executable called {\it sample}, the Unix command
843
844\small
845\begin{verbatim}
846% sample example.l
847\end{verbatim}
848\normalsize
849
prints the file {\it example.l} to the screen with all lowercase words
converted to uppercase, i.e. the output to the screen is
852
853\newpage
854\small
855\begin{verbatim}
856LOWER    [A-Z]
857UPPER    [A-Z]
858
859%%
860
861{LOWER}+       { LOWER_CASE := LOWER_CASE + 1;
862                 TEXT_IO.PUT(TO_UPPER_CASE(EXAMPLE_DFA.YYTEXT)); }
863
864		  -- CONVERT ALL ALPHABETIC WORDS IN LOWER CASE
865		  -- TO UPPER CASE
866
867{UPPER}+       { UPPER_CASE := UPPER_CASE + 1;
868                 TEXT_IO.PUT(EXAMPLE_DFA.YYTEXT); }
869
870		  -- WRITE UPPERCASE WORD AS IS
871
872\N             { TEXT_IO.NEW_LINE;}
873
874.              { TEXT_IO.PUT(EXAMPLE_DFA.YYTEXT); }
875                 -- WRITE ANYTHING ELSE AS IS
876
877%% -- THE NEXT SECTION WILL GO TO EXAMPLE.ADS BEFORE THE PACKAGE STATEMENT
878WITH ADA.COMMAND_LINE;
879## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADS AFTER  THE PACKAGE STATEMENT
880PROCEDURE SAMPLE;
881## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADB BEFORE THE PACKAGE STATEMENT
882
883## -- THE NEXT SECTION WILL GO TO EXAMPLE.ADB AFTER  THE PACKAGE STATEMENT
884PROCEDURE SAMPLE IS
885
886  TYPE TOKEN IS (END_OF_INPUT, ERROR);
887
888  TOK        : TOKEN;
889  LOWER_CASE : NATURAL := 0;   -- FREQUENCY OF LOWER CASE WORDS
890  UPPER_CASE : NATURAL := 0;   -- FREQUENCY OF UPPER CASE WORDS
891
892  FUNCTION TO_UPPER_CASE (WORD : STRING) RETURN STRING IS
893  TEMP : STRING(1..WORD'LENGTH);
894  BEGIN
895    FOR I IN 1.. WORD'LENGTH LOOP
896      TEMP(I) := CHARACTER'VAL(CHARACTER'POS(WORD(I)) - 32);
897    END LOOP;
898    RETURN TEMP;
899  END TO_UPPER_CASE;
##  -- FUNCTION YYLEX WILL GO HERE, THE FOLLOWING LINES AFTER YYLEX
901BEGIN  -- SAMPLE
902
903  EXAMPLE_IO.OPEN_INPUT     (ADA.COMMAND_LINE.ARGUMENT (1));
904
905  READ_INPUT :
906  LOOP
907    TOK := YYLEX;
908    EXIT READ_INPUT
909      WHEN TOK = END_OF_INPUT;
910  END LOOP READ_INPUT;
911
912  TEXT_IO.NEW_LINE;
913  TEXT_IO.PUT_LINE("NUMBER OF LOWERCASE WORDS IS => " &
914		   INTEGER'IMAGE(LOWER_CASE));
915  TEXT_IO.PUT_LINE("NUMBER OF UPPERCASE WORDS IS => " &
916		   INTEGER'IMAGE(UPPER_CASE));
917END SAMPLE;
918
919Number of lowercase words is =>  199
920Number of uppercase words is =>  127
921\end{verbatim}
922\normalsize
923
924
925\newpage
926\section{Appendix B: {\sl Aflex} Dependencies}
927
This release of {\sl aflex} was successfully compiled by the Ada for Linux
Team (ALT) with GNAT-3.11p running under Linux 2.2.x and glibc-2.0.
930
931\subsection{Command Line Interface}
932The following files are host dependent :
933\begin{tabbing}
9341234\=1234\=1234\=1234\=1234\=1234 \kill
935    \>    \>{\sl command\_lineS.a}\\
936    \>    \>{\sl command\_lineB.a}\\
937    \>    \>{\sl file\_managerS.a}\\
938    \>    \>{\sl file\_managerB.a}\\
939\end{tabbing}
940The command\_line package function {\sc initialize\_command\_line}
941breaks up the command line into a vector containing
942the arguments passed to the program.  Note that modifications may need
943to be made to this file if the host system doesn't allow differentiation
944of upper and lower case on the command line.
945\mysk
946The external\_file\_manager package is host dependent in that it chooses
947the names and suffixes for the generated files.  It also sets up the
948file\_type {\sc standard\_error} to allow error output to appear on the
949screen.
950
951\mysk
952If {\sl aflex} is to be rehosted, only these files should need modification.
953For more detailed information see the file PORTING in the {\sl aflex}
954distribution.
955\newpage
956\section{Appendix C: Differences between {\sl Aflex} and {\sl Lex}}
957\label{lexdiff}
958
959Although {\sl aflex} supports most of the
960conventions and features of {\sl lex}, there are some differences
961that the user should be aware of in order to port a {\sl lex} specification
962to an {\sl aflex} specification.
963
964\begin{itemize}
965 \item Source file's format:
966   \small
967   \begin{verbatim}
968       definitions section
969       %%
970       rules section
971       %%
972       user defined section
973       ##
974       user defined section
975       ##
976       user defined section
977       ##
978       user defined section
979       ##
980       user defined section
981   \end{verbatim}
982   \normalsize
983
984
 \item Although {\sl aflex} supports most of {\sl lex}'s constructs, it does not
	implement the following features of {\sl lex}:
987\begin{tabbing}
9881234\=1234\=1234\=1234\=1234\=1234 \kill
989   \>-- REJECT \\
990   \>-- \%x \>\>\>---  changes to the internal array sizes, but see below.
991 \end{tabbing}
992
993 \item  Ada style comments are supported instead of C style comments.
994
995 \item  All template files are internalized.
996
997 \item  The input source file name must end with a ``.l" extension.
998
999 \item	In start conditions ENTER is used instead of BEGIN.  This is
1000	done because BEGIN is a keyword in Ada.
1001\end{itemize}
1002
1003\section{Appendix D: Differences between {\sl Aflex} and {\sl Alex}}
1004\label{alexdiff}
1005While {\sl aflex} is intended to be upwardly compatible with {\sl
1006Alex}, there are a few minor differences.  Any major inconsistencies
1007with {\sl alex} should be considered bugs and reported.
1008\begin{itemize}
1009 \item  The {\tt ENTER} calls must have parentheses around their
1010arguments.  Parentheses were optional in {\sl alex.}
1011
1012 \item  It is no longer mandatory to call Open\_Input and Create\_Output
1013before calling YYLex.  Previously if output was to be directed to
1014Standard\_Output it was recommended that a call of
1015\begin{verbatim}
1016Create_Output("/dev/tty");
1017\end{verbatim}
be made.  This will still work, but because of differences in
implementation it may cause difficulties in redirecting output using
{\sc unix} shell pipes and redirection.  Instead, simply do not call
Create\_Output and output will go to the default {\sc standard\_output}.
1022
1023 \item	Compilation order.  With GNAT the compilation order of the
1024generated modules doesn't matter.
1025\end{itemize}
1026
1027\newpage
1028\section{Appendix E: Known Bugs and Limitations}
1029\begin{itemize}
1030
1031\item Some trailing context
1032patterns cannot be properly matched and generate
1033warning messages ("Dangerous trailing context").  These are
1034patterns where the ending of the
1035first part of the rule matches the beginning of the second
1036part, such as "zx*/xy*", where the 'x*' matches the 'x' at
1037the beginning of the trailing context.  (Lex doesn't get these
1038patterns right either.)
1039
1040\item {\it variable}
1041trailing context (where both the leading and trailing parts do not have
1042a fixed length) entails a substantial performance loss.
1043
1044\item For some trailing context rules, parts which are actually
fixed-length are not recognized as such, leading to the above-mentioned
performance loss.  In particular, parts using \verb:|: or \verb|{n}| are always
considered variable-length.
1048
1049\item Nulls are not allowed in aflex inputs or in the inputs to
1050scanners generated by aflex.  Their presence generates fatal
1051errors.
1052
\item Pushing back definitions enclosed in ()'s can result in nasty,
1054difficult-to-understand problems like:
1055\begin{verbatim}
1056
1057	{DIG}  [0-9] -- a digit
1058
1059\end{verbatim}
Here the pushed-back text is \verb|"([0-9] -- a digit)"|.
1061
\item Due to both buffering of input and read-ahead, you cannot intermix
calls to {\sc text\_io} routines, such as
{\bf text\_io.get,}
with aflex rules and expect it to work.  Call
{\bf input}
instead (see the sketch after this list).
1068
1069\item There are still more features that could be
implemented (especially REJECT).  Also the speed of the compressed
1071scanners could be improved.
1072
1073\item The utility needs more complete documentation, especially more
1074information on modifying the internals.
1075\end{itemize}
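
\mysk
As a hedged sketch of the read-ahead point above (the rule, the token
name END\_T, and the assumption that the specification file is {\it base.l},
making the IO package visible as Base\_IO, are all illustrative):

\small
\begin{verbatim}
"end"    { declare
             C : CHARACTER := Base_IO.Input;  -- peek at the next character
           begin
             if C /= ';' then
               Base_IO.Unput (C);             -- give it back to the scanner
             end if;
             return END_T;
           end; }
\end{verbatim}
\normalsize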
1076
1077\newpage
1078\bibliographystyle{alpha}
1079\bibliography{aflex_user_man}
1080\end{document}
1081