1
2\documentstyle[11pt]{article}
3
4\title{TP Lex and Yacc -- The Compiler Writer's Tools for Turbo Pascal\\
5       Version 4.1 User Manual}
6
7\author{Albert Gr\"af\\
8        Department of Musicinformatics\\
9        Johannes Gutenberg-University Mainz\\
10        \\
11        ag@muwiinfa.geschichte.uni-mainz.de}
12
13\date{April 1998}
14
15\setlength{\topmargin}{0cm}
16\setlength{\oddsidemargin}{0cm}
17\setlength{\evensidemargin}{0cm}
18\setlength{\textwidth}{16cm}
19\setlength{\textheight}{21cm}
20%\setlength{\parindent}{0pt}
21\parskip=4pt plus 1pt minus 1pt
22\itemsep=0pt
23\renewcommand{\baselinestretch}{1.1}
24\unitlength=1mm
25%\tolerance=500
26%\parskip=0.1cm
27\leftmargini 1.5em
28\leftmarginii 1.5em \leftmarginiii 1.5em \leftmarginiv 1.5em \leftmarginv 1.5em
29\leftmarginvi 1.5em
30
31\begin{document}
32
33\maketitle
34
35\tableofcontents
36
37\section{Introduction}
38
39This document describes the TP Lex and Yacc compiler generator toolset.
40These tools are designed especially to help you prepare compilers and
41similar programs like text processing utilities and command language
42interpreters with the Turbo Pascal (TM) programming language.
43
44TP Lex and Yacc are Turbo Pascal adaptions of the well-known UNIX (TM)
45utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson
46at Bell Laboratories, and are used with the C programming language. TP Lex
47and Yacc are intended to be approximately ``compatible'' with these programs.
48However, they are an independent development of the author, based on the
49techniques described in the famous ``dragon book'' of Aho, Sethi and Ullman
50(Aho, Sethi, Ullman: {\em Compilers : principles, techniques and tools,\/}
51Reading (Mass.), Addison-Wesley, 1986).
52
53Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland
54Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo
55Pascal-compatible compiler which currently runs on DOS and Linux (other ports
56are under development). Recent information about TP Lex/Yacc, and the sources
57are available from the TPLY homepage:
58\begin{quote}\begin{verbatim}
59   http://www.musikwissenschaft.uni-mainz.de/~ag/tply
60\end{verbatim}\end{quote}
61For information about the Free Pascal Compiler, please refer to:
62\begin{quote}\begin{verbatim}
63   http://www.freepascal.org
64\end{verbatim}\end{quote}
65
66TP Lex and Yacc, like any other tools of this kind, are not intended for
67novices or casual programmers; they require extensive programming experience
68as well as a thorough understanding of the principles of parser design and
69implementation to be put to work successfully. But if you are a seasoned
70Turbo Pascal programmer with some background in compiler design and formal
71language theory, you will almost certainly find TP Lex and Yacc to be a
72powerful extension of your Turbo Pascal toolset.
73
74This manual tells you how to get started with the TP Lex and Yacc programs
75and provides a short description of these programs. Some knowledge about
76the C versions of Lex and Yacc will be useful, although not strictly
77necessary. For further reading, you may also refer to:
78
79\begin{itemize}
80   \item
81      Aho, Sethi and Ullman: {\em Compilers : principles, techniques and
82      tools.\/} Reading (Mass.), Addison-Wesley, 1986.
83   \item
84      Johnson, S.C.: {\em Yacc -- yet another compiler-compiler.\/} CSTR-32,
85      Bell Telephone Laboratories, 1974.
86   \item
87      Lesk, M.E.: {\em Lex -- a lexical analyser generator.\/} CSTR-39, Bell
88      Telephone Laboratories, 1975.
89   \item
90      Schreiner, Friedman: {\em Introduction to compiler construction with
91      UNIX.\/} Prentice-Hall, 1985.
92   \item
93      The Unix Programmer's Manual, Sections `Lex' and `Yacc'.
94\end{itemize}
95
96
97\subsection*{Credits}
98
99I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex
100and Yacc to take advantage of the large memory models in Borland Pascal 7.0
101and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be),
102the maintainer of the Linux version of the Free Pascal compiler, who is
103responsible for the Free Pascal port. And of course thanks are due to the many
104TP Lex/Yacc users all over the world for their support and comments which
105helped to improve these programs.
106
107
108\subsection*{Getting Started}
109
110Instructions on how to compile and install TP Lex and Yacc on all supported
111platforms can be found in the \verb"README" file contained in the
112distribution.
113
114Once you have installed TP Lex and Yacc on your system, you can compile your
115first TP Lex and Yacc program \verb"expr". \verb"Expr" is a simple desktop
116calculator program contained in the distribution, which consists of a lexical
117analyzer in the TP Lex source file \verb"exprlex.l" and the parser and main
118program in the TP Yacc source file \verb"expr.y". To compile these programs,
119issue the commands
120\begin{quote}\begin{verbatim}
121   lex exprlex
122   yacc expr
123\end{verbatim}\end{quote}
124That's it! You now have the Turbo Pascal sources (\verb"exprlex.pas" and
125\verb"expr.pas") for the \verb"expr" program. Use the Turbo Pascal
126compiler to compile these programs as follows:
127\begin{quote}\begin{verbatim}
128   tpc expr
129\end{verbatim}\end{quote}
130
131(Of course, the precise compilation command depends on the type of compiler
132you are using. Thus you may have to replace \verb"tpc" with \verb"bpc",
133\verb"dcc" or \verb"dcc32", depending on the version of the
134Turbo/Borland/Delphi compiler you have, and with \verb"ppc386" for the Free
135Pascal compiler. If you are using TP Lex and Yacc with Free Pascal under
136Linux, the corresponding commands are:
137\begin{quote}\begin{verbatim}
138   plex exprlex
139   pyacc expr
140   ppc386 expr
141\end{verbatim}\end{quote}
142Note that in the Linux version, the programs are named \verb"plex" and
143\verb"pyacc" to avoid name clashes with the corresponding UNIX utilities.)
144
145Having compiled \verb"expr.pas", you can execute the \verb"expr" program and
146type some expressions to see it work (terminate the program with an empty
147line).  There is a number of other sample TP Lex and Yacc programs (\verb".l"
148and \verb".y" files) in the distribution, including a TP Yacc cross reference
149utility and a complete parser for Standard Pascal.
150
151The TP Lex and Yacc programs recognize some options which may be specified
152anywhere on the command line. E.g.,
153\begin{quote}\begin{verbatim}
154   lex -o exprlex
155\end{verbatim}\end{quote}
156runs TP Lex with ``DFA optimization'' and
157\begin{quote}\begin{verbatim}
158   yacc -v expr
159\end{verbatim}\end{quote}
160runs TP Yacc in ``verbose'' mode (TP Yacc generates a readable description
161of the generated parser).
162
163The TP Lex and Yacc programs use the following default filename extensions:
164\begin{itemize}
165   \item \verb".l": TP Lex input files
166   \item \verb".y": TP Yacc input files
167   \item \verb".pas": TP Lex and Yacc output files
168\end{itemize}
169As usual, you may overwrite default filename extensions by explicitly
170specifying suffixes.
171
172If you ever forget how to run TP Lex and Yacc, you can issue the command
173\verb"lex" or \verb"yacc" (resp.\ \verb"plex" or \verb"pyacc")
174without arguments to get a short summary of the command line syntax.
175
176\section{TP Lex}
177
178This section describes the TP Lex lexical analyzer generator.
179
180\subsection{Usage}
181
182\begin{quote}\begin{verbatim}
183lex [options] lex-file[.l] [output-file[.pas]]
184\end{verbatim}\end{quote}
185
186\subsection{Options}
187
188\begin{description}
189   \item[\verb"-v"]
190      ``Verbose:'' Lex generates a readable description of the generated
191      lexical analyzer, written to lex-file with new extension \verb".lst".
192   \item[\verb"-o"]
193      ``Optimize:'' Lex optimizes DFA tables to produce a minimal DFA.
194\end{description}
195
196\subsection{Description}
197
198TP Lex is a program generator that is used to generate the Turbo Pascal
199source code for a lexical analyzer subroutine from the specification
200of an input language by a regular expression grammar.
201
202TP Lex parses the source grammar contained in \verb"lex-file" (with default
203suffix \verb".l") and writes the constructed lexical analyzer subroutine
204to the specified \verb"output-file" (with default suffix \verb".pas"); if no
205output file is specified, output goes to \verb"lex-file" with new suffix
206\verb".pas." If any errors are found during compilation, error messages are
207written to the list file (\verb"lex-file" with new suffix \verb".lst").
208
209The generated output file contains a lexical analyzer routine, \verb"yylex",
210implemented as:
211\begin{quote}\begin{verbatim}
212   function yylex : Integer;
213\end{verbatim}\end{quote}
214
215This routine has to be called by your main program to execute the lexical
216analyzer. The return value of the \verb"yylex" routine usually denotes the
217number of a token recognized by the lexical analyzer (see the \verb"return"
218routine in the \verb"LexLib" unit). At end-of-file the \verb"yylex" routine
219normally returns \verb"0".
220
221The code template for the \verb"yylex" routine may be found in the
222\verb"yylex.cod" file. This file is needed by TP Lex when it constructs the
223output file. It must be present either in the current directory or in the
224directory from which TP Lex was executed (TP Lex searches these directories in
225the indicated order). (NB: For the Linux/Free Pascal version, the code
226template is searched in some directory defined at compile-time instead of the
227execution path, usually /usr/lib/fpc/lexyacc.)
228
229The TP Lex library (\verb"LexLib") unit is required by programs using
230Lex-generated lexical analyzers; you will therefore have to put an appropriate
231\verb"uses" clause into your program or unit that contains the lexical
232analyzer routine. The \verb"LexLib" unit also provides various useful utility
233routines; see the file \verb"lexlib.pas" for further information.
234
235\subsection{Lex Source}
236
237A TP Lex program consists of three sections separated with the \verb"%%"
238delimiter:
239
240\begin{quote}\begin{verbatim}
241definitions
242%%
243rules
244%%
245auxiliary procedures
246\end{verbatim}\end{quote}
247
248All sections may be empty. The TP Lex language is line-oriented; definitions
249and rules are separated by line breaks. There is no special notation for
250comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
251fragments (see below).
252
253The definitions section may contain the following elements:
254\begin{itemize}
255   \item
256      regular definitions in the format:
257      \begin{quote}\begin{verbatim}
258   name   substitution
259      \end{verbatim}\end{quote}
260      which serve to abbreviate common subexpressions. The \verb"{name}"
261      notation causes the corresponding substitution from the definitions
262      section to be inserted into a regular expression. The name must be
263      a legal identifier (letter followed by a sequence of letters and digits;
264      the underscore counts as a letter; upper- and lowercase are distinct).
265      Regular definitions must be non-recursive.
266   \item
267      start state definitions in the format:
268      \begin{quote}\begin{verbatim}
269   %start name ...
270      \end{verbatim}\end{quote}
271      which are used in specifying start conditions on rules (described
272      below). The \verb"%start" keyword may also be abbreviated as \verb"%s"
273      or \verb"%S".
274   \item
275      Turbo Pascal declarations enclosed between \verb"%{" and \verb"%}".
276      These will be inserted into the output file (at global scope). Also,
277      any line that does not look like a Lex definition (e.g., starts with
278      blank or tab) will be treated as Turbo Pascal code. (In particular,
279      this also allows you to include Turbo Pascal comments in your Lex
280      program.)
281\end{itemize}
282
283The rules section of a TP Lex program contains the actual specification of
284the lexical analyzer routine. It may be thought of as a big \verb"CASE"
285statement discriminating over the different patterns to be matched and listing the
286corresponding statements (actions) to be executed. Each rule consists of a
287regular expression describing the strings to be matched in the input, and a
288corresponding action, a Turbo Pascal statement to be executed when the
289expression matches. Expression and statement are delimited with whitespace
290(blanks and/or tabs). Thus the format of a Lex grammar rule is:
291
292\begin{quote}\begin{verbatim}
293   expression      statement;
294\end{verbatim}\end{quote}
295
296Note that the action must be a single Turbo Pascal statement terminated
297with a semicolon (use \verb"begin ... end" for compound statements). The
298statement may span multiple lines if the successor lines are indented with
299at least one blank or tab. The action may also be replaced by the \verb"|"
300character, indicating that the action for this rule is the same as that for
301the next one.
302
303The TP Lex library unit provides various variables and routines which are
304useful in the programming of actions. In particular, the \verb"yytext" string
305variable holds the text of the matched string, and the \verb"yyleng" Byte
306variable its length.
307
308Regular expressions are used to describe the strings to be matched in a
309grammar rule. They are built from the usual constructs describing character
310classes and sequences, and operators specifying repetitions and alternatives.
311The precise format of regular expressions is described in the next section.
312
313The rules section may also start with some Turbo Pascal declarations
314(enclosed in \verb"%{ %}") which are treated as local declarations of the
315actions routine.
316
317Finally, the auxiliary procedures section may contain arbitrary Turbo
318Pascal code (such as supporting routines or a main program) which is
319simply tacked on to the end of the output file. The auxiliary procedures
320section is optional.
321
322\subsection{Regular Expressions}
323
324Table \ref{tab1} summarizes the format of the regular expressions
325recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig.\ 3.48).
326$c$ stands for a single character, $s$ for a string, $r$ for a regular
327expression, and $n,m$ for nonnegative integers.
328
329\begin{table*}\centering
330   \begin{tabular}{c|c|c}
331      \hline\hline
332      {\sc Expression}& {\sc Matches}& {\sc Example}\\
333      \hline
334      $c$& any non-operator character $c$& \verb"a"\\
335      \verb"\"$c$& character $c$ literally& \verb"\*"\\
336      \verb'"'$s$\verb'"'& string $s$ literally& \verb'"**"'\\
337      \verb"."& any character but newline& \verb"a.*b"\\
338      \verb"^"& beginning of line& \verb"^abc"\\
339      \verb"$"& end of line& \verb"abc$"\\
340      \verb"["$s$\verb"]"& any character in $s$& \verb"[abc]"\\
341      \verb"[^"$s$\verb"]"& any character not in $s$& \verb"[^abc]"\\
342      $r$\verb"*"& zero or more $r$'s& \verb"a*"\\
343      $r$\verb"+"& one or more $r$'s& \verb"a+"\\
344      $r$\verb"?"& zero or one $r$& \verb"a?"\\
345      $r$\verb"{"$m$\verb","$n$\verb"}"& $m$ to $n$ occurrences of $r$& \verb"a{1,5}"\\
346      $r$\verb"{"$m$\verb"}"& $m$ occurrences of $r$& \verb"a{5}"\\
347      $r_1r_2$& $r_1$ then $r_2$& \verb"ab"\\
348      $r_1$\verb"|"$r_2$& $r_1$ or $r_2$& \verb"a|b"\\
349      \verb"("$r$\verb")"& $r$& \verb"(a|b)"\\
350      $r_1$\verb"/"$r_2$& $r_1$ when followed by $r_2$& \verb"a/b"\\
351      \verb"<"$x$\verb">"$r$& $r$ when in start condition $x$& \verb"<x>abc"\\
352      \hline
353   \end{tabular}
354   \caption{Regular expressions.}
355   \label{tab1}
356\end{table*}
357
358The operators \verb"*", \verb"+", \verb"?" and \verb"{}" have highest
359precedence, followed by concatenation. The \verb"|" operator has lowest
360precedence. Parentheses \verb"()" may be used to group expressions and
361overwrite default precedences. The \verb"<>" and \verb"/" operators may only
362occur once in an expression.
363
364The usual C-like escapes are recognized:
365\begin{itemize}
366   \item \verb"\n"     denotes newline
367   \item \verb"\r"     denotes carriage return
368   \item \verb"\t"     denotes tab
369   \item \verb"\b"     denotes backspace
370   \item \verb"\f"     denotes form feed
371   \item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
372\end{itemize}
373
374You can also use the \verb"\" character to quote characters which would
375otherwise be interpreted as operator symbols. In character classes, you may
376use the \verb"-" character to denote ranges of characters. For instance,
377\verb"[a-z]" denotes the class of all lowercase letters.
378
379The expressions in a TP Lex program may be ambigious, i.e. there may be inputs
380which match more than one rule. In such a case, the lexical analyzer prefers
381the longest match and, if it still has the choice between different rules,
382it picks the first of these. If no rule matches, the lexical analyzer
383executes a default action which consists of copying the input character
384to the output unchanged. Thus, if the purpose of a lexical analyzer is
385to translate some parts of the input, and leave the rest unchanged, you
386only have to specify the patterns which have to be treated specially. If,
387however, the lexical analyzer has to absorb its whole input, you will have
388to provide rules that match everything. E.g., you might use the rules
389\begin{quote}\begin{verbatim}
390   .   |
391   \n  ;
392\end{verbatim}\end{quote}
393which match ``any other character'' (and ignore it).
394
395Sometimes certain patterns have to be analyzed differently depending on some
396amount of context in which the pattern appears. In such a case the \verb"/"
397operator is useful. For instance, the expression \verb"a/b" matches \verb"a",
398but only if followed by \verb"b". Note that the \verb"b" does not belong to
399the match; rather, the lexical analyzer, when matching an \verb"a", will look
400ahead in the input to see whether it is followed by a \verb"b", before it
401declares that it has matched an \verb"a". Such lookahead may be arbitrarily
402complex (up to the size of the \verb"LexLib" input buffer). E.g., the pattern
403\verb"a/.*b" matches an \verb"a" which is followed by a \verb"b" somewhere on
404the same input line. TP Lex also has a means to specify left context which is
405described in the next section.
406
407\subsection{Start Conditions}
408
409TP Lex provides some features which make it possible to handle left context.
410The \verb"^" character at the beginning of a regular expression may be used
411to denote the beginning of the line. More distant left context can be described
412conveniently by using start conditions on rules.
413
414Any rule which is prefixed with the \verb"<>" construct is only valid if the
415lexical analyzer is in the denoted start state. For instance, the expression
416\verb"<x>a" can only be matched if the lexical analyzer is in start state
417\verb"x". You can have multiple start states in a rule; e.g., \verb"<x,y>a"
418can be matched in start states \verb"x" or \verb"y".
419
420Start states have to be declared in the definitions section by means of
421one or more start state definitions (see above). The lexical analyzer enters
422a start state through a call to the \verb"LexLib" routine \verb"start". E.g.,
423you may write:
424
425\begin{quote}\begin{verbatim}
426%start x y
427%%
428<x>a    start(y);
429<y>b    start(x);
430%%
431begin
432  start(x); if yylex=0 then ;
433end.
434\end{verbatim}\end{quote}
435
436Upon initialization, the lexical analyzer is put into state \verb"x". It then
437proceeds in state \verb"x" until it matches an \verb"a" which puts it into
438state \verb"y". In state \verb"y" it may match a \verb"b" which puts it into
439state \verb"x" again, etc.
440
441Start conditions are useful when certain constructs have to be analyzed
442differently depending on some left context (such as a special character
443at the beginning of the line), and if multiple lexical analyzers have to
444work in concert. If a rule is not prefixed with a start condition, it is
445valid in all user-defined start states, as well as in the lexical analyzer's
446default start state.
447
448\subsection{Lex Library}
449
450The TP Lex library (\verb"LexLib") unit provides various variables and
451routines which are used by Lex-generated lexical analyzers and application
452programs. It provides the input and output streams and other internal data
453structures used by the lexical analyzer routine, and supplies some variables
454and utility routines which may be used by actions and application programs.
455Refer to the file \verb"lexlib.pas" for a closer description.
456
457You can also modify the Lex library unit (and/or the code template in the
458\verb"yylex.cod" file) to customize TP Lex to your target applications. E.g.,
459you might wish to optimize the code of the lexical analyzer for some
460special application, make the analyzer read from/write to memory instead
461of files, etc.
462
463\subsection{Implementation Restrictions}
464
465Internal table sizes and the main memory available limit the complexity of
466source grammars that TP Lex can handle. There is currently no possibility to
467change internal table sizes (apart from modifying the sources of TP Lex
468itself), but the maximum table sizes provided by TP Lex seem to be large
469enough to handle most realistic applications. The actual table sizes depend on
470the particular implementation (they are much larger than the defaults if TP
471Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
472Free Pascal), and are shown in the statistics printed by TP Lex when a
473compilation is finished. The units given there are ``p'' (positions, i.e.
474items in the position table used to construct the DFA), ``s'' (DFA states) and
475``t'' (transitions of the generated DFA).
476
477As implemented, the generated DFA table is stored as a typed array constant
478which is inserted into the \verb"yylex.cod" code template. The transitions in
479each state are stored in order. Of course it would have been more efficient to
480generate a big \verb"CASE" statement instead, but I found that this may cause
481problems with the encoding of large DFA tables because Turbo Pascal has
482a quite rigid limit on the code size of individual procedures. I decided to
483use a scheme in which transitions on different symbols to the same state are
484merged into one single transition (specifying a character set and the
485corresponding next state). This keeps the number of transitions in each state
486quite small and still allows a fairly efficient access to the transition
487table.
488
489The TP Lex program has an option (\verb"-o") to optimize DFA tables. This
490causes a minimal DFA to be generated, using the algorithm described in Aho,
491Sethi, Ullman (1986). Although the absolute limit on the number of DFA states
492that TP Lex can handle is at least 300, TP Lex poses an additional restriction
493(100) on the number of states in the initial partition of the DFA optimization
494algorithm. Thus, you may get a fatal \verb"integer set overflow" message when
495using the \verb"-o" option even when TP Lex is able to generate an unoptimized
496DFA. In such cases you will just have to be content with the unoptimized DFA.
497(Hopefully, this will be fixed in a future version. Anyhow, using the merged
498transitions scheme described above, TP Lex usually constructs unoptimized
499DFA's which are not far from being optimal, and thus in most cases DFA
500optimization won't have a great impact on DFA table sizes.)
501
502\subsection{Differences from UNIX Lex}
503
504Major differences between TP Lex and UNIX Lex are listed below.
505
506\begin{itemize}
507   \item
508      TP Lex produces output code for Turbo Pascal, rather than for C.
509   \item
510      Character tables (\verb"%T") are not supported; neither are any
511      directives to determine internal table sizes (\verb"%p", \verb"%n",
512      etc.).
513   \item
514      Library routines are named differently from the UNIX version (e.g.,
515      the \verb"start" routine takes the place of the \verb"BEGIN" macro of
516      UNIX Lex), and, of course, all macros of UNIX Lex (\verb"ECHO",
517      \verb"REJECT", etc.) had to be implemented as procedures.
518    \item
519      The TP Lex library unit starts counting line numbers at 0, incrementing
520      the count {\em before\/} a line is read (in contrast, UNIX Lex
521      initializes \verb"yylineno" to 1 and increments it {\em after\/} the
522      line end has been read). This is motivated by the way in which TP Lex
523      maintains the current line, and will not affect your programs unless you
524      explicitly reset the \verb"yylineno" value (e.g., when opening a new
525      input file). In such a case you should set \verb"yylineno" to 0 rather
526      than 1.
527\end{itemize}
528
529\section{TP Yacc}
530
531This section describes the TP Yacc compiler compiler.
532
533\subsection{Usage}
534
535\begin{quote}\begin{verbatim}
536yacc [options] yacc-file[.y] [output-file[.pas]]
537\end{verbatim}\end{quote}
538
539\subsection{Options}
540
541\begin{description}
542   \item[\verb"-v"]
543      ``Verbose:'' TP Yacc generates a readable description of the generated
544      parser, written to \verb"yacc-file" with new extension \verb".lst".
545   \item[\verb"-d"]
546      ``Debug:'' TP Yacc generates parser with debugging output.
547\end{description}
548
549\subsection{Description}
550
551TP Yacc is a program that lets you prepare parsers from the description
552of input languages by BNF-like grammars. You simply specify the grammar
553for your target language, augmented with the Turbo Pascal code necessary
554to process the syntactic constructs, and TP Yacc translates your grammar
555into the Turbo Pascal code for a corresponding parser subroutine named
556\verb"yyparse".
557
558TP Yacc parses the source grammar contained in \verb"yacc-file" (with default
559suffix \verb".y") and writes the constructed parser subroutine to the
560specified \verb"output-file" (with default suffix \verb".pas"); if no output
561file is specified, output goes to \verb"yacc-file" with new suffix
562\verb".pas". If any errors are found during compilation, error messages are
563written to the list file (\verb"yacc-file" with new suffix \verb".lst").
564
565The generated parser routine, \verb"yyparse", is declared as:
566
567\begin{quote}\begin{verbatim}
568   function yyparse : Integer;
569\end{verbatim}\end{quote}
570
571This routine may be called by your main program to execute the parser.
572The return value of the \verb"yyparse" routine denotes success or failure of
573the parser (possible return values: 0 = success, 1 = unrecoverable syntax
574error or parse stack overflow).
575
576Similar to TP Lex, the code template for the \verb"yyparse" routine may be
577found in the \verb"yyparse.cod" file. The rules for locating this file are
578analogous to those of TP Lex (see Section {\em TP Lex\/}).
579
580The TP Yacc library (\verb"YaccLib") unit is required by programs using Yacc-
581generated parsers; you will therefore have to put an appropriate \verb"uses"
582clause into your program or unit that contains the parser routine. The
583\verb"YaccLib" unit also provides some routines which may be used to control
584the actions of the parser. See the file \verb"yacclib.pas" for further
585information.
586
587\subsection{Yacc Source}
588
589A TP Yacc program consists of three sections separated with the \verb"%%"
590delimiter:
591
592\begin{quote}\begin{verbatim}
593definitions
594%%
595rules
596%%
597auxiliary procedures
598\end{verbatim}\end{quote}
599
600The TP Yacc language is free-format: whitespace (blanks, tabs and newlines)
601is ignored, except if it serves as a delimiter. Comments have the C-like
602format \verb"/* ... */". They are treated as whitespace. Grammar symbols are
603denoted by identifiers which have the usual form (letter, including
604underscore, followed by a sequence of letters and digits; upper- and
605lowercase is distinct). The TP Yacc language also has some keywords which
606always start with the \verb"%" character. Literals are denoted by characters
607enclosed in single quotes. The usual C-like escapes are recognized:
608
609\begin{itemize}
610   \item \verb"\n"     denotes newline
611   \item \verb"\r"     denotes carriage return
612   \item \verb"\t"     denotes tab
613   \item \verb"\b"     denotes backspace
614   \item \verb"\f"     denotes form feed
615   \item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
616\end{itemize}
617
618\subsection{Definitions}
619
620The first section of a TP Yacc grammar serves to define the symbols used in
621the grammar. It may contain the following types of definitions:
622
623\begin{itemize}
624   \item
625      start symbol definition: A definition of the form
626      \begin{quote}\begin{verbatim}
627   %start symbol
628      \end{verbatim}\end{quote}
629      declares the start nonterminal of the grammar (if this definition is
630      omitted, TP Yacc assumes the left-hand side nonterminal of the first
631      grammar rule as the start symbol of the grammar).
632   \item
633      terminal definitions: Definitions of the form
634      \begin{quote}\begin{verbatim}
635   %token symbol ...
636      \end{verbatim}\end{quote}
637      are used to declare the terminal symbols (``tokens'') of the target
638      language. Any identifier not introduced in a \verb"%token" definition
639      will be treated as a nonterminal symbol.
640
641      As far as TP Yacc is concerned, tokens are atomic symbols which do not
642      have an innert structure. A lexical analyzer must be provided which
643      takes on the task of tokenizing the input stream and return the
644      individual tokens and literals to the parser (see Section {\em Lexical
645      Analysis\/}).
646   \item
647      precedence definitions: Operator symbols (terminals) may be associated
648      with a precedence by means of a precedence definition which may have
649      one of the following forms
650      \begin{quote}\begin{verbatim}
651   %left symbol ...
652   %right symbol ...
653   %nonassoc symbol ...
654      \end{verbatim}\end{quote}
655      which are used to declare left-, right- and nonassociative operators,
656      respectively. Each precedence definition introduces a new precedence
657      level, lowest precedence first. E.g., you may write:
658      \begin{quote}\begin{verbatim}
659   %nonassoc '<' '>' '=' GEQ LEQ NEQ
660      /* relational operators */
661   %left     '+' '-'  OR
662      /* addition operators */
663   %left     '*' '/' AND
664     /* multiplication operators */
665   %right    NOT UMINUS
666     /* unary operators */
667      \end{verbatim}\end{quote}
668
669      A terminal identifier introduced in a precedence definition may, but
670      need not, appear in a \verb"%token" definition as well.
671   \item
672      type definitions: Any (terminal or nonterminal) grammar symbol may be
673      associated with a type identifier which is used in the processing of
674      semantic values. Type tags of the form \verb"<name>" may be used in
675      token and precedence definitions to declare the type of a terminal
676      symbol, e.g.:
677      \begin{quote}\begin{verbatim}
678   %token <Real>  NUM
679   %left  <AddOp> '+' '-'
680      \end{verbatim}\end{quote}
681
682      To declare the type of a nonterminal symbol, use a type definition of
683      the form:
684      \begin{quote}\begin{verbatim}
685   %type <name> symbol ...
686      \end{verbatim}\end{quote}
687      e.g.:
688      \begin{quote}\begin{verbatim}
689   %type <Real> expr
690      \end{verbatim}\end{quote}
691
692      In a \verb"%type" definition, you may also omit the nonterminals, i.e.
693      you may write:
694      \begin{quote}\begin{verbatim}
695   %type <name>
696      \end{verbatim}\end{quote}
697
698      This is useful when a given type is only used with type casts (see
699      Section {\em Grammar Rules and Actions\/}), and is not associated with
700      a specific nonterminal.
701   \item
702      Turbo Pascal declarations: You may also include arbitrary Turbo Pascal
703      code in the definitions section, enclosed in \verb"%{ %}". This code
704      will be inserted as global declarations into the output file, unchanged.
705\end{itemize}
706
707\subsection{Grammar Rules and Actions}
708
709The second part of a TP Yacc grammar contains the grammar rules for the
710target language. Grammar rules have the format
711
712\begin{quote}\begin{verbatim}
713   name : symbol ... ;
714\end{verbatim}\end{quote}
715
716The left-hand side of a rule must be an identifier (which denotes a
717nonterminal symbol). The right-hand side may be an arbitrary (possibly
718empty) sequence of nonterminal and terminal symbols (including literals
719enclosed in single quotes). The terminating semicolon may also be omitted.
720Different rules for the same left-hand side symbols may be written using
721the \verb"|" character to separate the different alternatives:
722
723\begin{quote}\begin{verbatim}
724   name : symbol ...
725        | symbol ...
726        ...
727        ;
728\end{verbatim}\end{quote}
729
730For instance, to specify a simple grammar for arithmetic expressions, you
731may write:
732
733\begin{quote}\begin{verbatim}
734%left '+' '-'
735%left '*' '/'
736%token NUM
737%%
738expr : expr '+' expr
739     | expr '-' expr
740     | expr '*' expr
741     | expr '/' expr
742     | '(' expr ')'
743     | NUM
744     ;
745\end{verbatim}\end{quote}
746
747(The \verb"%left" definitions at the beginning of the grammar are needed to
748specify the precedence and associativity of the operator symbols. This will be
749discussed in more detail in Section {\em Ambigious Grammars\/}.)
750
751Grammar rules may contain actions -- Turbo Pascal statements enclosed in
752\verb"{ }" -- to be executed as the corresponding rules are recognized.
753Furthermore, rules may return values, and access values returned by other
754rules. These ``semantic'' values are written as \verb"$$" (value of the
755left-hand side nonterminal) and \verb"$i" (value of the $i$th right-hand
756side symbol). They are kept on a special value stack which is maintained
757automatically by the parser.
758
759Values associated with terminal symbols must be set by the lexical analyzer
760(more about this in Section {\em Lexical Analysis\/}). Actions of the form
761\verb"$$ := $1" can frequently be omitted, since it is the default action
762assumed by TP Yacc for any rule that does not have an explicit action.
763
764By default, the semantic value type provided by Yacc is \verb"Integer". You
765can also put a declaration like
766\begin{quote}\begin{verbatim}
767   %{
768   type YYSType = Real;
769   %}
770\end{verbatim}\end{quote}
771into the definitions section of your Yacc grammar to change the default value
772type. However, if you have different value types, the preferred method is to
773use type definitions as discussed in Section {\em Definitions\/}. When such
774type definitions are given, TP Yacc handles all the necessary details of the
775\verb"YYSType" definition and also provides a fair amount of type checking
776which makes it easier to find type errors in the grammar.
777
778For instance, we may declare the symbols \verb"NUM" and \verb"expr" in the
779example above to be of type \verb"Real", and then use these values to
780evaluate an expression as it is parsed.
781
782\begin{quote}\begin{verbatim}
783%left '+' '-'
784%left '*' '/'
785%token <Real> NUM
786%type  <Real> expr
787%%
788expr : expr '+' expr   { $$ := $1+$3; }
789     | expr '-' expr   { $$ := $1-$3; }
790     | expr '*' expr   { $$ := $1*$3; }
791     | expr '/' expr   { $$ := $1/$3; }
792     | '(' expr ')'    { $$ := $2;    }
793     | NUM
794     ;
795\end{verbatim}\end{quote}
796
797(Note that we omitted the action of the last rule. The ``copy action''
798\verb"$$ := $1" required by this rule is automatically added by TP Yacc.)
799
800Actions may not only appear at the end, but also in the middle of a rule
801which is useful to perform some processing before a rule is fully parsed.
802Such actions inside a rule are treated as special nonterminals which are
803associated with an empty right-hand side. Thus, a rule like
804\begin{quote}\begin{verbatim}
805   x : y { action; } z
806\end{verbatim}\end{quote}
807will be treated as:
808\begin{quote}\begin{verbatim}
809  x : y $act z
810  $act : { action; }
811\end{verbatim}\end{quote}
812
813Actions inside a rule may also access values to the left of the action,
814and may return values by assigning to the \verb"$$" value. The value returned
815by such an action can then be accessed by other actions using the usual
816\verb"$i" notation. E.g., we may write:
817\begin{quote}\begin{verbatim}
818   x : y { $$ := 2*$1; } z { $$ := $2+$3; }
819\end{verbatim}\end{quote}
820which has the effect of setting the value of \verb"x" to
821\begin{quote}\begin{verbatim}
822   2*(the value of y)+(the value of z).
823\end{verbatim}\end{quote}
824
825Sometimes it is desirable to access values in enclosing rules. This can be
826done using the notation \verb"$i" with $i\leq 0$. \verb"$0" refers to the
827first value ``to the left'' of the current rule, \verb"$-1" to the second,
828and so on. Note that in this case the referenced value depends on the actual
829contents of the parse stack, so you have to make sure that the requested
830values are always where you expect them.
831
832There are some situations in which TP Yacc cannot easily determine the
833type of values (when a typed parser is used). This is true, in particular,
834for values in enclosing rules and for the \verb"$$" value in an action inside
835a rule. In such cases you may use a type cast to explicitly specify the type
836of a value. The format for such type casts is \verb"$<name>$" (for left-hand
837side values) and \verb"$<name>i" (for right-hand side values) where
838\verb"name" is a type identifier (which must occur in a \verb"%token",
839precedence or \verb"%type" definition).
840
841\subsection{Auxiliary Procedures}
842
843The third section of a TP Yacc program is optional. If it is present, it
844may contain any Turbo Pascal code (such as supporting routines or a main
845program) which is tacked on to the end of the output file.
846
847\subsection{Lexical Analysis}
848
849For any TP Yacc-generated parser, the programmer must supply a lexical
850analyzer routine named \verb"yylex" which performs the lexical analysis for
851the parser. This routine must be declared as
852
853\begin{quote}\begin{verbatim}
854   function yylex : Integer;
855\end{verbatim}\end{quote}
856
857The \verb"yylex" routine may either be prepared by hand, or by using the
858lexical analyzer generator TP Lex (see Section {\em TP Lex\/}).
859
860The lexical analyzer must be included in your main program behind the
861parser subroutine (the \verb"yyparse" code template includes a forward
862definition of the \verb"yylex" routine such that the parser can access the
863lexical analyzer). For instance, you may put the lexical analyzer
864routine into the auxiliary procedures section of your TP Yacc grammar,
865either directly, or by using the the Turbo Pascal include directive
866(\verb"$I").
867
868The parser repeatedly calls the \verb"yylex" routine to tokenize the input
869stream and obtain the individual lexical items in the input. For any
870literal character, the \verb"yylex" routine has to return the corresponding
871character code. For the other, symbolic, terminals of the input language,
872the lexical analyzer must return corresponding integer codes. These are
873assigned automatically by TP Yacc in the order in which token definitions
874appear in the definitions section of the source grammar. The lexical
875analyzer can access these values through corresponding integer constants
876which are declared by TP Yacc in the output file.
877
878For instance, if
879\begin{quote}\begin{verbatim}
880   %token NUM
881\end{verbatim}\end{quote}
882is the first definition in the Yacc grammar, then TP Yacc will create
883a corresponding constant declaration
884\begin{quote}\begin{verbatim}
885   const NUM = 257;
886\end{verbatim}\end{quote}
887in the output file (TP Yacc automatically assigns symbolic token numbers
888starting at 257; 1 thru 255 are reserved for character literals, 0 denotes
889end-of-file, and 256 is reserved for the special error token which will be
890discussed in Section {\em Error Handling\/}). This definition may then be
891used, e.g., in a corresponding TP Lex program as follows:
892\begin{quote}\begin{verbatim}
893   [0-9]+   return(NUM);
894\end{verbatim}\end{quote}
895
896You can also explicitly assign token numbers in the grammar. For this
897purpose, the first occurrence of a token identifier in the definitions
898section may be followed by an unsigned integer. E.g. you may write:
899\begin{quote}\begin{verbatim}
900   %token NUM 299
901\end{verbatim}\end{quote}
902
903Besides the return value of \verb"yylex", the lexical analyzer routine may
904also return an additional semantic value for the recognized token. This value
905is assigned to a variable named \verb"yylval" and may then be accessed in
906actions through the \verb"$i" notation (see above, Section {\em Grammar
907Rules and Actions\/}). The \verb"yylval" variable is of type \verb"YYSType"
908(the semantic value type, \verb"Integer" by default); its declaration may be
909found in the \verb"yyparse.cod" file.
910
911For instance, to assign an \verb"Integer" value to a \verb"NUM" token in the
912above example, we may write:
913
914\begin{quote}\begin{verbatim}
915   [0-9]+   begin
916              val(yytext, yylval, code);
917              return(NUM);
918            end;
919\end{verbatim}\end{quote}
920
921This assigns \verb"yylval" the value of the \verb"NUM" token (using the Turbo
922Pascal standard procedure \verb"val").
923
924If a parser uses tokens of different types (via a \verb"%token <name>"
925definition), then the \verb"yylval" variable will not be of type
926\verb"Integer", but instead of a corresponding variant record type which is
927capable of holding all the different value types declared in the TP Yacc
928grammar. In this case, the lexical analyzer must assign a semantic value to
929the corresponding record component which is named \verb"yy"{\em name\/}
930(where {\em name\/} stands for the corresponding type identifier).
931
932E.g., if token \verb"NUM" is declared \verb"Real":
933\begin{quote}\begin{verbatim}
934   %token <Real> NUM
935\end{verbatim}\end{quote}
936then the value for token \verb"NUM" must be assigned to \verb"yylval.yyReal".
937
938\subsection{How The Parser Works}
939
940TP Yacc uses the LALR(1) technique developed by Donald E.\ Knuth and F.\
941DeRemer to construct a simple, efficient, non-backtracking bottom-up
942parser for the source grammar. The LALR parsing technique is described
943in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a
944look at the parser description TP Yacc generates from a small sample
945grammar, to get an idea of how the LALR parsing algorithm works. We
946consider the following simplified version of the arithmetic expression
947grammar:
948
949\begin{quote}\begin{verbatim}
950%token NUM
951%left '+'
952%left '*'
953%%
954expr : expr '+' expr
955     | expr '*' expr
956     | '(' expr ')'
957     | NUM
958     ;
959\end{verbatim}\end{quote}
960
961When run with the \verb"-v" option on the above grammar, TP Yacc generates
962the parser description listed below.
963
964\begin{quote}\begin{verbatim}
965state 0:
966
967        $accept : _ expr $end
968
969        '('     shift 2
970        NUM     shift 3
971        .       error
972
973        expr    goto 1
974
975state 1:
976
977        $accept : expr _ $end
978        expr : expr _ '+' expr
979        expr : expr _ '*' expr
980
981        $end    accept
982        '*'     shift 4
983        '+'     shift 5
984        .       error
985
986state 2:
987
988        expr : '(' _ expr ')'
989
990        '('     shift 2
991        NUM     shift 3
992        .       error
993
994        expr    goto 6
995
996state 3:
997
998        expr : NUM _    (4)
999
1000        .       reduce 4
1001
1002state 4:
1003
1004        expr : expr '*' _ expr
1005
1006        '('     shift 2
1007        NUM     shift 3
1008        .       error
1009
1010        expr    goto 7
1011
1012state 5:
1013
1014        expr : expr '+' _ expr
1015
1016        '('     shift 2
1017        NUM     shift 3
1018        .       error
1019
1020        expr    goto 8
1021
1022state 6:
1023
1024        expr : '(' expr _ ')'
1025        expr : expr _ '+' expr
1026        expr : expr _ '*' expr
1027
1028        ')'     shift 9
1029        '*'     shift 4
1030        '+'     shift 5
1031        .       error
1032
1033state 7:
1034
1035        expr : expr '*' expr _  (2)
1036        expr : expr _ '+' expr
1037        expr : expr _ '*' expr
1038
1039        .       reduce 2
1040
1041state 8:
1042
1043        expr : expr '+' expr _  (1)
1044        expr : expr _ '+' expr
1045        expr : expr _ '*' expr
1046
1047        '*'     shift 4
1048        $end    reduce 1
1049        ')'     reduce 1
1050        '+'     reduce 1
1051        .       error
1052
1053state 9:
1054
1055        expr : '(' expr ')' _   (3)
1056
1057        .       reduce 3
1058\end{verbatim}\end{quote}
1059
1060Each state of the parser corresponds to a certain prefix of the input
1061which has already been seen. The parser description lists the grammar
1062rules wich are parsed in each state, and indicates the portion of each
1063rule which has already been parsed by an underscore. In state 0, the
1064start state of the parser, the parsed rule is
1065\begin{quote}\begin{verbatim}
1066        $accept : expr $end
1067\end{verbatim}\end{quote}
1068
1069This is not an actual grammar rule, but a starting rule automatically
1070added by TP Yacc. In general, it has the format
1071\begin{quote}\begin{verbatim}
1072        $accept : X $end
1073\end{verbatim}\end{quote}
1074where \verb"X" is the start nonterminal of the grammar, and \verb"$end" is
1075a pseudo token denoting end-of-input (the \verb"$end" symbol is used by the
1076parser to determine when it has successfully parsed the input).
1077
1078The description of the start rule in state 0,
1079\begin{quote}\begin{verbatim}
1080        $accept : _ expr $end
1081\end{verbatim}\end{quote}
1082with the underscore positioned before the \verb"expr" symbol, indicates that
1083we are at the beginning of the parse and are ready to parse an expression
1084(nonterminal \verb"expr").
1085
1086The parser maintains a stack to keep track of states visited during the
1087parse. There are two basic kinds of actions in each state: {\em shift\/},
1088which reads an input symbol and pushes the corresponding next state on top of
1089the stack, and {\em reduce\/} which pops a number of states from the stack
1090(corresponding to the number of right-hand side symbols of the rule used
1091in the reduction) and consults the {\em goto\/} entries of the uncovered
1092state to find the transition corresponding to the left-hand side symbol of the
1093reduced rule.
1094
1095In each step of the parse, the parser is in a given state (the state on
1096top of its stack) and may consult the current {\em lookahead symbol\/}, the
1097next symbol in the input, to determine the parse action -- shift or reduce --
1098to perform. The parser terminates as soon as it reaches state 1 and reads
1099in the endmarker, indicated by the {\em accept\/} action on \verb"$end" in
1100state 1.
1101
1102Sometimes the parser may also carry out an action without inspecting the
1103current lookahead token. This is the case, e.g., in state 3 where the
1104only action is reduction by rule 4:
1105\begin{quote}\begin{verbatim}
1106        .       reduce 4
1107\end{verbatim}\end{quote}
1108
1109The default action in a state can also be {\em error\/} indicating that any
1110other input represents a syntax error. (In case of such an error the
1111parser will start syntactic error recovery, as described in Section
1112{\em Error Handling\/}.)
1113
1114Now let us see how the parser responds to a given input. We consider the
1115input string \verb"2+5*3" which is presented to the parser as the token
1116sequence:
1117\begin{quote}\begin{verbatim}
1118   NUM + NUM * NUM
1119\end{verbatim}\end{quote}
1120
1121Table \ref{tab2} traces the corresponding actions of the parser. We also
1122show the current state in each move, and the remaining states on the stack.
1123
1124\begin{table*}\centering
1125   \begin{tabular}{l|l|l|p{8cm}}
1126      \hline\hline
1127      {\sc State}& {\sc Stack}& {\sc Lookahead}& {\sc Action}\\
1128      \hline
11290 &               &    \verb"NUM"    &    shift state 3\\
11303 &     0         &                  &    reduce rule 4 (pop 1 state, uncovering state
1131                                          0, then goto state 1 on symbol \verb"expr")\\
11321 &     0         &    \verb"+"      &    shift state 5\\
11335 &     1 0       &    \verb"NUM"    &    shift state 3\\
11343 &     5 1 0     &                  &    reduce rule 4 (pop 1 state, uncovering state
1135                                          5, then goto state 8 on symbol \verb"expr")\\
11368 &     5 1 0     &    \verb"*"      &    shift 4\\
11374 &     8 5 1 0   &    \verb"NUM"    &    shift 3\\
11383 &     4 8 5 1 0 &                  &    reduce rule 4 (pop 1 state, uncovering state
1139                                          4, then goto state 7 on symbol \verb"expr")\\
11407 &     4 8 5 1 0 &                  &    reduce rule 2 (pop 3 states, uncovering state
1141                                          5, then goto state 8 on symbol \verb"expr")\\
11428 &     5 1 0     &    \verb"$end"   &    reduce rule 1 (pop 3 states, uncovering state
1143                                          0, then goto state 1 on symbol \verb"expr")\\
11441 &     0         &    \verb"$end"   &    accept\\
1145      \hline
1146   \end{tabular}
1147   \caption{Parse of \protect\verb"NUM + NUM * NUM".}
1148   \label{tab2}
1149\end{table*}
1150
1151It is also instructive to see how the parser responds to illegal inputs.
1152E.g., you may try to figure out what the parser does when confronted with:
1153\begin{quote}\begin{verbatim}
1154   NUM + )
1155\end{verbatim}\end{quote}
1156or:
1157\begin{quote}\begin{verbatim}
1158   ( NUM * NUM
1159\end{verbatim}\end{quote}
1160
1161You will find that the parser, sooner or later, will always run into an
1162error action when confronted with errorneous inputs. An LALR parser will
1163never shift an invalid symbol and thus will always find syntax errors as
1164soon as it is possible during a left-to-right scan of the input.
1165
1166TP Yacc provides a debugging option (\verb"-d") that may be used to trace
1167the actions performed by the parser. When a grammar is compiled with the
1168\verb"-d" option, the generated parser will print out the actions as it
1169parses its input.
1170
1171\subsection{Ambigious Grammars}
1172
1173There are situations in which TP Yacc will not produce a valid parser for
1174a given input language. LALR(1) parsers are restricted to one-symbol
1175lookahead on which they have to base their parsing decisions. If a
1176grammar is ambigious, or cannot be parsed unambigiously using one-symbol
1177lookahead, TP Yacc will generate parsing conflicts when constructing the
1178parse table. There are two types of such conflicts: {\em shift/reduce
1179conflicts\/} (when there is both a shift and a reduce action for a given
1180input symbol in a given state), and {\em reduce/reduce\/} conflicts (if
1181there is more than one reduce action for a given input symbol in a given
1182state). Note that there never will be a shift/shift conflict.
1183
1184When a grammar generates parsing conflicts, TP Yacc prints out the number
1185of shift/reduce and reduce/reduce conflicts it encountered when constructing
1186the parse table. However, TP Yacc will still generate the output code for the
1187parser. To resolve parsing conflicts, TP Yacc uses the following built-in
1188disambiguating rules:
1189
1190\begin{itemize}
1191   \item
1192      in a shift/reduce conflict, TP Yacc chooses the shift action.
1193   \item
1194      in a reduce/reduce conflict, TP Yacc chooses reduction of the first
1195      grammar rule.
1196\end{itemize}
1197
1198The shift/reduce disambiguating rule correctly resolves a type of
1199ambiguity known as the ``dangling-else ambiguity'' which arises in the
1200syntax of conditional statements of many programming languages (as in
1201Pascal):
1202
1203\begin{quote}\begin{verbatim}
1204%token IF THEN ELSE
1205%%
1206stmt : IF expr THEN stmt
1207     | IF expr THEN stmt ELSE stmt
1208     ;
1209\end{verbatim}\end{quote}
1210
1211This grammar is ambigious, because a nested construct like
1212\begin{quote}\begin{verbatim}
1213   IF expr-1 THEN IF expr-2 THEN stmt-1
1214     ELSE stmt-2
1215\end{verbatim}\end{quote}
1216can be parsed two ways, either as:
1217\begin{quote}\begin{verbatim}
1218   IF expr-1 THEN ( IF expr-2 THEN stmt-1
1219     ELSE stmt-2 )
1220\end{verbatim}\end{quote}
1221or as:
1222\begin{quote}\begin{verbatim}
1223   IF expr-1 THEN ( IF expr-2 THEN stmt-1 )
1224     ELSE stmt-2
1225\end{verbatim}\end{quote}
1226
1227The first interpretation makes an \verb"ELSE" belong to the last unmatched
1228\verb"IF" which also is the interpretation chosen in most programming
1229languages. This is also the way that a TP Yacc-generated parser will parse
1230the construct since the shift/reduce disambiguating rule has the effect of
1231neglecting the reduction of \verb"IF expr-2 THEN stmt-1"; instead, the parser
1232will shift the \verb"ELSE" symbol which eventually leads to the reduction of
1233\verb"IF expr-2 THEN stmt-1 ELSE stmt-2".
1234
1235The reduce/reduce disambiguating rule is used to resolve conflicts that
1236arise when there is more than one grammar rule matching a given construct.
1237Such ambiguities are often caused by ``special case constructs'' which may be
1238given priority by simply listing the more specific rules ahead of the more
1239general ones.
1240
1241For instance, the following is an excerpt from the grammar describing the
1242input language of the UNIX equation formatter EQN:
1243
1244\begin{quote}\begin{verbatim}
1245%right SUB SUP
1246%%
1247expr : expr SUB expr SUP expr
1248     | expr SUB expr
1249     | expr SUP expr
1250     ;
1251\end{verbatim}\end{quote}
1252
1253Here, the \verb"SUB" and \verb"SUP" operator symbols denote sub- and
1254superscript, respectively. The rationale behind this example is that
1255an expression involving both sub- and superscript is often set differently
1256from a superscripted subscripted expression (compare $x_i^n$ to ${x_i}^n$).
1257This special case is therefore caught by the first rule in the above example
1258which causes a reduce/reduce conflict with rule 3 in expressions like
1259\verb"expr-1 SUB expr-2 SUP expr-3". The conflict is resolved in favour of
1260the first rule.
1261
1262In both cases discussed above, the ambiguities could also be eliminated
1263by rewriting the grammar accordingly (although this yields more complicated
1264and less readable grammars). This may not always be the case. Often
1265ambiguities are also caused by design errors in the grammar. Hence, if
1266TP Yacc reports any parsing conflicts when constructing the parser, you
1267should use the \verb"-v" option to generate the parser description
1268(\verb".lst" file) and check whether TP Yacc resolved the conflicts correctly.
1269
1270There is one type of syntactic constructs for which one often deliberately
1271uses an ambigious grammar as a more concise representation for a language
1272that could also be specified unambigiously: the syntax of expressions.
1273For instance, the following is an unambigious grammar for simple arithmetic
1274expressions:
1275
1276\begin{quote}\begin{verbatim}
1277%token NUM
1278
1279%%
1280
1281expr    : term
1282        | expr '+' term
1283        ;
1284
1285term    : factor
1286        | term '*' factor
1287        ;
1288
1289factor  : '(' expr ')'
1290        | NUM
1291        ;
1292\end{verbatim}\end{quote}
1293
1294You may check yourself that this grammar gives \verb"*" a higher precedence
1295than \verb"+" and makes both operators left-associative. The same effect can
1296be achieved with the following ambigious grammar using precedence definitions:
1297
1298\begin{quote}\begin{verbatim}
1299%token NUM
1300%left '+'
1301%left '*'
1302%%
1303expr : expr '+' expr
1304     | expr '*' expr
1305     | '(' expr ')'
1306     | NUM
1307     ;
1308\end{verbatim}\end{quote}
1309
1310Without the precedence definitions, this is an ambigious grammar causing
1311a number of shift/reduce conflicts. The precedence definitions are used
1312to correctly resolve these conflicts (conflicts resolved using precedence
1313will not be reported by TP Yacc).
1314
1315Each precedence definition introduces a new precedence level (lowest
1316precedence first) and specifies whether the corresponding operators
1317should be left-, right- or nonassociative (nonassociative operators
1318cannot be combined at all; example: relational operators in Pascal).
1319
1320TP Yacc uses precedence information to resolve shift/reduce conflicts as
1321follows. Precedences are associated with each terminal occuring in a
1322precedence definition. Furthermore, each grammar rule is given the
1323precedence of its rightmost terminal (this default choice can be
1324overwritten using a \verb"%prec" tag; see below). To resolve a shift/reduce
1325conflict using precedence, both the symbol and the rule involved must
1326have been assigned precedences. TP Yacc then chooses the parse action
1327as follows:
1328
1329\begin{itemize}
1330   \item
1331      If the symbol has higher precedence than the rule: shift.
1332   \item
1333      If the rule has higher precedence than the symbol: reduce.
1334   \item
1335      If symbol and rule have the same precedence, the associativity of the
1336      symbol determines the parse action: if the symbol is left-associative:
1337      reduce; if the symbol is right-associative: shift; if the symbol is
1338      non-associative: error.
1339\end{itemize}
1340
1341To give you an idea of how this works, let us consider our ambigious
1342arithmetic expression grammar (without precedences):
1343
1344\begin{quote}\begin{verbatim}
1345%token NUM
1346%%
1347expr : expr '+' expr
1348     | expr '*' expr
1349     | '(' expr ')'
1350     | NUM
1351     ;
1352\end{verbatim}\end{quote}
1353
1354This grammar generates four shift/reduce conflicts. The description
1355of state 8 reads as follows:
1356
1357\begin{quote}\begin{verbatim}
1358state 8:
1359
1360        *** conflicts:
1361
1362        shift 4, reduce 1 on '*'
1363        shift 5, reduce 1 on '+'
1364
1365        expr : expr '+' expr _  (1)
1366        expr : expr _ '+' expr
1367        expr : expr _ '*' expr
1368
1369        '*'     shift 4
1370        '+'     shift 5
1371        $end    reduce 1
1372        ')'     reduce 1
1373        .       error
1374\end{verbatim}\end{quote}
1375
1376In this state, we have successfully parsed a \verb"+" expression (rule 1).
1377When the next symbol is \verb"+" or \verb"*", we have the choice between the
1378reduction and shifting the symbol. Using the default shift/reduce
1379disambiguating rule, TP Yacc has resolved these conflicts in favour of shift.
1380
1381Now let us assume the above precedence definition:
1382\begin{quote}\begin{verbatim}
1383   %left '+'
1384   %left '*'
1385\end{verbatim}\end{quote}
1386which gives \verb"*" higher precedence than \verb"+" and makes both operators
1387left-associative. The rightmost terminal in rule 1 is \verb"+". Hence, given
1388these precedence definitions, the first conflict will be resolved in favour
1389of shift (\verb"*" has higher precedence than \verb"+"), while the second one
1390is resolved in favour of reduce (\verb"+" is left-associative).
1391
1392Similar conflicts arise in state 7:
1393
1394\begin{quote}\begin{verbatim}
1395state 7:
1396
1397        *** conflicts:
1398
1399        shift 4, reduce 2 on '*'
1400        shift 5, reduce 2 on '+'
1401
1402        expr : expr '*' expr _  (2)
1403        expr : expr _ '+' expr
1404        expr : expr _ '*' expr
1405
1406        '*'     shift 4
1407        '+'     shift 5
1408        $end    reduce 2
1409        ')'     reduce 2
1410        .       error
1411\end{verbatim}\end{quote}
1412
1413Here, we have successfully parsed a \verb"*" expression which may be followed
1414by another \verb"+" or \verb"*" operator. Since \verb"*" is left-associative
1415and has higher precedence than \verb"+", both conflicts will be resolved in
1416favour of reduce.
1417
1418Of course, you can also have different operators on the same precedence
1419level. For instance, consider the following extended version of the
1420arithmetic expression grammar:
1421
1422\begin{quote}\begin{verbatim}
1423%token NUM
1424%left '+' '-'
1425%left '*' '/'
1426%%
1427expr    : expr '+' expr
1428        | expr '-' expr
1429        | expr '*' expr
1430        | expr '/' expr
1431        | '(' expr ')'
1432        | NUM
1433        ;
1434\end{verbatim}\end{quote}
1435
1436This puts all ``addition'' operators on the first and all ``multiplication''
1437operators on the second precedence level. All operators are left-associative;
1438for instance, \verb"5+3-2" will be parsed as \verb"(5+3)-2".
1439
1440By default, TP Yacc assigns each rule the precedence of its rightmost
1441terminal. This is a sensible decision in most cases. Occasionally, it
1442may be necessary to overwrite this default choice and explicitly assign
1443a precedence to a rule. This can be done by putting a precedence tag
1444of the form
1445\begin{quote}\begin{verbatim}
1446   %prec symbol
1447\end{verbatim}\end{quote}
1448at the end of the corresponding rule which gives the rule the precedence
1449of the specified symbol. For instance, to extend the expression grammar
1450with a unary minus operator, giving it highest precedence, you may write:
1451
1452\begin{quote}\begin{verbatim}
1453%token NUM
1454%left '+' '-'
1455%left '*' '/'
1456%right UMINUS
1457%%
1458expr    : expr '+' expr
1459        | expr '-' expr
1460        | expr '*' expr
1461        | expr '/' expr
1462        | '-' expr      %prec UMINUS
1463        | '(' expr ')'
1464        | NUM
1465        ;
1466\end{verbatim}\end{quote}
1467
1468Note the use of the \verb"UMINUS" token which is not an actual input symbol
1469but whose sole purpose it is to give unary minus its proper precedence. If
1470we omitted the precedence tag, both unary and binary minus would have the
1471same precedence because they are represented by the same input symbol.
1472
1473\subsection{Error Handling}
1474
1475Syntactic error handling is a difficult area in the design of user-friendly
1476parsers. Usually, you will not like to have the parser give up upon the
1477first occurrence of an errorneous input symbol. Instead, the parser should
1478recover from a syntax error, that is, it should try to find a place in the
1479input where it can resume the parse.
1480
1481TP Yacc provides a general mechanism to implement parsers with error
1482recovery. A special predefined \verb"error" token may be used in grammar rules
1483to indicate positions where syntax errors might occur. When the parser runs
1484into an error action (i.e., reads an errorneous input symbol) it prints out
1485an error message and starts error recovery by popping its stack until it
1486uncovers a state in which there is a shift action on the \verb"error" token.
1487If there is no such state, the parser terminates with return value 1,
1488indicating an unrecoverable syntax error. If there is such a state, the
1489parser takes the shift on the \verb"error" token (pretending it has seen
1490an imaginary \verb"error" token in the input), and resumes parsing in a
1491special ``error mode.''
1492
1493While in error mode, the parser quietly skips symbols until it can again
1494perform a legal shift action. To prevent a cascade of error messages, the
1495parser returns to its normal mode of operation only after it has seen
1496and shifted three legal input symbols. Any additional error found after
1497the first shifted symbol restarts error recovery, but no error message
1498is printed. The TP Yacc library routine \verb"yyerrok" may be used to reset
1499the parser to its normal mode of operation explicitly.
1500
1501For a simple example, consider the rule
1502\begin{quote}\begin{verbatim}
1503   stmt : error ';' { yyerrok; }
1504\end{verbatim}\end{quote}
1505and assume a syntax error occurs while a statement (nonterminal \verb"stmt")
1506is parsed. The parser prints an error message, then pops its stack until it
1507can shift the token \verb"error" of the error rule. Proceeding in error mode,
1508it will skip symbols until it finds a semicolon, then reduces by the error
1509rule. The call to \verb"yyerrok" tells the parser that we have recovered from
1510the error and that it should proceed with the normal parse. This kind of
1511``panic mode'' error recovery scheme works well when statements are always
1512terminated with a semicolon. The parser simply skips the ``bad'' statement
1513and then resumes the parse.
1514
1515Implementing a good error recovery scheme can be a difficult task; see
1516Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic.
1517Schreiner and Friedman have developed a systematic technique to implement
1518error recovery with Yacc which I found quite useful (I used it myself
1519to implement error recovery in the TP Yacc parser); see Schreiner/Friedman
1520(1985).
1521
1522\subsection{Yacc Library}
1523
1524The TP Yacc library (\verb"YaccLib") unit provides some global declarations
1525used by the parser routine \verb"yyparse", and some variables and utility
1526routines which may be used to control the actions of the parser and to
1527implement error recovery. See the file \verb"yacclib.pas" for a description
1528of these variables and routines.
1529
1530You can also modify the Yacc library unit (and/or the code template in the
1531\verb"yyparse.cod" file) to customize TP Yacc to your target applications.
1532
1533\subsection{Other Features}
1534
1535TP Yacc supports all additional language elements entitled as ``Old Features
1536Supported But not Encouraged'' in the UNIX manual, which are provided for
1537backward compatibility with older versions of (UNIX) Yacc:
1538
1539\begin{itemize}
1540   \item
1541      literals delimited by double quotes.
1542   \item
1543      multiple-character literals. Note that these are not treated as
1544      character sequences but represent single tokens which are given a
1545      symbolic integer code just like any other token identifier. However,
1546      they will not be declared in the output file, so you have to make sure
1547      yourself that the lexical analyzer returns the correct codes for these
1548      symbols. E.g., you might explicitly assign token numbers by using a
1549      definition like
1550      \begin{quote}\begin{verbatim}
1551   %token ':=' 257
1552      \end{verbatim}\end{quote}
1553      at the beginning of the Yacc grammar.
1554   \item
1555      \verb"\" may be used instead of \verb"%", i.e. \verb"\\" means
1556      \verb"%%", \verb"\left" is the same as \verb"%left", etc.
1557   \item
1558      other synonyms:
1559      \begin{itemize}
1560         \item \verb"%<"                    for \verb"%left"
1561         \item \verb"%>"                    for \verb"%right"
1562         \item \verb"%binary" or \verb"%2"  for \verb"%nonassoc"
1563         \item \verb"%term" or \verb"%0"    for \verb"%token"
1564         \item \verb"%="                    for \verb"%prec"
1565      \end{itemize}
1566   \item
1567      actions may also be written as \verb"= { ... }" or
1568      \verb"= single-statement;"
1569   \item
1570      Turbo Pascal declarations (\verb"%{ ... %}") may be put at the
1571      beginning of the rules section. They will be treated as local
1572      declarations of the actions routine.
1573\end{itemize}
1574
1575\subsection{Implementation Restrictions}
1576
1577As with TP Lex, internal table sizes and the main memory available limit the
1578complexity of source grammars that TP Yacc can handle. However, the maximum
1579table sizes provided by TP Yacc are large enough to handle quite complex
1580grammars (such as the Pascal grammar in the TP Yacc distribution). The actual
1581table sizes are shown in the statistics printed by TP Yacc when a compilation
1582is finished. The given figures are "s" (states), "i" (LR0 kernel items), "t"
1583(shift and goto transitions) and "r" (reductions).
1584
1585The default stack size of the generated parsers is \verb"yymaxdepth = 1024",
1586as declared in the TP Yacc library unit. This should be sufficient for any
1587average application, but you can change the stack size by including a
1588corresponding declaration in the definitions part of the Yacc grammar
1589(or change the value in the \verb"YaccLib" unit). Note that right-recursive
1590grammar rules may increase stack space requirements, so it is a good
1591idea to use left-recursive rules wherever possible.
1592
1593\subsection{Differences from UNIX Yacc}
1594
1595Major differences between TP Yacc and UNIX Yacc are listed below.
1596
1597\begin{itemize}
1598   \item
1599      TP Yacc produces output code for Turbo Pascal, rather than for C.
1600   \item
1601      TP Yacc does not support \verb"%union" definitions. Instead, a value
1602      type is declared by specifying the type identifier itself as the tag of
1603      a \verb"%token" or \verb"%type" definition. TP Yacc will automatically
1604      generate an appropriate variant record type (\verb"YYSType") which is
1605      capable of holding values of any of the types used in \verb"%token" and
1606      \verb"%type".
1607
1608      Type checking is very strict. If you use type definitions, then
1609      any symbol referred to in an action must have a type introduced
1610      in a type definition. Either the symbol must have been assigned a
1611      type in the definitions section, or the \verb"$<type-identifier>"
1612      notation must be used. The syntax of the \verb"%type" definition has
1613      been changed slightly to allow definitions of the form
1614      \begin{quote}\begin{verbatim}
1615   %type <type-identifier>
1616      \end{verbatim}\end{quote}
1617      (omitting the nonterminals) which may be used to declare types which
1618      are not assigned to any grammar symbol, but are used with the
1619      \verb"$<...>" construct.
1620   \item
1621      The parse tables constructed by this Yacc version are slightly greater
1622      than those constructed by UNIX Yacc, since a reduce action will only be
1623      chosen as the default action if it is the only action in the state.
1624      In difference, UNIX Yacc chooses a reduce action as the default action
1625      whenever it is the only reduce action of the state (even if there are
1626      other shift actions).
1627
1628      This solves a bug in UNIX Yacc that makes the generated parser start
1629      error recovery too late with certain types of error productions (see
1630      also Schreiner/Friedman, {\em Introduction to compiler construction with
1631      UNIX,\/} 1985). Also, errors will be caught sooner in most cases where
1632      UNIX Yacc would carry out an additional (default) reduction before
1633      detecting the error.
1634   \item
1635      Library routines are named differently from the UNIX version (e.g.,
1636      the \verb"yyerrlab" routine takes the place of the \verb"YYERROR"
1637      macro of UNIX Yacc), and, of course, all macros of UNIX Yacc
1638      (\verb"YYERROR", \verb"YYACCEPT", etc.) had to be implemented as
1639      procedures.
1640\end{itemize}
1641
1642\end{document}
1643