1 2\documentstyle[11pt]{article} 3 4\title{TP Lex and Yacc -- The Compiler Writer's Tools for Turbo Pascal\\ 5 Version 4.1 User Manual} 6 7\author{Albert Gr\"af\\ 8 Department of Musicinformatics\\ 9 Johannes Gutenberg-University Mainz\\ 10 \\ 11 ag@muwiinfa.geschichte.uni-mainz.de} 12 13\date{April 1998} 14 15\setlength{\topmargin}{0cm} 16\setlength{\oddsidemargin}{0cm} 17\setlength{\evensidemargin}{0cm} 18\setlength{\textwidth}{16cm} 19\setlength{\textheight}{21cm} 20%\setlength{\parindent}{0pt} 21\parskip=4pt plus 1pt minus 1pt 22\itemsep=0pt 23\renewcommand{\baselinestretch}{1.1} 24\unitlength=1mm 25%\tolerance=500 26%\parskip=0.1cm 27\leftmargini 1.5em 28\leftmarginii 1.5em \leftmarginiii 1.5em \leftmarginiv 1.5em \leftmarginv 1.5em 29\leftmarginvi 1.5em 30 31\begin{document} 32 33\maketitle 34 35\tableofcontents 36 37\section{Introduction} 38 39This document describes the TP Lex and Yacc compiler generator toolset. 40These tools are designed especially to help you prepare compilers and 41similar programs like text processing utilities and command language 42interpreters with the Turbo Pascal (TM) programming language. 43 44TP Lex and Yacc are Turbo Pascal adaptions of the well-known UNIX (TM) 45utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson 46at Bell Laboratories, and are used with the C programming language. TP Lex 47and Yacc are intended to be approximately ``compatible'' with these programs. 48However, they are an independent development of the author, based on the 49techniques described in the famous ``dragon book'' of Aho, Sethi and Ullman 50(Aho, Sethi, Ullman: {\em Compilers : principles, techniques and tools,\/} 51Reading (Mass.), Addison-Wesley, 1986). 52 53Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland 54Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo 55Pascal-compatible compiler which currently runs on DOS and Linux (other ports 56are under development). Recent information about TP Lex/Yacc, and the sources 57are available from the TPLY homepage: 58\begin{quote}\begin{verbatim} 59 http://www.musikwissenschaft.uni-mainz.de/~ag/tply 60\end{verbatim}\end{quote} 61For information about the Free Pascal Compiler, please refer to: 62\begin{quote}\begin{verbatim} 63 http://www.freepascal.org 64\end{verbatim}\end{quote} 65 66TP Lex and Yacc, like any other tools of this kind, are not intended for 67novices or casual programmers; they require extensive programming experience 68as well as a thorough understanding of the principles of parser design and 69implementation to be put to work successfully. But if you are a seasoned 70Turbo Pascal programmer with some background in compiler design and formal 71language theory, you will almost certainly find TP Lex and Yacc to be a 72powerful extension of your Turbo Pascal toolset. 73 74This manual tells you how to get started with the TP Lex and Yacc programs 75and provides a short description of these programs. Some knowledge about 76the C versions of Lex and Yacc will be useful, although not strictly 77necessary. For further reading, you may also refer to: 78 79\begin{itemize} 80 \item 81 Aho, Sethi and Ullman: {\em Compilers : principles, techniques and 82 tools.\/} Reading (Mass.), Addison-Wesley, 1986. 83 \item 84 Johnson, S.C.: {\em Yacc -- yet another compiler-compiler.\/} CSTR-32, 85 Bell Telephone Laboratories, 1974. 86 \item 87 Lesk, M.E.: {\em Lex -- a lexical analyser generator.\/} CSTR-39, Bell 88 Telephone Laboratories, 1975. 89 \item 90 Schreiner, Friedman: {\em Introduction to compiler construction with 91 UNIX.\/} Prentice-Hall, 1985. 92 \item 93 The Unix Programmer's Manual, Sections `Lex' and `Yacc'. 94\end{itemize} 95 96 97\subsection*{Credits} 98 99I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex 100and Yacc to take advantage of the large memory models in Borland Pascal 7.0 101and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be), 102the maintainer of the Linux version of the Free Pascal compiler, who is 103responsible for the Free Pascal port. And of course thanks are due to the many 104TP Lex/Yacc users all over the world for their support and comments which 105helped to improve these programs. 106 107 108\subsection*{Getting Started} 109 110Instructions on how to compile and install TP Lex and Yacc on all supported 111platforms can be found in the \verb"README" file contained in the 112distribution. 113 114Once you have installed TP Lex and Yacc on your system, you can compile your 115first TP Lex and Yacc program \verb"expr". \verb"Expr" is a simple desktop 116calculator program contained in the distribution, which consists of a lexical 117analyzer in the TP Lex source file \verb"exprlex.l" and the parser and main 118program in the TP Yacc source file \verb"expr.y". To compile these programs, 119issue the commands 120\begin{quote}\begin{verbatim} 121 lex exprlex 122 yacc expr 123\end{verbatim}\end{quote} 124That's it! You now have the Turbo Pascal sources (\verb"exprlex.pas" and 125\verb"expr.pas") for the \verb"expr" program. Use the Turbo Pascal 126compiler to compile these programs as follows: 127\begin{quote}\begin{verbatim} 128 tpc expr 129\end{verbatim}\end{quote} 130 131(Of course, the precise compilation command depends on the type of compiler 132you are using. Thus you may have to replace \verb"tpc" with \verb"bpc", 133\verb"dcc" or \verb"dcc32", depending on the version of the 134Turbo/Borland/Delphi compiler you have, and with \verb"ppc386" for the Free 135Pascal compiler. If you are using TP Lex and Yacc with Free Pascal under 136Linux, the corresponding commands are: 137\begin{quote}\begin{verbatim} 138 plex exprlex 139 pyacc expr 140 ppc386 expr 141\end{verbatim}\end{quote} 142Note that in the Linux version, the programs are named \verb"plex" and 143\verb"pyacc" to avoid name clashes with the corresponding UNIX utilities.) 144 145Having compiled \verb"expr.pas", you can execute the \verb"expr" program and 146type some expressions to see it work (terminate the program with an empty 147line). There is a number of other sample TP Lex and Yacc programs (\verb".l" 148and \verb".y" files) in the distribution, including a TP Yacc cross reference 149utility and a complete parser for Standard Pascal. 150 151The TP Lex and Yacc programs recognize some options which may be specified 152anywhere on the command line. E.g., 153\begin{quote}\begin{verbatim} 154 lex -o exprlex 155\end{verbatim}\end{quote} 156runs TP Lex with ``DFA optimization'' and 157\begin{quote}\begin{verbatim} 158 yacc -v expr 159\end{verbatim}\end{quote} 160runs TP Yacc in ``verbose'' mode (TP Yacc generates a readable description 161of the generated parser). 162 163The TP Lex and Yacc programs use the following default filename extensions: 164\begin{itemize} 165 \item \verb".l": TP Lex input files 166 \item \verb".y": TP Yacc input files 167 \item \verb".pas": TP Lex and Yacc output files 168\end{itemize} 169As usual, you may overwrite default filename extensions by explicitly 170specifying suffixes. 171 172If you ever forget how to run TP Lex and Yacc, you can issue the command 173\verb"lex" or \verb"yacc" (resp.\ \verb"plex" or \verb"pyacc") 174without arguments to get a short summary of the command line syntax. 175 176\section{TP Lex} 177 178This section describes the TP Lex lexical analyzer generator. 179 180\subsection{Usage} 181 182\begin{quote}\begin{verbatim} 183lex [options] lex-file[.l] [output-file[.pas]] 184\end{verbatim}\end{quote} 185 186\subsection{Options} 187 188\begin{description} 189 \item[\verb"-v"] 190 ``Verbose:'' Lex generates a readable description of the generated 191 lexical analyzer, written to lex-file with new extension \verb".lst". 192 \item[\verb"-o"] 193 ``Optimize:'' Lex optimizes DFA tables to produce a minimal DFA. 194\end{description} 195 196\subsection{Description} 197 198TP Lex is a program generator that is used to generate the Turbo Pascal 199source code for a lexical analyzer subroutine from the specification 200of an input language by a regular expression grammar. 201 202TP Lex parses the source grammar contained in \verb"lex-file" (with default 203suffix \verb".l") and writes the constructed lexical analyzer subroutine 204to the specified \verb"output-file" (with default suffix \verb".pas"); if no 205output file is specified, output goes to \verb"lex-file" with new suffix 206\verb".pas." If any errors are found during compilation, error messages are 207written to the list file (\verb"lex-file" with new suffix \verb".lst"). 208 209The generated output file contains a lexical analyzer routine, \verb"yylex", 210implemented as: 211\begin{quote}\begin{verbatim} 212 function yylex : Integer; 213\end{verbatim}\end{quote} 214 215This routine has to be called by your main program to execute the lexical 216analyzer. The return value of the \verb"yylex" routine usually denotes the 217number of a token recognized by the lexical analyzer (see the \verb"return" 218routine in the \verb"LexLib" unit). At end-of-file the \verb"yylex" routine 219normally returns \verb"0". 220 221The code template for the \verb"yylex" routine may be found in the 222\verb"yylex.cod" file. This file is needed by TP Lex when it constructs the 223output file. It must be present either in the current directory or in the 224directory from which TP Lex was executed (TP Lex searches these directories in 225the indicated order). (NB: For the Linux/Free Pascal version, the code 226template is searched in some directory defined at compile-time instead of the 227execution path, usually /usr/lib/fpc/lexyacc.) 228 229The TP Lex library (\verb"LexLib") unit is required by programs using 230Lex-generated lexical analyzers; you will therefore have to put an appropriate 231\verb"uses" clause into your program or unit that contains the lexical 232analyzer routine. The \verb"LexLib" unit also provides various useful utility 233routines; see the file \verb"lexlib.pas" for further information. 234 235\subsection{Lex Source} 236 237A TP Lex program consists of three sections separated with the \verb"%%" 238delimiter: 239 240\begin{quote}\begin{verbatim} 241definitions 242%% 243rules 244%% 245auxiliary procedures 246\end{verbatim}\end{quote} 247 248All sections may be empty. The TP Lex language is line-oriented; definitions 249and rules are separated by line breaks. There is no special notation for 250comments, but (Turbo Pascal style) comments may be included as Turbo Pascal 251fragments (see below). 252 253The definitions section may contain the following elements: 254\begin{itemize} 255 \item 256 regular definitions in the format: 257 \begin{quote}\begin{verbatim} 258 name substitution 259 \end{verbatim}\end{quote} 260 which serve to abbreviate common subexpressions. The \verb"{name}" 261 notation causes the corresponding substitution from the definitions 262 section to be inserted into a regular expression. The name must be 263 a legal identifier (letter followed by a sequence of letters and digits; 264 the underscore counts as a letter; upper- and lowercase are distinct). 265 Regular definitions must be non-recursive. 266 \item 267 start state definitions in the format: 268 \begin{quote}\begin{verbatim} 269 %start name ... 270 \end{verbatim}\end{quote} 271 which are used in specifying start conditions on rules (described 272 below). The \verb"%start" keyword may also be abbreviated as \verb"%s" 273 or \verb"%S". 274 \item 275 Turbo Pascal declarations enclosed between \verb"%{" and \verb"%}". 276 These will be inserted into the output file (at global scope). Also, 277 any line that does not look like a Lex definition (e.g., starts with 278 blank or tab) will be treated as Turbo Pascal code. (In particular, 279 this also allows you to include Turbo Pascal comments in your Lex 280 program.) 281\end{itemize} 282 283The rules section of a TP Lex program contains the actual specification of 284the lexical analyzer routine. It may be thought of as a big \verb"CASE" 285statement discriminating over the different patterns to be matched and listing the 286corresponding statements (actions) to be executed. Each rule consists of a 287regular expression describing the strings to be matched in the input, and a 288corresponding action, a Turbo Pascal statement to be executed when the 289expression matches. Expression and statement are delimited with whitespace 290(blanks and/or tabs). Thus the format of a Lex grammar rule is: 291 292\begin{quote}\begin{verbatim} 293 expression statement; 294\end{verbatim}\end{quote} 295 296Note that the action must be a single Turbo Pascal statement terminated 297with a semicolon (use \verb"begin ... end" for compound statements). The 298statement may span multiple lines if the successor lines are indented with 299at least one blank or tab. The action may also be replaced by the \verb"|" 300character, indicating that the action for this rule is the same as that for 301the next one. 302 303The TP Lex library unit provides various variables and routines which are 304useful in the programming of actions. In particular, the \verb"yytext" string 305variable holds the text of the matched string, and the \verb"yyleng" Byte 306variable its length. 307 308Regular expressions are used to describe the strings to be matched in a 309grammar rule. They are built from the usual constructs describing character 310classes and sequences, and operators specifying repetitions and alternatives. 311The precise format of regular expressions is described in the next section. 312 313The rules section may also start with some Turbo Pascal declarations 314(enclosed in \verb"%{ %}") which are treated as local declarations of the 315actions routine. 316 317Finally, the auxiliary procedures section may contain arbitrary Turbo 318Pascal code (such as supporting routines or a main program) which is 319simply tacked on to the end of the output file. The auxiliary procedures 320section is optional. 321 322\subsection{Regular Expressions} 323 324Table \ref{tab1} summarizes the format of the regular expressions 325recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig.\ 3.48). 326$c$ stands for a single character, $s$ for a string, $r$ for a regular 327expression, and $n,m$ for nonnegative integers. 328 329\begin{table*}\centering 330 \begin{tabular}{c|c|c} 331 \hline\hline 332 {\sc Expression}& {\sc Matches}& {\sc Example}\\ 333 \hline 334 $c$& any non-operator character $c$& \verb"a"\\ 335 \verb"\"$c$& character $c$ literally& \verb"\*"\\ 336 \verb'"'$s$\verb'"'& string $s$ literally& \verb'"**"'\\ 337 \verb"."& any character but newline& \verb"a.*b"\\ 338 \verb"^"& beginning of line& \verb"^abc"\\ 339 \verb"$"& end of line& \verb"abc$"\\ 340 \verb"["$s$\verb"]"& any character in $s$& \verb"[abc]"\\ 341 \verb"[^"$s$\verb"]"& any character not in $s$& \verb"[^abc]"\\ 342 $r$\verb"*"& zero or more $r$'s& \verb"a*"\\ 343 $r$\verb"+"& one or more $r$'s& \verb"a+"\\ 344 $r$\verb"?"& zero or one $r$& \verb"a?"\\ 345 $r$\verb"{"$m$\verb","$n$\verb"}"& $m$ to $n$ occurrences of $r$& \verb"a{1,5}"\\ 346 $r$\verb"{"$m$\verb"}"& $m$ occurrences of $r$& \verb"a{5}"\\ 347 $r_1r_2$& $r_1$ then $r_2$& \verb"ab"\\ 348 $r_1$\verb"|"$r_2$& $r_1$ or $r_2$& \verb"a|b"\\ 349 \verb"("$r$\verb")"& $r$& \verb"(a|b)"\\ 350 $r_1$\verb"/"$r_2$& $r_1$ when followed by $r_2$& \verb"a/b"\\ 351 \verb"<"$x$\verb">"$r$& $r$ when in start condition $x$& \verb"<x>abc"\\ 352 \hline 353 \end{tabular} 354 \caption{Regular expressions.} 355 \label{tab1} 356\end{table*} 357 358The operators \verb"*", \verb"+", \verb"?" and \verb"{}" have highest 359precedence, followed by concatenation. The \verb"|" operator has lowest 360precedence. Parentheses \verb"()" may be used to group expressions and 361overwrite default precedences. The \verb"<>" and \verb"/" operators may only 362occur once in an expression. 363 364The usual C-like escapes are recognized: 365\begin{itemize} 366 \item \verb"\n" denotes newline 367 \item \verb"\r" denotes carriage return 368 \item \verb"\t" denotes tab 369 \item \verb"\b" denotes backspace 370 \item \verb"\f" denotes form feed 371 \item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base 372\end{itemize} 373 374You can also use the \verb"\" character to quote characters which would 375otherwise be interpreted as operator symbols. In character classes, you may 376use the \verb"-" character to denote ranges of characters. For instance, 377\verb"[a-z]" denotes the class of all lowercase letters. 378 379The expressions in a TP Lex program may be ambigious, i.e. there may be inputs 380which match more than one rule. In such a case, the lexical analyzer prefers 381the longest match and, if it still has the choice between different rules, 382it picks the first of these. If no rule matches, the lexical analyzer 383executes a default action which consists of copying the input character 384to the output unchanged. Thus, if the purpose of a lexical analyzer is 385to translate some parts of the input, and leave the rest unchanged, you 386only have to specify the patterns which have to be treated specially. If, 387however, the lexical analyzer has to absorb its whole input, you will have 388to provide rules that match everything. E.g., you might use the rules 389\begin{quote}\begin{verbatim} 390 . | 391 \n ; 392\end{verbatim}\end{quote} 393which match ``any other character'' (and ignore it). 394 395Sometimes certain patterns have to be analyzed differently depending on some 396amount of context in which the pattern appears. In such a case the \verb"/" 397operator is useful. For instance, the expression \verb"a/b" matches \verb"a", 398but only if followed by \verb"b". Note that the \verb"b" does not belong to 399the match; rather, the lexical analyzer, when matching an \verb"a", will look 400ahead in the input to see whether it is followed by a \verb"b", before it 401declares that it has matched an \verb"a". Such lookahead may be arbitrarily 402complex (up to the size of the \verb"LexLib" input buffer). E.g., the pattern 403\verb"a/.*b" matches an \verb"a" which is followed by a \verb"b" somewhere on 404the same input line. TP Lex also has a means to specify left context which is 405described in the next section. 406 407\subsection{Start Conditions} 408 409TP Lex provides some features which make it possible to handle left context. 410The \verb"^" character at the beginning of a regular expression may be used 411to denote the beginning of the line. More distant left context can be described 412conveniently by using start conditions on rules. 413 414Any rule which is prefixed with the \verb"<>" construct is only valid if the 415lexical analyzer is in the denoted start state. For instance, the expression 416\verb"<x>a" can only be matched if the lexical analyzer is in start state 417\verb"x". You can have multiple start states in a rule; e.g., \verb"<x,y>a" 418can be matched in start states \verb"x" or \verb"y". 419 420Start states have to be declared in the definitions section by means of 421one or more start state definitions (see above). The lexical analyzer enters 422a start state through a call to the \verb"LexLib" routine \verb"start". E.g., 423you may write: 424 425\begin{quote}\begin{verbatim} 426%start x y 427%% 428<x>a start(y); 429<y>b start(x); 430%% 431begin 432 start(x); if yylex=0 then ; 433end. 434\end{verbatim}\end{quote} 435 436Upon initialization, the lexical analyzer is put into state \verb"x". It then 437proceeds in state \verb"x" until it matches an \verb"a" which puts it into 438state \verb"y". In state \verb"y" it may match a \verb"b" which puts it into 439state \verb"x" again, etc. 440 441Start conditions are useful when certain constructs have to be analyzed 442differently depending on some left context (such as a special character 443at the beginning of the line), and if multiple lexical analyzers have to 444work in concert. If a rule is not prefixed with a start condition, it is 445valid in all user-defined start states, as well as in the lexical analyzer's 446default start state. 447 448\subsection{Lex Library} 449 450The TP Lex library (\verb"LexLib") unit provides various variables and 451routines which are used by Lex-generated lexical analyzers and application 452programs. It provides the input and output streams and other internal data 453structures used by the lexical analyzer routine, and supplies some variables 454and utility routines which may be used by actions and application programs. 455Refer to the file \verb"lexlib.pas" for a closer description. 456 457You can also modify the Lex library unit (and/or the code template in the 458\verb"yylex.cod" file) to customize TP Lex to your target applications. E.g., 459you might wish to optimize the code of the lexical analyzer for some 460special application, make the analyzer read from/write to memory instead 461of files, etc. 462 463\subsection{Implementation Restrictions} 464 465Internal table sizes and the main memory available limit the complexity of 466source grammars that TP Lex can handle. There is currently no possibility to 467change internal table sizes (apart from modifying the sources of TP Lex 468itself), but the maximum table sizes provided by TP Lex seem to be large 469enough to handle most realistic applications. The actual table sizes depend on 470the particular implementation (they are much larger than the defaults if TP 471Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or 472Free Pascal), and are shown in the statistics printed by TP Lex when a 473compilation is finished. The units given there are ``p'' (positions, i.e. 474items in the position table used to construct the DFA), ``s'' (DFA states) and 475``t'' (transitions of the generated DFA). 476 477As implemented, the generated DFA table is stored as a typed array constant 478which is inserted into the \verb"yylex.cod" code template. The transitions in 479each state are stored in order. Of course it would have been more efficient to 480generate a big \verb"CASE" statement instead, but I found that this may cause 481problems with the encoding of large DFA tables because Turbo Pascal has 482a quite rigid limit on the code size of individual procedures. I decided to 483use a scheme in which transitions on different symbols to the same state are 484merged into one single transition (specifying a character set and the 485corresponding next state). This keeps the number of transitions in each state 486quite small and still allows a fairly efficient access to the transition 487table. 488 489The TP Lex program has an option (\verb"-o") to optimize DFA tables. This 490causes a minimal DFA to be generated, using the algorithm described in Aho, 491Sethi, Ullman (1986). Although the absolute limit on the number of DFA states 492that TP Lex can handle is at least 300, TP Lex poses an additional restriction 493(100) on the number of states in the initial partition of the DFA optimization 494algorithm. Thus, you may get a fatal \verb"integer set overflow" message when 495using the \verb"-o" option even when TP Lex is able to generate an unoptimized 496DFA. In such cases you will just have to be content with the unoptimized DFA. 497(Hopefully, this will be fixed in a future version. Anyhow, using the merged 498transitions scheme described above, TP Lex usually constructs unoptimized 499DFA's which are not far from being optimal, and thus in most cases DFA 500optimization won't have a great impact on DFA table sizes.) 501 502\subsection{Differences from UNIX Lex} 503 504Major differences between TP Lex and UNIX Lex are listed below. 505 506\begin{itemize} 507 \item 508 TP Lex produces output code for Turbo Pascal, rather than for C. 509 \item 510 Character tables (\verb"%T") are not supported; neither are any 511 directives to determine internal table sizes (\verb"%p", \verb"%n", 512 etc.). 513 \item 514 Library routines are named differently from the UNIX version (e.g., 515 the \verb"start" routine takes the place of the \verb"BEGIN" macro of 516 UNIX Lex), and, of course, all macros of UNIX Lex (\verb"ECHO", 517 \verb"REJECT", etc.) had to be implemented as procedures. 518 \item 519 The TP Lex library unit starts counting line numbers at 0, incrementing 520 the count {\em before\/} a line is read (in contrast, UNIX Lex 521 initializes \verb"yylineno" to 1 and increments it {\em after\/} the 522 line end has been read). This is motivated by the way in which TP Lex 523 maintains the current line, and will not affect your programs unless you 524 explicitly reset the \verb"yylineno" value (e.g., when opening a new 525 input file). In such a case you should set \verb"yylineno" to 0 rather 526 than 1. 527\end{itemize} 528 529\section{TP Yacc} 530 531This section describes the TP Yacc compiler compiler. 532 533\subsection{Usage} 534 535\begin{quote}\begin{verbatim} 536yacc [options] yacc-file[.y] [output-file[.pas]] 537\end{verbatim}\end{quote} 538 539\subsection{Options} 540 541\begin{description} 542 \item[\verb"-v"] 543 ``Verbose:'' TP Yacc generates a readable description of the generated 544 parser, written to \verb"yacc-file" with new extension \verb".lst". 545 \item[\verb"-d"] 546 ``Debug:'' TP Yacc generates parser with debugging output. 547\end{description} 548 549\subsection{Description} 550 551TP Yacc is a program that lets you prepare parsers from the description 552of input languages by BNF-like grammars. You simply specify the grammar 553for your target language, augmented with the Turbo Pascal code necessary 554to process the syntactic constructs, and TP Yacc translates your grammar 555into the Turbo Pascal code for a corresponding parser subroutine named 556\verb"yyparse". 557 558TP Yacc parses the source grammar contained in \verb"yacc-file" (with default 559suffix \verb".y") and writes the constructed parser subroutine to the 560specified \verb"output-file" (with default suffix \verb".pas"); if no output 561file is specified, output goes to \verb"yacc-file" with new suffix 562\verb".pas". If any errors are found during compilation, error messages are 563written to the list file (\verb"yacc-file" with new suffix \verb".lst"). 564 565The generated parser routine, \verb"yyparse", is declared as: 566 567\begin{quote}\begin{verbatim} 568 function yyparse : Integer; 569\end{verbatim}\end{quote} 570 571This routine may be called by your main program to execute the parser. 572The return value of the \verb"yyparse" routine denotes success or failure of 573the parser (possible return values: 0 = success, 1 = unrecoverable syntax 574error or parse stack overflow). 575 576Similar to TP Lex, the code template for the \verb"yyparse" routine may be 577found in the \verb"yyparse.cod" file. The rules for locating this file are 578analogous to those of TP Lex (see Section {\em TP Lex\/}). 579 580The TP Yacc library (\verb"YaccLib") unit is required by programs using Yacc- 581generated parsers; you will therefore have to put an appropriate \verb"uses" 582clause into your program or unit that contains the parser routine. The 583\verb"YaccLib" unit also provides some routines which may be used to control 584the actions of the parser. See the file \verb"yacclib.pas" for further 585information. 586 587\subsection{Yacc Source} 588 589A TP Yacc program consists of three sections separated with the \verb"%%" 590delimiter: 591 592\begin{quote}\begin{verbatim} 593definitions 594%% 595rules 596%% 597auxiliary procedures 598\end{verbatim}\end{quote} 599 600The TP Yacc language is free-format: whitespace (blanks, tabs and newlines) 601is ignored, except if it serves as a delimiter. Comments have the C-like 602format \verb"/* ... */". They are treated as whitespace. Grammar symbols are 603denoted by identifiers which have the usual form (letter, including 604underscore, followed by a sequence of letters and digits; upper- and 605lowercase is distinct). The TP Yacc language also has some keywords which 606always start with the \verb"%" character. Literals are denoted by characters 607enclosed in single quotes. The usual C-like escapes are recognized: 608 609\begin{itemize} 610 \item \verb"\n" denotes newline 611 \item \verb"\r" denotes carriage return 612 \item \verb"\t" denotes tab 613 \item \verb"\b" denotes backspace 614 \item \verb"\f" denotes form feed 615 \item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base 616\end{itemize} 617 618\subsection{Definitions} 619 620The first section of a TP Yacc grammar serves to define the symbols used in 621the grammar. It may contain the following types of definitions: 622 623\begin{itemize} 624 \item 625 start symbol definition: A definition of the form 626 \begin{quote}\begin{verbatim} 627 %start symbol 628 \end{verbatim}\end{quote} 629 declares the start nonterminal of the grammar (if this definition is 630 omitted, TP Yacc assumes the left-hand side nonterminal of the first 631 grammar rule as the start symbol of the grammar). 632 \item 633 terminal definitions: Definitions of the form 634 \begin{quote}\begin{verbatim} 635 %token symbol ... 636 \end{verbatim}\end{quote} 637 are used to declare the terminal symbols (``tokens'') of the target 638 language. Any identifier not introduced in a \verb"%token" definition 639 will be treated as a nonterminal symbol. 640 641 As far as TP Yacc is concerned, tokens are atomic symbols which do not 642 have an innert structure. A lexical analyzer must be provided which 643 takes on the task of tokenizing the input stream and return the 644 individual tokens and literals to the parser (see Section {\em Lexical 645 Analysis\/}). 646 \item 647 precedence definitions: Operator symbols (terminals) may be associated 648 with a precedence by means of a precedence definition which may have 649 one of the following forms 650 \begin{quote}\begin{verbatim} 651 %left symbol ... 652 %right symbol ... 653 %nonassoc symbol ... 654 \end{verbatim}\end{quote} 655 which are used to declare left-, right- and nonassociative operators, 656 respectively. Each precedence definition introduces a new precedence 657 level, lowest precedence first. E.g., you may write: 658 \begin{quote}\begin{verbatim} 659 %nonassoc '<' '>' '=' GEQ LEQ NEQ 660 /* relational operators */ 661 %left '+' '-' OR 662 /* addition operators */ 663 %left '*' '/' AND 664 /* multiplication operators */ 665 %right NOT UMINUS 666 /* unary operators */ 667 \end{verbatim}\end{quote} 668 669 A terminal identifier introduced in a precedence definition may, but 670 need not, appear in a \verb"%token" definition as well. 671 \item 672 type definitions: Any (terminal or nonterminal) grammar symbol may be 673 associated with a type identifier which is used in the processing of 674 semantic values. Type tags of the form \verb"<name>" may be used in 675 token and precedence definitions to declare the type of a terminal 676 symbol, e.g.: 677 \begin{quote}\begin{verbatim} 678 %token <Real> NUM 679 %left <AddOp> '+' '-' 680 \end{verbatim}\end{quote} 681 682 To declare the type of a nonterminal symbol, use a type definition of 683 the form: 684 \begin{quote}\begin{verbatim} 685 %type <name> symbol ... 686 \end{verbatim}\end{quote} 687 e.g.: 688 \begin{quote}\begin{verbatim} 689 %type <Real> expr 690 \end{verbatim}\end{quote} 691 692 In a \verb"%type" definition, you may also omit the nonterminals, i.e. 693 you may write: 694 \begin{quote}\begin{verbatim} 695 %type <name> 696 \end{verbatim}\end{quote} 697 698 This is useful when a given type is only used with type casts (see 699 Section {\em Grammar Rules and Actions\/}), and is not associated with 700 a specific nonterminal. 701 \item 702 Turbo Pascal declarations: You may also include arbitrary Turbo Pascal 703 code in the definitions section, enclosed in \verb"%{ %}". This code 704 will be inserted as global declarations into the output file, unchanged. 705\end{itemize} 706 707\subsection{Grammar Rules and Actions} 708 709The second part of a TP Yacc grammar contains the grammar rules for the 710target language. Grammar rules have the format 711 712\begin{quote}\begin{verbatim} 713 name : symbol ... ; 714\end{verbatim}\end{quote} 715 716The left-hand side of a rule must be an identifier (which denotes a 717nonterminal symbol). The right-hand side may be an arbitrary (possibly 718empty) sequence of nonterminal and terminal symbols (including literals 719enclosed in single quotes). The terminating semicolon may also be omitted. 720Different rules for the same left-hand side symbols may be written using 721the \verb"|" character to separate the different alternatives: 722 723\begin{quote}\begin{verbatim} 724 name : symbol ... 725 | symbol ... 726 ... 727 ; 728\end{verbatim}\end{quote} 729 730For instance, to specify a simple grammar for arithmetic expressions, you 731may write: 732 733\begin{quote}\begin{verbatim} 734%left '+' '-' 735%left '*' '/' 736%token NUM 737%% 738expr : expr '+' expr 739 | expr '-' expr 740 | expr '*' expr 741 | expr '/' expr 742 | '(' expr ')' 743 | NUM 744 ; 745\end{verbatim}\end{quote} 746 747(The \verb"%left" definitions at the beginning of the grammar are needed to 748specify the precedence and associativity of the operator symbols. This will be 749discussed in more detail in Section {\em Ambigious Grammars\/}.) 750 751Grammar rules may contain actions -- Turbo Pascal statements enclosed in 752\verb"{ }" -- to be executed as the corresponding rules are recognized. 753Furthermore, rules may return values, and access values returned by other 754rules. These ``semantic'' values are written as \verb"$$" (value of the 755left-hand side nonterminal) and \verb"$i" (value of the $i$th right-hand 756side symbol). They are kept on a special value stack which is maintained 757automatically by the parser. 758 759Values associated with terminal symbols must be set by the lexical analyzer 760(more about this in Section {\em Lexical Analysis\/}). Actions of the form 761\verb"$$ := $1" can frequently be omitted, since it is the default action 762assumed by TP Yacc for any rule that does not have an explicit action. 763 764By default, the semantic value type provided by Yacc is \verb"Integer". You 765can also put a declaration like 766\begin{quote}\begin{verbatim} 767 %{ 768 type YYSType = Real; 769 %} 770\end{verbatim}\end{quote} 771into the definitions section of your Yacc grammar to change the default value 772type. However, if you have different value types, the preferred method is to 773use type definitions as discussed in Section {\em Definitions\/}. When such 774type definitions are given, TP Yacc handles all the necessary details of the 775\verb"YYSType" definition and also provides a fair amount of type checking 776which makes it easier to find type errors in the grammar. 777 778For instance, we may declare the symbols \verb"NUM" and \verb"expr" in the 779example above to be of type \verb"Real", and then use these values to 780evaluate an expression as it is parsed. 781 782\begin{quote}\begin{verbatim} 783%left '+' '-' 784%left '*' '/' 785%token <Real> NUM 786%type <Real> expr 787%% 788expr : expr '+' expr { $$ := $1+$3; } 789 | expr '-' expr { $$ := $1-$3; } 790 | expr '*' expr { $$ := $1*$3; } 791 | expr '/' expr { $$ := $1/$3; } 792 | '(' expr ')' { $$ := $2; } 793 | NUM 794 ; 795\end{verbatim}\end{quote} 796 797(Note that we omitted the action of the last rule. The ``copy action'' 798\verb"$$ := $1" required by this rule is automatically added by TP Yacc.) 799 800Actions may not only appear at the end, but also in the middle of a rule 801which is useful to perform some processing before a rule is fully parsed. 802Such actions inside a rule are treated as special nonterminals which are 803associated with an empty right-hand side. Thus, a rule like 804\begin{quote}\begin{verbatim} 805 x : y { action; } z 806\end{verbatim}\end{quote} 807will be treated as: 808\begin{quote}\begin{verbatim} 809 x : y $act z 810 $act : { action; } 811\end{verbatim}\end{quote} 812 813Actions inside a rule may also access values to the left of the action, 814and may return values by assigning to the \verb"$$" value. The value returned 815by such an action can then be accessed by other actions using the usual 816\verb"$i" notation. E.g., we may write: 817\begin{quote}\begin{verbatim} 818 x : y { $$ := 2*$1; } z { $$ := $2+$3; } 819\end{verbatim}\end{quote} 820which has the effect of setting the value of \verb"x" to 821\begin{quote}\begin{verbatim} 822 2*(the value of y)+(the value of z). 823\end{verbatim}\end{quote} 824 825Sometimes it is desirable to access values in enclosing rules. This can be 826done using the notation \verb"$i" with $i\leq 0$. \verb"$0" refers to the 827first value ``to the left'' of the current rule, \verb"$-1" to the second, 828and so on. Note that in this case the referenced value depends on the actual 829contents of the parse stack, so you have to make sure that the requested 830values are always where you expect them. 831 832There are some situations in which TP Yacc cannot easily determine the 833type of values (when a typed parser is used). This is true, in particular, 834for values in enclosing rules and for the \verb"$$" value in an action inside 835a rule. In such cases you may use a type cast to explicitly specify the type 836of a value. The format for such type casts is \verb"$<name>$" (for left-hand 837side values) and \verb"$<name>i" (for right-hand side values) where 838\verb"name" is a type identifier (which must occur in a \verb"%token", 839precedence or \verb"%type" definition). 840 841\subsection{Auxiliary Procedures} 842 843The third section of a TP Yacc program is optional. If it is present, it 844may contain any Turbo Pascal code (such as supporting routines or a main 845program) which is tacked on to the end of the output file. 846 847\subsection{Lexical Analysis} 848 849For any TP Yacc-generated parser, the programmer must supply a lexical 850analyzer routine named \verb"yylex" which performs the lexical analysis for 851the parser. This routine must be declared as 852 853\begin{quote}\begin{verbatim} 854 function yylex : Integer; 855\end{verbatim}\end{quote} 856 857The \verb"yylex" routine may either be prepared by hand, or by using the 858lexical analyzer generator TP Lex (see Section {\em TP Lex\/}). 859 860The lexical analyzer must be included in your main program behind the 861parser subroutine (the \verb"yyparse" code template includes a forward 862definition of the \verb"yylex" routine such that the parser can access the 863lexical analyzer). For instance, you may put the lexical analyzer 864routine into the auxiliary procedures section of your TP Yacc grammar, 865either directly, or by using the the Turbo Pascal include directive 866(\verb"$I"). 867 868The parser repeatedly calls the \verb"yylex" routine to tokenize the input 869stream and obtain the individual lexical items in the input. For any 870literal character, the \verb"yylex" routine has to return the corresponding 871character code. For the other, symbolic, terminals of the input language, 872the lexical analyzer must return corresponding integer codes. These are 873assigned automatically by TP Yacc in the order in which token definitions 874appear in the definitions section of the source grammar. The lexical 875analyzer can access these values through corresponding integer constants 876which are declared by TP Yacc in the output file. 877 878For instance, if 879\begin{quote}\begin{verbatim} 880 %token NUM 881\end{verbatim}\end{quote} 882is the first definition in the Yacc grammar, then TP Yacc will create 883a corresponding constant declaration 884\begin{quote}\begin{verbatim} 885 const NUM = 257; 886\end{verbatim}\end{quote} 887in the output file (TP Yacc automatically assigns symbolic token numbers 888starting at 257; 1 thru 255 are reserved for character literals, 0 denotes 889end-of-file, and 256 is reserved for the special error token which will be 890discussed in Section {\em Error Handling\/}). This definition may then be 891used, e.g., in a corresponding TP Lex program as follows: 892\begin{quote}\begin{verbatim} 893 [0-9]+ return(NUM); 894\end{verbatim}\end{quote} 895 896You can also explicitly assign token numbers in the grammar. For this 897purpose, the first occurrence of a token identifier in the definitions 898section may be followed by an unsigned integer. E.g. you may write: 899\begin{quote}\begin{verbatim} 900 %token NUM 299 901\end{verbatim}\end{quote} 902 903Besides the return value of \verb"yylex", the lexical analyzer routine may 904also return an additional semantic value for the recognized token. This value 905is assigned to a variable named \verb"yylval" and may then be accessed in 906actions through the \verb"$i" notation (see above, Section {\em Grammar 907Rules and Actions\/}). The \verb"yylval" variable is of type \verb"YYSType" 908(the semantic value type, \verb"Integer" by default); its declaration may be 909found in the \verb"yyparse.cod" file. 910 911For instance, to assign an \verb"Integer" value to a \verb"NUM" token in the 912above example, we may write: 913 914\begin{quote}\begin{verbatim} 915 [0-9]+ begin 916 val(yytext, yylval, code); 917 return(NUM); 918 end; 919\end{verbatim}\end{quote} 920 921This assigns \verb"yylval" the value of the \verb"NUM" token (using the Turbo 922Pascal standard procedure \verb"val"). 923 924If a parser uses tokens of different types (via a \verb"%token <name>" 925definition), then the \verb"yylval" variable will not be of type 926\verb"Integer", but instead of a corresponding variant record type which is 927capable of holding all the different value types declared in the TP Yacc 928grammar. In this case, the lexical analyzer must assign a semantic value to 929the corresponding record component which is named \verb"yy"{\em name\/} 930(where {\em name\/} stands for the corresponding type identifier). 931 932E.g., if token \verb"NUM" is declared \verb"Real": 933\begin{quote}\begin{verbatim} 934 %token <Real> NUM 935\end{verbatim}\end{quote} 936then the value for token \verb"NUM" must be assigned to \verb"yylval.yyReal". 937 938\subsection{How The Parser Works} 939 940TP Yacc uses the LALR(1) technique developed by Donald E.\ Knuth and F.\ 941DeRemer to construct a simple, efficient, non-backtracking bottom-up 942parser for the source grammar. The LALR parsing technique is described 943in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a 944look at the parser description TP Yacc generates from a small sample 945grammar, to get an idea of how the LALR parsing algorithm works. We 946consider the following simplified version of the arithmetic expression 947grammar: 948 949\begin{quote}\begin{verbatim} 950%token NUM 951%left '+' 952%left '*' 953%% 954expr : expr '+' expr 955 | expr '*' expr 956 | '(' expr ')' 957 | NUM 958 ; 959\end{verbatim}\end{quote} 960 961When run with the \verb"-v" option on the above grammar, TP Yacc generates 962the parser description listed below. 963 964\begin{quote}\begin{verbatim} 965state 0: 966 967 $accept : _ expr $end 968 969 '(' shift 2 970 NUM shift 3 971 . error 972 973 expr goto 1 974 975state 1: 976 977 $accept : expr _ $end 978 expr : expr _ '+' expr 979 expr : expr _ '*' expr 980 981 $end accept 982 '*' shift 4 983 '+' shift 5 984 . error 985 986state 2: 987 988 expr : '(' _ expr ')' 989 990 '(' shift 2 991 NUM shift 3 992 . error 993 994 expr goto 6 995 996state 3: 997 998 expr : NUM _ (4) 999 1000 . reduce 4 1001 1002state 4: 1003 1004 expr : expr '*' _ expr 1005 1006 '(' shift 2 1007 NUM shift 3 1008 . error 1009 1010 expr goto 7 1011 1012state 5: 1013 1014 expr : expr '+' _ expr 1015 1016 '(' shift 2 1017 NUM shift 3 1018 . error 1019 1020 expr goto 8 1021 1022state 6: 1023 1024 expr : '(' expr _ ')' 1025 expr : expr _ '+' expr 1026 expr : expr _ '*' expr 1027 1028 ')' shift 9 1029 '*' shift 4 1030 '+' shift 5 1031 . error 1032 1033state 7: 1034 1035 expr : expr '*' expr _ (2) 1036 expr : expr _ '+' expr 1037 expr : expr _ '*' expr 1038 1039 . reduce 2 1040 1041state 8: 1042 1043 expr : expr '+' expr _ (1) 1044 expr : expr _ '+' expr 1045 expr : expr _ '*' expr 1046 1047 '*' shift 4 1048 $end reduce 1 1049 ')' reduce 1 1050 '+' reduce 1 1051 . error 1052 1053state 9: 1054 1055 expr : '(' expr ')' _ (3) 1056 1057 . reduce 3 1058\end{verbatim}\end{quote} 1059 1060Each state of the parser corresponds to a certain prefix of the input 1061which has already been seen. The parser description lists the grammar 1062rules wich are parsed in each state, and indicates the portion of each 1063rule which has already been parsed by an underscore. In state 0, the 1064start state of the parser, the parsed rule is 1065\begin{quote}\begin{verbatim} 1066 $accept : expr $end 1067\end{verbatim}\end{quote} 1068 1069This is not an actual grammar rule, but a starting rule automatically 1070added by TP Yacc. In general, it has the format 1071\begin{quote}\begin{verbatim} 1072 $accept : X $end 1073\end{verbatim}\end{quote} 1074where \verb"X" is the start nonterminal of the grammar, and \verb"$end" is 1075a pseudo token denoting end-of-input (the \verb"$end" symbol is used by the 1076parser to determine when it has successfully parsed the input). 1077 1078The description of the start rule in state 0, 1079\begin{quote}\begin{verbatim} 1080 $accept : _ expr $end 1081\end{verbatim}\end{quote} 1082with the underscore positioned before the \verb"expr" symbol, indicates that 1083we are at the beginning of the parse and are ready to parse an expression 1084(nonterminal \verb"expr"). 1085 1086The parser maintains a stack to keep track of states visited during the 1087parse. There are two basic kinds of actions in each state: {\em shift\/}, 1088which reads an input symbol and pushes the corresponding next state on top of 1089the stack, and {\em reduce\/} which pops a number of states from the stack 1090(corresponding to the number of right-hand side symbols of the rule used 1091in the reduction) and consults the {\em goto\/} entries of the uncovered 1092state to find the transition corresponding to the left-hand side symbol of the 1093reduced rule. 1094 1095In each step of the parse, the parser is in a given state (the state on 1096top of its stack) and may consult the current {\em lookahead symbol\/}, the 1097next symbol in the input, to determine the parse action -- shift or reduce -- 1098to perform. The parser terminates as soon as it reaches state 1 and reads 1099in the endmarker, indicated by the {\em accept\/} action on \verb"$end" in 1100state 1. 1101 1102Sometimes the parser may also carry out an action without inspecting the 1103current lookahead token. This is the case, e.g., in state 3 where the 1104only action is reduction by rule 4: 1105\begin{quote}\begin{verbatim} 1106 . reduce 4 1107\end{verbatim}\end{quote} 1108 1109The default action in a state can also be {\em error\/} indicating that any 1110other input represents a syntax error. (In case of such an error the 1111parser will start syntactic error recovery, as described in Section 1112{\em Error Handling\/}.) 1113 1114Now let us see how the parser responds to a given input. We consider the 1115input string \verb"2+5*3" which is presented to the parser as the token 1116sequence: 1117\begin{quote}\begin{verbatim} 1118 NUM + NUM * NUM 1119\end{verbatim}\end{quote} 1120 1121Table \ref{tab2} traces the corresponding actions of the parser. We also 1122show the current state in each move, and the remaining states on the stack. 1123 1124\begin{table*}\centering 1125 \begin{tabular}{l|l|l|p{8cm}} 1126 \hline\hline 1127 {\sc State}& {\sc Stack}& {\sc Lookahead}& {\sc Action}\\ 1128 \hline 11290 & & \verb"NUM" & shift state 3\\ 11303 & 0 & & reduce rule 4 (pop 1 state, uncovering state 1131 0, then goto state 1 on symbol \verb"expr")\\ 11321 & 0 & \verb"+" & shift state 5\\ 11335 & 1 0 & \verb"NUM" & shift state 3\\ 11343 & 5 1 0 & & reduce rule 4 (pop 1 state, uncovering state 1135 5, then goto state 8 on symbol \verb"expr")\\ 11368 & 5 1 0 & \verb"*" & shift 4\\ 11374 & 8 5 1 0 & \verb"NUM" & shift 3\\ 11383 & 4 8 5 1 0 & & reduce rule 4 (pop 1 state, uncovering state 1139 4, then goto state 7 on symbol \verb"expr")\\ 11407 & 4 8 5 1 0 & & reduce rule 2 (pop 3 states, uncovering state 1141 5, then goto state 8 on symbol \verb"expr")\\ 11428 & 5 1 0 & \verb"$end" & reduce rule 1 (pop 3 states, uncovering state 1143 0, then goto state 1 on symbol \verb"expr")\\ 11441 & 0 & \verb"$end" & accept\\ 1145 \hline 1146 \end{tabular} 1147 \caption{Parse of \protect\verb"NUM + NUM * NUM".} 1148 \label{tab2} 1149\end{table*} 1150 1151It is also instructive to see how the parser responds to illegal inputs. 1152E.g., you may try to figure out what the parser does when confronted with: 1153\begin{quote}\begin{verbatim} 1154 NUM + ) 1155\end{verbatim}\end{quote} 1156or: 1157\begin{quote}\begin{verbatim} 1158 ( NUM * NUM 1159\end{verbatim}\end{quote} 1160 1161You will find that the parser, sooner or later, will always run into an 1162error action when confronted with errorneous inputs. An LALR parser will 1163never shift an invalid symbol and thus will always find syntax errors as 1164soon as it is possible during a left-to-right scan of the input. 1165 1166TP Yacc provides a debugging option (\verb"-d") that may be used to trace 1167the actions performed by the parser. When a grammar is compiled with the 1168\verb"-d" option, the generated parser will print out the actions as it 1169parses its input. 1170 1171\subsection{Ambigious Grammars} 1172 1173There are situations in which TP Yacc will not produce a valid parser for 1174a given input language. LALR(1) parsers are restricted to one-symbol 1175lookahead on which they have to base their parsing decisions. If a 1176grammar is ambigious, or cannot be parsed unambigiously using one-symbol 1177lookahead, TP Yacc will generate parsing conflicts when constructing the 1178parse table. There are two types of such conflicts: {\em shift/reduce 1179conflicts\/} (when there is both a shift and a reduce action for a given 1180input symbol in a given state), and {\em reduce/reduce\/} conflicts (if 1181there is more than one reduce action for a given input symbol in a given 1182state). Note that there never will be a shift/shift conflict. 1183 1184When a grammar generates parsing conflicts, TP Yacc prints out the number 1185of shift/reduce and reduce/reduce conflicts it encountered when constructing 1186the parse table. However, TP Yacc will still generate the output code for the 1187parser. To resolve parsing conflicts, TP Yacc uses the following built-in 1188disambiguating rules: 1189 1190\begin{itemize} 1191 \item 1192 in a shift/reduce conflict, TP Yacc chooses the shift action. 1193 \item 1194 in a reduce/reduce conflict, TP Yacc chooses reduction of the first 1195 grammar rule. 1196\end{itemize} 1197 1198The shift/reduce disambiguating rule correctly resolves a type of 1199ambiguity known as the ``dangling-else ambiguity'' which arises in the 1200syntax of conditional statements of many programming languages (as in 1201Pascal): 1202 1203\begin{quote}\begin{verbatim} 1204%token IF THEN ELSE 1205%% 1206stmt : IF expr THEN stmt 1207 | IF expr THEN stmt ELSE stmt 1208 ; 1209\end{verbatim}\end{quote} 1210 1211This grammar is ambigious, because a nested construct like 1212\begin{quote}\begin{verbatim} 1213 IF expr-1 THEN IF expr-2 THEN stmt-1 1214 ELSE stmt-2 1215\end{verbatim}\end{quote} 1216can be parsed two ways, either as: 1217\begin{quote}\begin{verbatim} 1218 IF expr-1 THEN ( IF expr-2 THEN stmt-1 1219 ELSE stmt-2 ) 1220\end{verbatim}\end{quote} 1221or as: 1222\begin{quote}\begin{verbatim} 1223 IF expr-1 THEN ( IF expr-2 THEN stmt-1 ) 1224 ELSE stmt-2 1225\end{verbatim}\end{quote} 1226 1227The first interpretation makes an \verb"ELSE" belong to the last unmatched 1228\verb"IF" which also is the interpretation chosen in most programming 1229languages. This is also the way that a TP Yacc-generated parser will parse 1230the construct since the shift/reduce disambiguating rule has the effect of 1231neglecting the reduction of \verb"IF expr-2 THEN stmt-1"; instead, the parser 1232will shift the \verb"ELSE" symbol which eventually leads to the reduction of 1233\verb"IF expr-2 THEN stmt-1 ELSE stmt-2". 1234 1235The reduce/reduce disambiguating rule is used to resolve conflicts that 1236arise when there is more than one grammar rule matching a given construct. 1237Such ambiguities are often caused by ``special case constructs'' which may be 1238given priority by simply listing the more specific rules ahead of the more 1239general ones. 1240 1241For instance, the following is an excerpt from the grammar describing the 1242input language of the UNIX equation formatter EQN: 1243 1244\begin{quote}\begin{verbatim} 1245%right SUB SUP 1246%% 1247expr : expr SUB expr SUP expr 1248 | expr SUB expr 1249 | expr SUP expr 1250 ; 1251\end{verbatim}\end{quote} 1252 1253Here, the \verb"SUB" and \verb"SUP" operator symbols denote sub- and 1254superscript, respectively. The rationale behind this example is that 1255an expression involving both sub- and superscript is often set differently 1256from a superscripted subscripted expression (compare $x_i^n$ to ${x_i}^n$). 1257This special case is therefore caught by the first rule in the above example 1258which causes a reduce/reduce conflict with rule 3 in expressions like 1259\verb"expr-1 SUB expr-2 SUP expr-3". The conflict is resolved in favour of 1260the first rule. 1261 1262In both cases discussed above, the ambiguities could also be eliminated 1263by rewriting the grammar accordingly (although this yields more complicated 1264and less readable grammars). This may not always be the case. Often 1265ambiguities are also caused by design errors in the grammar. Hence, if 1266TP Yacc reports any parsing conflicts when constructing the parser, you 1267should use the \verb"-v" option to generate the parser description 1268(\verb".lst" file) and check whether TP Yacc resolved the conflicts correctly. 1269 1270There is one type of syntactic constructs for which one often deliberately 1271uses an ambigious grammar as a more concise representation for a language 1272that could also be specified unambigiously: the syntax of expressions. 1273For instance, the following is an unambigious grammar for simple arithmetic 1274expressions: 1275 1276\begin{quote}\begin{verbatim} 1277%token NUM 1278 1279%% 1280 1281expr : term 1282 | expr '+' term 1283 ; 1284 1285term : factor 1286 | term '*' factor 1287 ; 1288 1289factor : '(' expr ')' 1290 | NUM 1291 ; 1292\end{verbatim}\end{quote} 1293 1294You may check yourself that this grammar gives \verb"*" a higher precedence 1295than \verb"+" and makes both operators left-associative. The same effect can 1296be achieved with the following ambigious grammar using precedence definitions: 1297 1298\begin{quote}\begin{verbatim} 1299%token NUM 1300%left '+' 1301%left '*' 1302%% 1303expr : expr '+' expr 1304 | expr '*' expr 1305 | '(' expr ')' 1306 | NUM 1307 ; 1308\end{verbatim}\end{quote} 1309 1310Without the precedence definitions, this is an ambigious grammar causing 1311a number of shift/reduce conflicts. The precedence definitions are used 1312to correctly resolve these conflicts (conflicts resolved using precedence 1313will not be reported by TP Yacc). 1314 1315Each precedence definition introduces a new precedence level (lowest 1316precedence first) and specifies whether the corresponding operators 1317should be left-, right- or nonassociative (nonassociative operators 1318cannot be combined at all; example: relational operators in Pascal). 1319 1320TP Yacc uses precedence information to resolve shift/reduce conflicts as 1321follows. Precedences are associated with each terminal occuring in a 1322precedence definition. Furthermore, each grammar rule is given the 1323precedence of its rightmost terminal (this default choice can be 1324overwritten using a \verb"%prec" tag; see below). To resolve a shift/reduce 1325conflict using precedence, both the symbol and the rule involved must 1326have been assigned precedences. TP Yacc then chooses the parse action 1327as follows: 1328 1329\begin{itemize} 1330 \item 1331 If the symbol has higher precedence than the rule: shift. 1332 \item 1333 If the rule has higher precedence than the symbol: reduce. 1334 \item 1335 If symbol and rule have the same precedence, the associativity of the 1336 symbol determines the parse action: if the symbol is left-associative: 1337 reduce; if the symbol is right-associative: shift; if the symbol is 1338 non-associative: error. 1339\end{itemize} 1340 1341To give you an idea of how this works, let us consider our ambigious 1342arithmetic expression grammar (without precedences): 1343 1344\begin{quote}\begin{verbatim} 1345%token NUM 1346%% 1347expr : expr '+' expr 1348 | expr '*' expr 1349 | '(' expr ')' 1350 | NUM 1351 ; 1352\end{verbatim}\end{quote} 1353 1354This grammar generates four shift/reduce conflicts. The description 1355of state 8 reads as follows: 1356 1357\begin{quote}\begin{verbatim} 1358state 8: 1359 1360 *** conflicts: 1361 1362 shift 4, reduce 1 on '*' 1363 shift 5, reduce 1 on '+' 1364 1365 expr : expr '+' expr _ (1) 1366 expr : expr _ '+' expr 1367 expr : expr _ '*' expr 1368 1369 '*' shift 4 1370 '+' shift 5 1371 $end reduce 1 1372 ')' reduce 1 1373 . error 1374\end{verbatim}\end{quote} 1375 1376In this state, we have successfully parsed a \verb"+" expression (rule 1). 1377When the next symbol is \verb"+" or \verb"*", we have the choice between the 1378reduction and shifting the symbol. Using the default shift/reduce 1379disambiguating rule, TP Yacc has resolved these conflicts in favour of shift. 1380 1381Now let us assume the above precedence definition: 1382\begin{quote}\begin{verbatim} 1383 %left '+' 1384 %left '*' 1385\end{verbatim}\end{quote} 1386which gives \verb"*" higher precedence than \verb"+" and makes both operators 1387left-associative. The rightmost terminal in rule 1 is \verb"+". Hence, given 1388these precedence definitions, the first conflict will be resolved in favour 1389of shift (\verb"*" has higher precedence than \verb"+"), while the second one 1390is resolved in favour of reduce (\verb"+" is left-associative). 1391 1392Similar conflicts arise in state 7: 1393 1394\begin{quote}\begin{verbatim} 1395state 7: 1396 1397 *** conflicts: 1398 1399 shift 4, reduce 2 on '*' 1400 shift 5, reduce 2 on '+' 1401 1402 expr : expr '*' expr _ (2) 1403 expr : expr _ '+' expr 1404 expr : expr _ '*' expr 1405 1406 '*' shift 4 1407 '+' shift 5 1408 $end reduce 2 1409 ')' reduce 2 1410 . error 1411\end{verbatim}\end{quote} 1412 1413Here, we have successfully parsed a \verb"*" expression which may be followed 1414by another \verb"+" or \verb"*" operator. Since \verb"*" is left-associative 1415and has higher precedence than \verb"+", both conflicts will be resolved in 1416favour of reduce. 1417 1418Of course, you can also have different operators on the same precedence 1419level. For instance, consider the following extended version of the 1420arithmetic expression grammar: 1421 1422\begin{quote}\begin{verbatim} 1423%token NUM 1424%left '+' '-' 1425%left '*' '/' 1426%% 1427expr : expr '+' expr 1428 | expr '-' expr 1429 | expr '*' expr 1430 | expr '/' expr 1431 | '(' expr ')' 1432 | NUM 1433 ; 1434\end{verbatim}\end{quote} 1435 1436This puts all ``addition'' operators on the first and all ``multiplication'' 1437operators on the second precedence level. All operators are left-associative; 1438for instance, \verb"5+3-2" will be parsed as \verb"(5+3)-2". 1439 1440By default, TP Yacc assigns each rule the precedence of its rightmost 1441terminal. This is a sensible decision in most cases. Occasionally, it 1442may be necessary to overwrite this default choice and explicitly assign 1443a precedence to a rule. This can be done by putting a precedence tag 1444of the form 1445\begin{quote}\begin{verbatim} 1446 %prec symbol 1447\end{verbatim}\end{quote} 1448at the end of the corresponding rule which gives the rule the precedence 1449of the specified symbol. For instance, to extend the expression grammar 1450with a unary minus operator, giving it highest precedence, you may write: 1451 1452\begin{quote}\begin{verbatim} 1453%token NUM 1454%left '+' '-' 1455%left '*' '/' 1456%right UMINUS 1457%% 1458expr : expr '+' expr 1459 | expr '-' expr 1460 | expr '*' expr 1461 | expr '/' expr 1462 | '-' expr %prec UMINUS 1463 | '(' expr ')' 1464 | NUM 1465 ; 1466\end{verbatim}\end{quote} 1467 1468Note the use of the \verb"UMINUS" token which is not an actual input symbol 1469but whose sole purpose it is to give unary minus its proper precedence. If 1470we omitted the precedence tag, both unary and binary minus would have the 1471same precedence because they are represented by the same input symbol. 1472 1473\subsection{Error Handling} 1474 1475Syntactic error handling is a difficult area in the design of user-friendly 1476parsers. Usually, you will not like to have the parser give up upon the 1477first occurrence of an errorneous input symbol. Instead, the parser should 1478recover from a syntax error, that is, it should try to find a place in the 1479input where it can resume the parse. 1480 1481TP Yacc provides a general mechanism to implement parsers with error 1482recovery. A special predefined \verb"error" token may be used in grammar rules 1483to indicate positions where syntax errors might occur. When the parser runs 1484into an error action (i.e., reads an errorneous input symbol) it prints out 1485an error message and starts error recovery by popping its stack until it 1486uncovers a state in which there is a shift action on the \verb"error" token. 1487If there is no such state, the parser terminates with return value 1, 1488indicating an unrecoverable syntax error. If there is such a state, the 1489parser takes the shift on the \verb"error" token (pretending it has seen 1490an imaginary \verb"error" token in the input), and resumes parsing in a 1491special ``error mode.'' 1492 1493While in error mode, the parser quietly skips symbols until it can again 1494perform a legal shift action. To prevent a cascade of error messages, the 1495parser returns to its normal mode of operation only after it has seen 1496and shifted three legal input symbols. Any additional error found after 1497the first shifted symbol restarts error recovery, but no error message 1498is printed. The TP Yacc library routine \verb"yyerrok" may be used to reset 1499the parser to its normal mode of operation explicitly. 1500 1501For a simple example, consider the rule 1502\begin{quote}\begin{verbatim} 1503 stmt : error ';' { yyerrok; } 1504\end{verbatim}\end{quote} 1505and assume a syntax error occurs while a statement (nonterminal \verb"stmt") 1506is parsed. The parser prints an error message, then pops its stack until it 1507can shift the token \verb"error" of the error rule. Proceeding in error mode, 1508it will skip symbols until it finds a semicolon, then reduces by the error 1509rule. The call to \verb"yyerrok" tells the parser that we have recovered from 1510the error and that it should proceed with the normal parse. This kind of 1511``panic mode'' error recovery scheme works well when statements are always 1512terminated with a semicolon. The parser simply skips the ``bad'' statement 1513and then resumes the parse. 1514 1515Implementing a good error recovery scheme can be a difficult task; see 1516Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic. 1517Schreiner and Friedman have developed a systematic technique to implement 1518error recovery with Yacc which I found quite useful (I used it myself 1519to implement error recovery in the TP Yacc parser); see Schreiner/Friedman 1520(1985). 1521 1522\subsection{Yacc Library} 1523 1524The TP Yacc library (\verb"YaccLib") unit provides some global declarations 1525used by the parser routine \verb"yyparse", and some variables and utility 1526routines which may be used to control the actions of the parser and to 1527implement error recovery. See the file \verb"yacclib.pas" for a description 1528of these variables and routines. 1529 1530You can also modify the Yacc library unit (and/or the code template in the 1531\verb"yyparse.cod" file) to customize TP Yacc to your target applications. 1532 1533\subsection{Other Features} 1534 1535TP Yacc supports all additional language elements entitled as ``Old Features 1536Supported But not Encouraged'' in the UNIX manual, which are provided for 1537backward compatibility with older versions of (UNIX) Yacc: 1538 1539\begin{itemize} 1540 \item 1541 literals delimited by double quotes. 1542 \item 1543 multiple-character literals. Note that these are not treated as 1544 character sequences but represent single tokens which are given a 1545 symbolic integer code just like any other token identifier. However, 1546 they will not be declared in the output file, so you have to make sure 1547 yourself that the lexical analyzer returns the correct codes for these 1548 symbols. E.g., you might explicitly assign token numbers by using a 1549 definition like 1550 \begin{quote}\begin{verbatim} 1551 %token ':=' 257 1552 \end{verbatim}\end{quote} 1553 at the beginning of the Yacc grammar. 1554 \item 1555 \verb"\" may be used instead of \verb"%", i.e. \verb"\\" means 1556 \verb"%%", \verb"\left" is the same as \verb"%left", etc. 1557 \item 1558 other synonyms: 1559 \begin{itemize} 1560 \item \verb"%<" for \verb"%left" 1561 \item \verb"%>" for \verb"%right" 1562 \item \verb"%binary" or \verb"%2" for \verb"%nonassoc" 1563 \item \verb"%term" or \verb"%0" for \verb"%token" 1564 \item \verb"%=" for \verb"%prec" 1565 \end{itemize} 1566 \item 1567 actions may also be written as \verb"= { ... }" or 1568 \verb"= single-statement;" 1569 \item 1570 Turbo Pascal declarations (\verb"%{ ... %}") may be put at the 1571 beginning of the rules section. They will be treated as local 1572 declarations of the actions routine. 1573\end{itemize} 1574 1575\subsection{Implementation Restrictions} 1576 1577As with TP Lex, internal table sizes and the main memory available limit the 1578complexity of source grammars that TP Yacc can handle. However, the maximum 1579table sizes provided by TP Yacc are large enough to handle quite complex 1580grammars (such as the Pascal grammar in the TP Yacc distribution). The actual 1581table sizes are shown in the statistics printed by TP Yacc when a compilation 1582is finished. The given figures are "s" (states), "i" (LR0 kernel items), "t" 1583(shift and goto transitions) and "r" (reductions). 1584 1585The default stack size of the generated parsers is \verb"yymaxdepth = 1024", 1586as declared in the TP Yacc library unit. This should be sufficient for any 1587average application, but you can change the stack size by including a 1588corresponding declaration in the definitions part of the Yacc grammar 1589(or change the value in the \verb"YaccLib" unit). Note that right-recursive 1590grammar rules may increase stack space requirements, so it is a good 1591idea to use left-recursive rules wherever possible. 1592 1593\subsection{Differences from UNIX Yacc} 1594 1595Major differences between TP Yacc and UNIX Yacc are listed below. 1596 1597\begin{itemize} 1598 \item 1599 TP Yacc produces output code for Turbo Pascal, rather than for C. 1600 \item 1601 TP Yacc does not support \verb"%union" definitions. Instead, a value 1602 type is declared by specifying the type identifier itself as the tag of 1603 a \verb"%token" or \verb"%type" definition. TP Yacc will automatically 1604 generate an appropriate variant record type (\verb"YYSType") which is 1605 capable of holding values of any of the types used in \verb"%token" and 1606 \verb"%type". 1607 1608 Type checking is very strict. If you use type definitions, then 1609 any symbol referred to in an action must have a type introduced 1610 in a type definition. Either the symbol must have been assigned a 1611 type in the definitions section, or the \verb"$<type-identifier>" 1612 notation must be used. The syntax of the \verb"%type" definition has 1613 been changed slightly to allow definitions of the form 1614 \begin{quote}\begin{verbatim} 1615 %type <type-identifier> 1616 \end{verbatim}\end{quote} 1617 (omitting the nonterminals) which may be used to declare types which 1618 are not assigned to any grammar symbol, but are used with the 1619 \verb"$<...>" construct. 1620 \item 1621 The parse tables constructed by this Yacc version are slightly greater 1622 than those constructed by UNIX Yacc, since a reduce action will only be 1623 chosen as the default action if it is the only action in the state. 1624 In difference, UNIX Yacc chooses a reduce action as the default action 1625 whenever it is the only reduce action of the state (even if there are 1626 other shift actions). 1627 1628 This solves a bug in UNIX Yacc that makes the generated parser start 1629 error recovery too late with certain types of error productions (see 1630 also Schreiner/Friedman, {\em Introduction to compiler construction with 1631 UNIX,\/} 1985). Also, errors will be caught sooner in most cases where 1632 UNIX Yacc would carry out an additional (default) reduction before 1633 detecting the error. 1634 \item 1635 Library routines are named differently from the UNIX version (e.g., 1636 the \verb"yyerrlab" routine takes the place of the \verb"YYERROR" 1637 macro of UNIX Yacc), and, of course, all macros of UNIX Yacc 1638 (\verb"YYERROR", \verb"YYACCEPT", etc.) had to be implemented as 1639 procedures. 1640\end{itemize} 1641 1642\end{document} 1643