1 2 3 TP Lex and Yacc - The Compiler Writer's Tools for Turbo Pascal 4 == === === ==== = === ======== ======== ===== === ===== ====== 5 6 Version 4.1 User Manual 7 ======= === ==== ====== 8 9 Albert Graef 10 Department of Musicinformatics 11 Johannes Gutenberg-University Mainz 12 13 ag@muwiinfa.geschichte.uni-mainz.de 14 15 April 1998 16 17 18Introduction 19============ 20 21This document describes the TP Lex and Yacc compiler generator toolset. These 22tools are designed especially to help you prepare compilers and similar 23programs like text processing utilities and command language interpreters with 24the Turbo Pascal (TM) programming language. 25 26TP Lex and Yacc are Turbo Pascal adaptions of the well-known UNIX (TM) 27utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson at 28Bell Laboratories, and are used with the C programming language. TP Lex and 29Yacc are intended to be approximately "compatible" with these programs. 30However, they are an independent development of the author, based on the 31techniques described in the famous "dragon book" of Aho, Sethi and Ullman 32(Aho, Sethi, Ullman: "Compilers : principles, techniques and tools," Reading 33(Mass.), Addison-Wesley, 1986). 34 35Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland 36Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo 37Pascal-compatible compiler which currently runs on DOS and Linux (other ports 38are under development). Recent information about TP Lex/Yacc, and the sources 39are available from the TPLY homepage: 40 41 http://www.musikwissenschaft.uni-mainz.de/~ag/tply 42 43For information about the Free Pascal Compiler, please refer to: 44 45 http://www.freepascal.org 46 47TP Lex and Yacc, like any other tools of this kind, are not intended for 48novices or casual programmers; they require extensive programming experience 49as well as a thorough understanding of the principles of parser design and 50implementation to be put to work successfully. But if you are a seasoned Turbo 51Pascal programmer with some background in compiler design and formal language 52theory, you will almost certainly find TP Lex and Yacc to be a powerful 53extension of your Turbo Pascal toolset. 54 55This manual tells you how to get started with the TP Lex and Yacc programs and 56provides a short description of these programs. Some knowledge about the C 57versions of Lex and Yacc will be useful, although not strictly necessary. For 58further reading, you may also refer to: 59 60- Aho, Sethi and Ullman: "Compilers : principles, techniques and tools." 61 Reading (Mass.), Addison-Wesley, 1986. 62 63- Johnson, S.C.: "Yacc - yet another compiler-compiler." CSTR-32, Bell 64 Telephone Laboratories, 1974. 65 66- Lesk, M.E.: "Lex - a lexical analyser generator." CSTR-39, Bell Telephone 67 Laboratories, 1975. 68 69- Schreiner, Friedman: "Introduction to compiler construction with UNIX." 70 Prentice-Hall, 1985. 71 72- The Unix Programmer's Manual, Sections `Lex' and `Yacc'. 73 74 75Credits 76------- 77 78I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex 79and Yacc to take advantage of the large memory models in Borland Pascal 7.0 80and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be), 81the maintainer of the Linux version of the Free Pascal compiler, who is 82responsible for the Free Pascal port. And of course thanks are due to the many 83TP Lex/Yacc users all over the world for their support and comments which 84helped to improve these programs. 85 86 87Getting Started 88--------------- 89 90Instructions on how to compile and install TP Lex and Yacc on all supported 91platforms can be found in the README file contained in the distribution. 92 93Once you have installed TP Lex and Yacc on your system, you can compile your 94first TP Lex and Yacc program expr. Expr is a simple desktop calculator 95program contained in the distribution, which consists of a lexical analyzer in 96the TP Lex source file exprlex.l and the parser and main program in the TP 97Yacc source file expr.y. To compile these programs, issue the commands 98 99 lex exprlex 100 yacc expr 101 102That's it! You now have the Turbo Pascal sources (exprlex.pas and expr.pas) 103for the expr program. Use the Turbo Pascal compiler to compile these programs 104as usual: 105 106 tpc expr 107 108(Of course, the precise compilation command depends on the type of compiler 109you are using. Thus you may have to replace tpc with bpc, dcc or dcc32, 110depending on the version of the Turbo/Borland/Delphi compiler you have, and 111with ppc386 for the Free Pascal compiler. If you are using TP Lex and Yacc 112with Free Pascal under Linux, the corresponding commands are: 113 114 plex exprlex 115 pyacc expr 116 ppc386 expr 117 118Note that in the Linux version, the programs are named plex and pyacc to 119avoid name clashes with the corresponding UNIX utilities.) 120 121Having compiled expr.pas, you can execute the expr program and type some 122expressions to see it work (terminate the program with an empty line). There 123is a number of other sample TP Lex and Yacc programs (.l and .y files) in the 124distribution, including a TP Yacc cross reference utility and a complete 125parser for Standard Pascal. 126 127The TP Lex and Yacc programs recognize some options which may be specified 128anywhere on the command line. E.g., 129 130 lex -o exprlex 131 132runs TP Lex with "DFA optimization" and 133 134 yacc -v expr 135 136runs TP Yacc in "verbose" mode (TP Yacc generates a readable description of 137the generated parser). 138 139The TP Lex and Yacc programs use the following default filename extensions: 140- .l: TP Lex input files 141- .y: TP Yacc input files 142- .pas: TP Lex and Yacc output files 143 144As usual, you may overwrite default filename extensions by explicitly 145specifying suffixes. 146 147If you ever forget how to run TP Lex and Yacc, you can issue the command lex 148or yacc (resp. plex or pyacc) without arguments to get a short summary of the 149command line syntax. 150 151 152 153TP Lex 154====== 155 156This section describes the TP Lex lexical analyzer generator. 157 158 159Usage 160----- 161 162lex [options] lex-file[.l] [output-file[.pas]] 163 164 165Options 166------- 167 168-v "Verbose:" Lex generates a readable description of the generated 169 lexical analyzer, written to lex-file with new extension `.lst'. 170 171-o "Optimize:" Lex optimizes DFA tables to produce a minimal DFA. 172 173 174Description 175----------- 176 177TP Lex is a program generator that is used to generate the Turbo Pascal source 178code for a lexical analyzer subroutine from the specification of an input 179language by a regular expression grammar. 180 181TP Lex parses the source grammar contained in lex-file (with default suffix 182.l) and writes the constructed lexical analyzer subroutine to the specified 183output-file (with default suffix .pas); if no output file is specified, output 184goes to lex-file with new suffix .pas. If any errors are found during 185compilation, error messages are written to the list file (lex-file with new 186suffix .lst). 187 188The generated output file contains a lexical analyzer routine, yylex, 189implemented as: 190 191 function yylex : Integer; 192 193This routine has to be called by your main program to execute the lexical 194analyzer. The return value of the yylex routine usually denotes the number 195of a token recognized by the lexical analyzer (see the return routine in the 196LexLib unit). At end-of-file the yylex routine normally returns 0. 197 198The code template for the yylex routine may be found in the yylex.cod 199file. This file is needed by TP Lex when it constructs the output file. It 200must be present either in the current directory or in the directory from which 201TP Lex was executed (TP Lex searches these directories in the indicated 202order). (NB: For the Linux/Free Pascal version, the code template is searched 203in some directory defined at compile-time instead of the execution path, 204usually /usr/lib/fpc/lexyacc.) 205 206The TP Lex library (LexLib) unit is required by programs using Lex-generated 207lexical analyzers; you will therefore have to put an appropriate uses clause 208into your program or unit that contains the lexical analyzer routine. The 209LexLib unit also provides various useful utility routines; see the file 210lexlib.pas for further information. 211 212 213Lex Source 214---------- 215 216A TP Lex program consists of three sections separated with the %% delimiter: 217 218definitions 219%% 220rules 221%% 222auxiliary procedures 223 224All sections may be empty. The TP Lex language is line-oriented; definitions 225and rules are separated by line breaks. There is no special notation for 226comments, but (Turbo Pascal style) comments may be included as Turbo Pascal 227fragments (see below). 228 229The definitions section may contain the following elements: 230 231- regular definitions in the format: 232 233 name substitution 234 235 which serve to abbreviate common subexpressions. The {name} notation 236 causes the corresponding substitution from the definitions section to 237 be inserted into a regular expression. The name must be a legal 238 identifier (letter followed by a sequence of letters and digits; 239 the underscore counts as a letter; upper- and lowercase are distinct). 240 Regular definitions must be non-recursive. 241 242- start state definitions in the format: 243 244 %start name ... 245 246 which are used in specifying start conditions on rules (described 247 below). The %start keyword may also be abbreviated as %s or %S. 248 249- Turbo Pascal declarations enclosed between %{ and %}. These will be 250 inserted into the output file (at global scope). Also, any line that 251 does not look like a Lex definition (e.g., starts with blank or tab) 252 will be treated as Turbo Pascal code. (In particular, this also allows 253 you to include Turbo Pascal comments in your Lex program.) 254 255The rules section of a TP Lex program contains the actual specification of 256the lexical analyzer routine. It may be thought of as a big CASE statement 257discriminating over the different patterns to be matched and listing the 258corresponding statements (actions) to be executed. Each rule consists of a 259regular expression describing the strings to be matched in the input, and a 260corresponding action, a Turbo Pascal statement to be executed when the 261expression matches. Expression and statement are delimited with whitespace 262(blanks and/or tabs). Thus the format of a Lex grammar rule is: 263 264 expression statement; 265 266Note that the action must be a single Turbo Pascal statement terminated 267with a semicolon (use begin ... end for compound statements). The statement 268may span multiple lines if the successor lines are indented with at least 269one blank or tab. The action may also be replaced by the | character, 270indicating that the action for this rule is the same as that for the next 271one. 272 273The TP Lex library unit provides various variables and routines which are 274useful in the programming of actions. In particular, the yytext string 275variable holds the text of the matched string, and the yyleng Byte variable 276its length. 277 278Regular expressions are used to describe the strings to be matched in a 279grammar rule. They are built from the usual constructs describing character 280classes and sequences, and operators specifying repetitions and alternatives. 281The precise format of regular expressions is described in the next section. 282 283The rules section may also start with some Turbo Pascal declarations 284(enclosed in %{ %}) which are treated as local declarations of the 285actions routine. 286 287Finally, the auxiliary procedures section may contain arbitrary Turbo 288Pascal code (such as supporting routines or a main program) which is 289simply tacked on to the end of the output file. The auxiliary procedures 290section is optional. 291 292 293Regular Expressions 294------------------- 295 296The following table summarizes the format of the regular expressions 297recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48). 298c stands for a single character, s for a string, r for a regular expression, 299and n,m for nonnegative integers. 300 301expression matches example 302---------- ---------------------------- ------- 303c any non-operator character c a 304\c character c literally \* 305"s" string s literally "**" 306. any character but newline a.*b 307^ beginning of line ^abc 308$ end of line abc$ 309[s] any character in s [abc] 310[^s] any character not in s [^abc] 311r* zero or more r's a* 312r+ one or more r's a+ 313r? zero or one r a? 314r{m,n} m to n occurrences of r a{1,5} 315r{m} m occurrences of r a{5} 316r1r2 r1 then r2 ab 317r1|r2 r1 or r2 a|b 318(r) r (a|b) 319r1/r2 r1 when followed by r2 a/b 320<x>r r when in start condition x <x>abc 321--------------------------------------------------- 322 323The operators *, +, ? and {} have highest precedence, followed by 324concatenation. The | operator has lowest precedence. Parentheses () 325may be used to group expressions and overwrite default precedences. 326The <> and / operators may only occur once in an expression. 327 328The usual C-like escapes are recognized: 329 330\n denotes newline 331\r denotes carriage return 332\t denotes tab 333\b denotes backspace 334\f denotes form feed 335\NNN denotes character no. NNN in octal base 336 337You can also use the \ character to quote characters which would otherwise 338be interpreted as operator symbols. In character classes, you may use 339the - character to denote ranges of characters. For instance, [a-z] 340denotes the class of all lowercase letters. 341 342The expressions in a TP Lex program may be ambigious, i.e. there may be inputs 343which match more than one rule. In such a case, the lexical analyzer prefers 344the longest match and, if it still has the choice between different rules, 345it picks the first of these. If no rule matches, the lexical analyzer 346executes a default action which consists of copying the input character 347to the output unchanged. Thus, if the purpose of a lexical analyzer is 348to translate some parts of the input, and leave the rest unchanged, you 349only have to specify the patterns which have to be treated specially. If, 350however, the lexical analyzer has to absorb its whole input, you will have 351to provide rules that match everything. E.g., you might use the rules 352 353 . | 354 \n ; 355 356which match "any other character" (and ignore it). 357 358Sometimes certain patterns have to be analyzed differently depending on some 359amount of context in which the pattern appears. In such a case the / operator 360is useful. For instance, the expression a/b matches a, but only if followed 361by b. Note that the b does not belong to the match; rather, the lexical 362analyzer, when matching an a, will look ahead in the input to see whether 363it is followed by a b, before it declares that it has matched an a. Such 364lookahead may be arbitrarily complex (up to the size of the LexLib input 365buffer). E.g., the pattern a/.*b matches an a which is followed by a b 366somewhere on the same input line. TP Lex also has a means to specify left 367context which is described in the next section. 368 369 370Start Conditions 371---------------- 372 373TP Lex provides some features which make it possible to handle left context. 374The ^ character at the beginning of a regular expression may be used to 375denote the beginning of the line. More distant left context can be described 376conveniently by using start conditions on rules. 377 378Any rule which is prefixed with the <> construct is only valid if the lexical 379analyzer is in the denoted start state. For instance, the expression <x>a 380can only be matched if the lexical analyzer is in start state x. You can have 381multiple start states in a rule; e.g., <x,y>a can be matched in start states 382x or y. 383 384Start states have to be declared in the definitions section by means of 385one or more start state definitions (see above). The lexical analyzer enters 386a start state through a call to the LexLib routine start. E.g., you may 387write: 388 389%start x y 390%% 391<x>a start(y); 392<y>b start(x); 393%% 394begin 395 start(x); if yylex=0 then ; 396end. 397 398Upon initialization, the lexical analyzer is put into state x. It then 399proceeds in state x until it matches an a which puts it into state y. 400In state y it may match a b which puts it into state x again, etc. 401 402Start conditions are useful when certain constructs have to be analyzed 403differently depending on some left context (such as a special character 404at the beginning of the line), and if multiple lexical analyzers have to 405work in concert. If a rule is not prefixed with a start condition, it is 406valid in all user-defined start states, as well as in the lexical analyzer's 407default start state. 408 409 410Lex Library 411----------- 412 413The TP Lex library (LexLib) unit provides various variables and routines 414which are used by Lex-generated lexical analyzers and application programs. 415It provides the input and output streams and other internal data structures 416used by the lexical analyzer routine, and supplies some variables and utility 417routines which may be used by actions and application programs. Refer to 418the file lexlib.pas for a closer description. 419 420You can also modify the Lex library unit (and/or the code template in the 421yylex.cod file) to customize TP Lex to your target applications. E.g., 422you might wish to optimize the code of the lexical analyzer for some 423special application, make the analyzer read from/write to memory instead 424of files, etc. 425 426 427Implementation Restrictions 428--------------------------- 429 430Internal table sizes and the main memory available limit the complexity of 431source grammars that TP Lex can handle. There is currently no possibility to 432change internal table sizes (apart from modifying the sources of TP Lex 433itself), but the maximum table sizes provided by TP Lex seem to be large 434enough to handle most realistic applications. The actual table sizes depend on 435the particular implementation (they are much larger than the defaults if TP 436Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or 437Free Pascal), and are shown in the statistics printed by TP Lex when a 438compilation is finished. The units given there are "p" (positions, i.e. items 439in the position table used to construct the DFA), "s" (DFA states) and "t" 440(transitions of the generated DFA). 441 442As implemented, the generated DFA table is stored as a typed array constant 443which is inserted into the yylex.cod code template. The transitions in each 444state are stored in order. Of course it would have been more efficient to 445generate a big CASE statement instead, but I found that this may cause 446problems with the encoding of large DFA tables because Turbo Pascal has 447a quite rigid limit on the code size of individual procedures. I decided to 448use a scheme in which transitions on different symbols to the same state are 449merged into one single transition (specifying a character set and the 450corresponding next state). This keeps the number of transitions in each state 451quite small and still allows a fairly efficient access to the transition 452table. 453 454The TP Lex program has an option (-o) to optimize DFA tables. This causes a 455minimal DFA to be generated, using the algorithm described in Aho, Sethi, 456Ullman (1986). Although the absolute limit on the number of DFA states that TP 457Lex can handle is at least 300, TP Lex poses an additional restriction (100) 458on the number of states in the initial partition of the DFA optimization 459algorithm. Thus, you may get a fatal `integer set overflow' message when using 460the -o option even when TP Lex is able to generate an unoptimized DFA. In such 461cases you will just have to be content with the unoptimized DFA. (Hopefully, 462this will be fixed in a future version. Anyhow, using the merged transitions 463scheme described above, TP Lex usually constructs unoptimized DFA's which are 464not far from being optimal, and thus in most cases DFA optimization won't have 465a great impact on DFA table sizes.) 466 467 468Differences from UNIX Lex 469------------------------- 470 471Major differences between TP Lex and UNIX Lex are listed below. 472 473- TP Lex produces output code for Turbo Pascal, rather than for C. 474 475- Character tables (%T) are not supported; neither are any directives 476 to determine internal table sizes (%p, %n, etc.). 477 478- Library routines are named differently from the UNIX version (e.g., 479 the `start' routine takes the place of the `BEGIN' macro of UNIX 480 Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had 481 to be implemented as procedures. 482 483- The TP Lex library unit starts counting line numbers at 0, incrementing 484 the count BEFORE a line is read (in contrast, UNIX Lex initializes 485 yylineno to 1 and increments it AFTER the line end has been read). This 486 is motivated by the way in which TP Lex maintains the current line, 487 and will not affect your programs unless you explicitly reset the 488 yylineno value (e.g., when opening a new input file). In such a case 489 you should set yylineno to 0 rather than 1. 490 491 492 493 494TP Yacc 495======= 496 497This section describes the TP Yacc compiler compiler. 498 499 500Usage 501----- 502 503yacc [options] yacc-file[.y] [output-file[.pas]] 504 505 506Options 507------- 508 509-v "Verbose:" TP Yacc generates a readable description of the generated 510 parser, written to yacc-file with new extension .lst. 511 512-d "Debug:" TP Yacc generates parser with debugging output. 513 514 515Description 516----------- 517 518TP Yacc is a program that lets you prepare parsers from the description 519of input languages by BNF-like grammars. You simply specify the grammar 520for your target language, augmented with the Turbo Pascal code necessary 521to process the syntactic constructs, and TP Yacc translates your grammar 522into the Turbo Pascal code for a corresponding parser subroutine named 523yyparse. 524 525TP Yacc parses the source grammar contained in yacc-file (with default 526suffix .y) and writes the constructed parser subroutine to the specified 527output-file (with default suffix .pas); if no output file is specified, 528output goes to yacc-file with new suffix .pas. If any errors are found 529during compilation, error messages are written to the list file (yacc-file 530with new suffix .lst). 531 532The generated parser routine, yyparse, is declared as: 533 534 function yyparse : Integer; 535 536This routine may be called by your main program to execute the parser. 537The return value of the yyparse routine denotes success or failure of 538the parser (possible return values: 0 = success, 1 = unrecoverable syntax 539error or parse stack overflow). 540 541Similar to TP Lex, the code template for the yyparse routine may be found in 542the yyparse.cod file. The rules for locating this file are analogous to those 543of TP Lex (see Section `TP Lex'). 544 545The TP Yacc library (YaccLib) unit is required by programs using Yacc- 546generated parsers; you will therefore have to put an appropriate uses clause 547into your program or unit that contains the parser routine. The YaccLib unit 548also provides some routines which may be used to control the actions of the 549parser. See the file yacclib.pas for further information. 550 551 552Yacc Source 553----------- 554 555A TP Yacc program consists of three sections separated with the %% delimiter: 556 557definitions 558%% 559rules 560%% 561auxiliary procedures 562 563 564The TP Yacc language is free-format: whitespace (blanks, tabs and newlines) 565is ignored, except if it serves as a delimiter. Comments have the C-like 566format /* ... */. They are treated as whitespace. Grammar symbols are denoted 567by identifiers which have the usual form (letter, including underscore, 568followed by a sequence of letters and digits; upper- and lowercase is 569distinct). The TP Yacc language also has some keywords which always start 570with the % character. Literals are denoted by characters enclosed in single 571quotes. The usual C-like escapes are recognized: 572 573\n denotes newline 574\r denotes carriage return 575\t denotes tab 576\b denotes backspace 577\f denotes form feed 578\NNN denotes character no. NNN in octal base 579 580 581Definitions 582----------- 583 584The first section of a TP Yacc grammar serves to define the symbols used in 585the grammar. It may contain the following types of definitions: 586 587- start symbol definition: A definition of the form 588 589 %start symbol 590 591 declares the start nonterminal of the grammar (if this definition is 592 omitted, TP Yacc assumes the left-hand side nonterminal of the first 593 grammar rule as the start symbol of the grammar). 594 595- terminal definitions: Definitions of the form 596 597 %token symbol ... 598 599 are used to declare the terminal symbols ("tokens") of the target 600 language. Any identifier not introduced in a %token definition will 601 be treated as a nonterminal symbol. 602 603 As far as TP Yacc is concerned, tokens are atomic symbols which do not 604 have an innert structure. A lexical analyzer must be provided which 605 takes on the task of tokenizing the input stream and return the 606 individual tokens and literals to the parser (see Section `Lexical 607 Analysis'). 608 609- precedence definitions: Operator symbols (terminals) may be associated 610 with a precedence by means of a precedence definition which may have 611 one of the following forms 612 613 %left symbol ... 614 %right symbol ... 615 %nonassoc symbol ... 616 617 which are used to declare left-, right- and nonassociative operators, 618 respectively. Each precedence definition introduces a new precedence 619 level, lowest precedence first. E.g., you may write: 620 621 %nonassoc '<' '>' '=' GEQ LEQ NEQ /* relational operators */ 622 %left '+' '-' OR /* addition operators */ 623 %left '*' '/' AND /* multiplication operators */ 624 %right NOT UMINUS /* unary operators */ 625 626 A terminal identifier introduced in a precedence definition may, but 627 need not, appear in a %token definition as well. 628 629- type definitions: Any (terminal or nonterminal) grammar symbol may be 630 associated with a type identifier which is used in the processing of 631 semantic values. Type tags of the form <name> may be used in token and 632 precedence definitions to declare the type of a terminal symbol, e.g.: 633 634 %token <Real> NUM 635 %left <AddOp> '+' '-' 636 637 To declare the type of a nonterminal symbol, use a type definition of 638 the form: 639 640 %type <name> symbol ... 641 642 e.g.: 643 644 %type <Real> expr 645 646 In a %type definition, you may also omit the nonterminals, i.e. you 647 may write: 648 649 %type <name> 650 651 This is useful when a given type is only used with type casts (see 652 Section `Grammar Rules and Actions'), and is not associated with a 653 specific nonterminal. 654 655- Turbo Pascal declarations: You may also include arbitrary Turbo Pascal 656 code in the definitions section, enclosed in %{ %}. This code will be 657 inserted as global declarations into the output file, unchanged. 658 659 660Grammar Rules and Actions 661------------------------- 662 663The second part of a TP Yacc grammar contains the grammar rules for the 664target language. Grammar rules have the format 665 666 name : symbol ... ; 667 668The left-hand side of a rule must be an identifier (which denotes a 669nonterminal symbol). The right-hand side may be an arbitrary (possibly 670empty) sequence of nonterminal and terminal symbols (including literals 671enclosed in single quotes). The terminating semicolon may also be omitted. 672Different rules for the same left-hand side symbols may be written using 673the | character to separate the different alternatives: 674 675 name : symbol ... 676 | symbol ... 677 ... 678 ; 679 680For instance, to specify a simple grammar for arithmetic expressions, you 681may write: 682 683%left '+' '-' 684%left '*' '/' 685%token NUM 686%% 687expr : expr '+' expr 688 | expr '-' expr 689 | expr '*' expr 690 | expr '/' expr 691 | '(' expr ')' 692 | NUM 693 ; 694 695(The %left definitions at the beginning of the grammar are needed to specify 696the precedence and associativity of the operator symbols. This will be 697discussed in more detail in Section `Ambigious Grammars'.) 698 699Grammar rules may contain actions - Turbo Pascal statements enclosed in 700{ } - to be executed as the corresponding rules are recognized. Furthermore, 701rules may return values, and access values returned by other rules. These 702"semantic" values are written as $$ (value of the left-hand side nonterminal) 703and $i (value of the ith right-hand side symbol). They are kept on a special 704value stack which is maintained automatically by the parser. 705 706Values associated with terminal symbols must be set by the lexical analyzer 707(more about this in Section `Lexical Analysis'). Actions of the form $$ := $1 708can frequently be omitted, since it is the default action assumed by TP Yacc 709for any rule that does not have an explicit action. 710 711By default, the semantic value type provided by Yacc is Integer. You can 712also put a declaration like 713 714 %{ 715 type YYSType = Real; 716 %} 717 718into the definitions section of your Yacc grammar to change the default value 719type. However, if you have different value types, the preferred method is to 720use type definitions as discussed in Section `Definitions'. When such type 721definitions are given, TP Yacc handles all the necessary details of the 722YYSType definition and also provides a fair amount of type checking which 723makes it easier to find type errors in the grammar. 724 725For instance, we may declare the symbols NUM and expr in the example above 726to be of type Real, and then use these values to evaluate an expression as 727it is parsed. 728 729%left '+' '-' 730%left '*' '/' 731%token <Real> NUM 732%type <Real> expr 733%% 734expr : expr '+' expr { $$ := $1+$3; } 735 | expr '-' expr { $$ := $1-$3; } 736 | expr '*' expr { $$ := $1*$3; } 737 | expr '/' expr { $$ := $1/$3; } 738 | '(' expr ')' { $$ := $2; } 739 | NUM 740 ; 741 742(Note that we omitted the action of the last rule. The "copy action" 743$$ := $1 required by this rule is automatically added by TP Yacc.) 744 745Actions may not only appear at the end, but also in the middle of a rule 746which is useful to perform some processing before a rule is fully parsed. 747Such actions inside a rule are treated as special nonterminals which are 748associated with an empty right-hand side. Thus, a rule like 749 750 x : y { action; } z 751 752will be treated as: 753 754 x : y $act z 755 $act : { action; } 756 757Actions inside a rule may also access values to the left of the action, 758and may return values by assigning to the $$ value. The value returned 759by such an action can then be accessed by other actions using the usual $i 760notation. E.g., we may write: 761 762 x : y { $$ := 2*$1; } z { $$ := $2+$3; } 763 764which has the effect of setting the value of x to 765 766 2*(the value of y)+(the value of z). 767 768Sometimes it is desirable to access values in enclosing rules. This can be 769done using the notation $i with i<=0. $0 refers to the first value "to the 770left" of the current rule, $-1 to the second, and so on. Note that in this 771case the referenced value depends on the actual contents of the parse stack, 772so you have to make sure that the requested values are always where you 773expect them. 774 775There are some situations in which TP Yacc cannot easily determine the 776type of values (when a typed parser is used). This is true, in particular, 777for values in enclosing rules and for the $$ value in an action inside a 778rule. In such cases you may use a type cast to explicitly specify the type 779of a value. The format for such type casts is $<name>$ (for left-hand side 780values) and $<name>i (for right-hand side values) where name is a type 781identifier (which must occur in a %token, precedence or %type definition). 782 783 784Auxiliary Procedures 785-------------------- 786 787The third section of a TP Yacc program is optional. If it is present, it 788may contain any Turbo Pascal code (such as supporting routines or a main 789program) which is tacked on to the end of the output file. 790 791 792Lexical Analysis 793---------------- 794 795For any TP Yacc-generated parser, the programmer must supply a lexical 796analyzer routine named yylex which performs the lexical analysis for 797the parser. This routine must be declared as 798 799 function yylex : Integer; 800 801The yylex routine may either be prepared by hand, or by using the lexical 802analyzer generator TP Lex (see Section `TP Lex'). 803 804The lexical analyzer must be included in your main program behind the 805parser subroutine (the yyparse code template includes a forward 806definition of the yylex routine such that the parser can access the 807lexical analyzer). For instance, you may put the lexical analyzer 808routine into the auxiliary procedures section of your TP Yacc grammar, 809either directly, or by using the the Turbo Pascal include directive 810($I). 811 812The parser repeatedly calls the yylex routine to tokenize the input 813stream and obtain the individual lexical items in the input. For any 814literal character, the yylex routine has to return the corresponding 815character code. For the other, symbolic, terminals of the input language, 816the lexical analyzer must return corresponding Integer codes. These are 817assigned automatically by TP Yacc in the order in which token definitions 818appear in the definitions section of the source grammar. The lexical 819analyzer can access these values through corresponding Integer constants 820which are declared by TP Yacc in the output file. 821 822For instance, if 823 824 %token NUM 825 826is the first definition in the Yacc grammar, then TP Yacc will create 827a corresponding constant declaration 828 829 const NUM = 257; 830 831in the output file (TP Yacc automatically assigns symbolic token numbers 832starting at 257; 1 thru 255 are reserved for character literals, 0 denotes 833end-of-file, and 256 is reserved for the special error token which will be 834discussed in Section `Error Handling'). This definition may then be used, 835e.g., in a corresponding TP Lex program as follows: 836 837 [0-9]+ return(NUM); 838 839You can also explicitly assign token numbers in the grammar. For this 840purpose, the first occurrence of a token identifier in the definitions 841section may be followed by an unsigned integer. E.g. you may write: 842 843 %token NUM 299 844 845Besides the return value of yylex, the lexical analyzer routine may also 846return an additional semantic value for the recognized token. This value 847is assigned to a variable named "yylval" and may then be accessed in actions 848through the $i notation (see above, Section `Grammar Rules and Actions'). 849The yylval variable is of type YYSType (the semantic value type, Integer 850by default); its declaration may be found in the yyparse.cod file. 851 852For instance, to assign an Integer value to a NUM token in the above 853example, we may write: 854 855 [0-9]+ begin 856 val(yytext, yylval, code); 857 return(NUM); 858 end; 859 860This assigns yylval the value of the NUM token (using the Turbo Pascal 861standard procedure val). 862 863If a parser uses tokens of different types (via a %token <name> definition), 864then the yylval variable will not be of type Integer, but instead of a 865corresponding variant record type which is capable of holding all the 866different value types declared in the TP Yacc grammar. In this case, the 867lexical analyzer must assign a semantic value to the corresponding record 868component which is named yy<name> (where <name> stands for the corresponding 869type identifier). 870 871E.g., if token NUM is declared Real: 872 873 %token <Real> NUM 874 875then the value for token NUM must be assigned to yylval.yyReal. 876 877 878How The Parser Works 879-------------------- 880 881TP Yacc uses the LALR(1) technique developed by Donald E. Knuth and F. 882DeRemer to construct a simple, efficient, non-backtracking bottom-up 883parser for the source grammar. The LALR parsing technique is described 884in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a 885look at the parser description TP Yacc generates from a small sample 886grammar, to get an idea of how the LALR parsing algorithm works. We 887consider the following simplified version of the arithmetic expression 888grammar: 889 890%token NUM 891%left '+' 892%left '*' 893%% 894expr : expr '+' expr 895 | expr '*' expr 896 | '(' expr ')' 897 | NUM 898 ; 899 900When run with the -v option on the above grammar, TP Yacc generates the 901parser description listed below. 902 903state 0: 904 905 $accept : _ expr $end 906 907 '(' shift 2 908 NUM shift 3 909 . error 910 911 expr goto 1 912 913state 1: 914 915 $accept : expr _ $end 916 expr : expr _ '+' expr 917 expr : expr _ '*' expr 918 919 $end accept 920 '*' shift 4 921 '+' shift 5 922 . error 923 924state 2: 925 926 expr : '(' _ expr ')' 927 928 '(' shift 2 929 NUM shift 3 930 . error 931 932 expr goto 6 933 934state 3: 935 936 expr : NUM _ (4) 937 938 . reduce 4 939 940state 4: 941 942 expr : expr '*' _ expr 943 944 '(' shift 2 945 NUM shift 3 946 . error 947 948 expr goto 7 949 950state 5: 951 952 expr : expr '+' _ expr 953 954 '(' shift 2 955 NUM shift 3 956 . error 957 958 expr goto 8 959 960state 6: 961 962 expr : '(' expr _ ')' 963 expr : expr _ '+' expr 964 expr : expr _ '*' expr 965 966 ')' shift 9 967 '*' shift 4 968 '+' shift 5 969 . error 970 971state 7: 972 973 expr : expr '*' expr _ (2) 974 expr : expr _ '+' expr 975 expr : expr _ '*' expr 976 977 . reduce 2 978 979state 8: 980 981 expr : expr '+' expr _ (1) 982 expr : expr _ '+' expr 983 expr : expr _ '*' expr 984 985 '*' shift 4 986 $end reduce 1 987 ')' reduce 1 988 '+' reduce 1 989 . error 990 991state 9: 992 993 expr : '(' expr ')' _ (3) 994 995 . reduce 3 996 997 998Each state of the parser corresponds to a certain prefix of the input 999which has already been seen. The parser description lists the grammar 1000rules wich are parsed in each state, and indicates the portion of each 1001rule which has already been parsed by an underscore. In state 0, the 1002start state of the parser, the parsed rule is 1003 1004 $accept : expr $end 1005 1006This is not an actual grammar rule, but a starting rule automatically 1007added by TP Yacc. In general, it has the format 1008 1009 $accept : X $end 1010 1011where X is the start nonterminal of the grammar, and $end is a pseudo 1012token denoting end-of-input (the $end symbol is used by the parser to 1013determine when it has successfully parsed the input). 1014 1015The description of the start rule in state 0, 1016 1017 $accept : _ expr $end 1018 1019with the underscore positioned before the expr symbol, indicates that 1020we are at the beginning of the parse and are ready to parse an expression 1021(nonterminal expr). 1022 1023The parser maintains a stack to keep track of states visited during the 1024parse. There are two basic kinds of actions in each state: "shift", which 1025reads an input symbol and pushes the corresponding next state on top of 1026the stack, and "reduce" which pops a number of states from the stack 1027(corresponding to the number of right-hand side symbols of the rule used 1028in the reduction) and consults the "goto" entries of the uncovered state 1029to find the transition corresponding to the left-hand side symbol of the 1030reduced rule. 1031 1032In each step of the parse, the parser is in a given state (the state on 1033top of its stack) and may consult the current "lookahead symbol", the 1034next symbol in the input, to determine the parse action - shift or reduce - 1035to perform. The parser terminates as soon as it reaches state 1 and reads 1036in the endmarker, indicated by the "accept" action on $end in state 1. 1037 1038Sometimes the parser may also carry out an action without inspecting the 1039current lookahead token. This is the case, e.g., in state 3 where the 1040only action is reduction by rule 4: 1041 1042 . reduce 4 1043 1044The default action in a state can also be "error" indicating that any 1045other input represents a syntax error. (In case of such an error the 1046parser will start syntactic error recovery, as described in Section 1047`Error Handling'.) 1048 1049Now let us see how the parser responds to a given input. We consider the 1050input string 2+5*3 which is presented to the parser as the token sequence: 1051 1052 NUM + NUM * NUM 1053 1054The following table traces the corresponding actions of the parser. We also 1055show the current state in each move, and the remaining states on the stack. 1056 1057State Stack Lookahead Action 1058----- ------------ --------- -------------------------------------------- 1059 10600 NUM shift state 3 1061 10623 0 reduce rule 4 (pop 1 state, uncovering state 1063 0, then goto state 1 on symbol expr) 1064 10651 0 + shift state 5 1066 10675 1 0 NUM shift state 3 1068 10693 5 1 0 reduce rule 4 (pop 1 state, uncovering state 1070 5, then goto state 8 on symbol expr) 1071 10728 5 1 0 * shift 4 1073 10744 8 5 1 0 NUM shift 3 1075 10763 4 8 5 1 0 reduce rule 4 (pop 1 state, uncovering state 1077 4, then goto state 7 on symbol expr) 1078 10797 4 8 5 1 0 reduce rule 2 (pop 3 states, uncovering state 1080 5, then goto state 8 on symbol expr) 1081 10828 5 1 0 $end reduce rule 1 (pop 3 states, uncovering state 1083 0, then goto state 1 on symbol expr) 1084 10851 0 $end accept 1086 1087It is also instructive to see how the parser responds to illegal inputs. 1088E.g., you may try to figure out what the parser does when confronted with: 1089 1090 NUM + ) 1091 1092or: 1093 1094 ( NUM * NUM 1095 1096You will find that the parser, sooner or later, will always run into an 1097error action when confronted with errorneous inputs. An LALR parser will 1098never shift an invalid symbol and thus will always find syntax errors as 1099soon as it is possible during a left-to-right scan of the input. 1100 1101TP Yacc provides a debugging option (-d) that may be used to trace the 1102actions performed by the parser. When a grammar is compiled with the 1103-d option, the generated parser will print out the actions as it parses 1104its input. 1105 1106 1107Ambigious Grammars 1108------------------ 1109 1110There are situations in which TP Yacc will not produce a valid parser for 1111a given input language. LALR(1) parsers are restricted to one-symbol 1112lookahead on which they have to base their parsing decisions. If a 1113grammar is ambigious, or cannot be parsed unambigiously using one-symbol 1114lookahead, TP Yacc will generate parsing conflicts when constructing the 1115parse table. There are two types of such conflicts: shift/reduce conflicts 1116(when there is both a shift and a reduce action for a given input symbol 1117in a given state), and reduce/reduce conflicts (if there is more than 1118one reduce action for a given input symbol in a given state). Note that 1119there never will be a shift/shift conflict. 1120 1121When a grammar generates parsing conflicts, TP Yacc prints out the number 1122of shift/reduce and reduce/reduce conflicts it encountered when constructing 1123the parse table. However, TP Yacc will still generate the output code for the 1124parser. To resolve parsing conflicts, TP Yacc uses the following built-in 1125disambiguating rules: 1126 1127- in a shift/reduce conflict, TP Yacc chooses the shift action. 1128 1129- in a reduce/reduce conflict, TP Yacc chooses reduction of the first 1130 grammar rule. 1131 1132The shift/reduce disambiguating rule correctly resolves a type of 1133ambiguity known as the "dangling-else ambiguity" which arises in the 1134syntax of conditional statements of many programming languages (as in 1135Pascal): 1136 1137%token IF THEN ELSE 1138%% 1139stmt : IF expr THEN stmt 1140 | IF expr THEN stmt ELSE stmt 1141 ; 1142 1143This grammar is ambigious, because a nested construct like 1144 1145 IF expr-1 THEN IF expr-2 THEN stmt-1 ELSE stmt-2 1146 1147can be parsed two ways, either as: 1148 1149 IF expr-1 THEN ( IF expr-2 THEN stmt-1 ELSE stmt-2 ) 1150 1151or as: 1152 1153 IF expr-1 THEN ( IF expr-2 THEN stmt-1 ) ELSE stmt-2 1154 1155The first interpretation makes an ELSE belong to the last unmatched 1156IF which also is the interpretation chosen in most programming languages. 1157This is also the way that a TP Yacc-generated parser will parse the construct 1158since the shift/reduce disambiguating rule has the effect of neglecting the 1159reduction of IF expr-2 THEN stmt-1; instead, the parser will shift the ELSE 1160symbol which eventually leads to the reduction of IF expr-2 THEN stmt-1 ELSE 1161stmt-2. 1162 1163The reduce/reduce disambiguating rule is used to resolve conflicts that 1164arise when there is more than one grammar rule matching a given construct. 1165Such ambiguities are often caused by "special case constructs" which may be 1166given priority by simply listing the more specific rules ahead of the more 1167general ones. 1168 1169For instance, the following is an excerpt from the grammar describing the 1170input language of the UNIX equation formatter EQN: 1171 1172%right SUB SUP 1173%% 1174expr : expr SUB expr SUP expr 1175 | expr SUB expr 1176 | expr SUP expr 1177 ; 1178 1179Here, the SUB and SUP operator symbols denote sub- and superscript, 1180respectively. The rationale behind this example is that an expression 1181involving both sub- and superscript is often set differently from a 1182superscripted subscripted expression. This special case is therefore 1183caught by the first rule in the above example which causes a reduce/reduce 1184conflict with rule 3 in expressions like expr-1 SUB expr-2 SUP expr-3. 1185The conflict is resolved in favour of the first rule. 1186 1187In both cases discussed above, the ambiguities could also be eliminated 1188by rewriting the grammar accordingly (although this yields more complicated 1189and less readable grammars). This may not always be the case. Often 1190ambiguities are also caused by design errors in the grammar. Hence, if 1191TP Yacc reports any parsing conflicts when constructing the parser, you 1192should use the -v option to generate the parser description (.lst file) 1193and check whether TP Yacc resolved the conflicts correctly. 1194 1195There is one type of syntactic constructs for which one often deliberately 1196uses an ambigious grammar as a more concise representation for a language 1197that could also be specified unambigiously: the syntax of expressions. 1198For instance, the following is an unambigious grammar for simple arithmetic 1199expressions: 1200 1201%token NUM 1202 1203%% 1204 1205expr : term 1206 | expr '+' term 1207 ; 1208 1209term : factor 1210 | term '*' factor 1211 ; 1212 1213factor : '(' expr ')' 1214 | NUM 1215 ; 1216 1217You may check yourself that this grammar gives * a higher precedence than 1218+ and makes both operators left-associative. The same effect can be achieved 1219with the following ambigious grammar using precedence definitions: 1220 1221%token NUM 1222%left '+' 1223%left '*' 1224%% 1225expr : expr '+' expr 1226 | expr '*' expr 1227 | '(' expr ')' 1228 | NUM 1229 ; 1230 1231Without the precedence definitions, this is an ambigious grammar causing 1232a number of shift/reduce conflicts. The precedence definitions are used 1233to correctly resolve these conflicts (conflicts resolved using precedence 1234will not be reported by TP Yacc). 1235 1236Each precedence definition introduces a new precedence level (lowest 1237precedence first) and specifies whether the corresponding operators 1238should be left-, right- or nonassociative (nonassociative operators 1239cannot be combined at all; example: relational operators in Pascal). 1240 1241TP Yacc uses precedence information to resolve shift/reduce conflicts as 1242follows. Precedences are associated with each terminal occuring in a 1243precedence definition. Furthermore, each grammar rule is given the 1244precedence of its rightmost terminal (this default choice can be 1245overwritten using a %prec tag; see below). To resolve a shift/reduce 1246conflict using precedence, both the symbol and the rule involved must 1247have been assigned precedences. TP Yacc then chooses the parse action 1248as follows: 1249 1250- If the symbol has higher precedence than the rule: shift. 1251 1252- If the rule has higher precedence than the symbol: reduce. 1253 1254- If symbol and rule have the same precedence, the associativity of the 1255 symbol determines the parse action: if the symbol is left-associative: 1256 reduce; if the symbol is right-associative: shift; if the symbol is 1257 non-associative: error. 1258 1259To give you an idea of how this works, let us consider our ambigious 1260arithmetic expression grammar (without precedences): 1261 1262%token NUM 1263%% 1264expr : expr '+' expr 1265 | expr '*' expr 1266 | '(' expr ')' 1267 | NUM 1268 ; 1269 1270This grammar generates four shift/reduce conflicts. The description 1271of state 8 reads as follows: 1272 1273state 8: 1274 1275 *** conflicts: 1276 1277 shift 4, reduce 1 on '*' 1278 shift 5, reduce 1 on '+' 1279 1280 expr : expr '+' expr _ (1) 1281 expr : expr _ '+' expr 1282 expr : expr _ '*' expr 1283 1284 '*' shift 4 1285 '+' shift 5 1286 $end reduce 1 1287 ')' reduce 1 1288 . error 1289 1290In this state, we have successfully parsed a + expression (rule 1). When 1291the next symbol is + or *, we have the choice between the reduction and 1292shifting the symbol. Using the default shift/reduce disambiguating rule, 1293TP Yacc has resolved these conflicts in favour of shift. 1294 1295Now let us assume the above precedence definition: 1296 1297 %left '+' 1298 %left '*' 1299 1300which gives * higher precedence than + and makes both operators left- 1301associative. The rightmost terminal in rule 1 is +. Hence, given these 1302precedence definitions, the first conflict will be resolved in favour 1303of shift (* has higher precedence than +), while the second one is resolved 1304in favour of reduce (+ is left-associative). 1305 1306Similar conflicts arise in state 7: 1307 1308state 7: 1309 1310 *** conflicts: 1311 1312 shift 4, reduce 2 on '*' 1313 shift 5, reduce 2 on '+' 1314 1315 expr : expr '*' expr _ (2) 1316 expr : expr _ '+' expr 1317 expr : expr _ '*' expr 1318 1319 '*' shift 4 1320 '+' shift 5 1321 $end reduce 2 1322 ')' reduce 2 1323 . error 1324 1325Here, we have successfully parsed a * expression which may be followed 1326by another + or * operator. Since * is left-associative and has higher 1327precedence than +, both conflicts will be resolved in favour of reduce. 1328 1329Of course, you can also have different operators on the same precedence 1330level. For instance, consider the following extended version of the 1331arithmetic expression grammar: 1332 1333%token NUM 1334%left '+' '-' 1335%left '*' '/' 1336%% 1337expr : expr '+' expr 1338 | expr '-' expr 1339 | expr '*' expr 1340 | expr '/' expr 1341 | '(' expr ')' 1342 | NUM 1343 ; 1344 1345This puts all "addition" operators on the first and all "multiplication" 1346operators on the second precedence level. All operators are left-associative; 1347for instance, 5+3-2 will be parsed as (5+3)-2. 1348 1349By default, TP Yacc assigns each rule the precedence of its rightmost 1350terminal. This is a sensible decision in most cases. Occasionally, it 1351may be necessary to overwrite this default choice and explicitly assign 1352a precedence to a rule. This can be done by putting a precedence tag 1353of the form 1354 1355 %prec symbol 1356 1357at the end of the corresponding rule which gives the rule the precedence 1358of the specified symbol. For instance, to extend the expression grammar 1359with a unary minus operator, giving it highest precedence, you may write: 1360 1361%token NUM 1362%left '+' '-' 1363%left '*' '/' 1364%right UMINUS 1365%% 1366expr : expr '+' expr 1367 | expr '-' expr 1368 | expr '*' expr 1369 | expr '/' expr 1370 | '-' expr %prec UMINUS 1371 | '(' expr ')' 1372 | NUM 1373 ; 1374 1375Note the use of the UMINUS token which is not an actual input symbol but 1376whose sole purpose it is to give unary minus its proper precedence. If 1377we omitted the precedence tag, both unary and binary minus would have the 1378same precedence because they are represented by the same input symbol. 1379 1380 1381Error Handling 1382-------------- 1383 1384Syntactic error handling is a difficult area in the design of user-friendly 1385parsers. Usually, you will not like to have the parser give up upon the 1386first occurrence of an errorneous input symbol. Instead, the parser should 1387recover from a syntax error, that is, it should try to find a place in the 1388input where it can resume the parse. 1389 1390TP Yacc provides a general mechanism to implement parsers with error 1391recovery. A special predefined "error" token may be used in grammar rules 1392to indicate positions where syntax errors might occur. When the parser runs 1393into an error action (i.e., reads an errorneous input symbol) it prints out 1394an error message and starts error recovery by popping its stack until it 1395uncovers a state in which there is a shift action on the error token. If 1396there is no such state, the parser terminates with return value 1, indicating 1397an unrecoverable syntax error. If there is such a state, the parser takes the 1398shift on the error token (pretending it has seen an imaginary error token in 1399the input), and resumes parsing in a special "error mode." 1400 1401While in error mode, the parser quietly skips symbols until it can again 1402perform a legal shift action. To prevent a cascade of error messages, the 1403parser returns to its normal mode of operation only after it has seen 1404and shifted three legal input symbols. Any additional error found after 1405the first shifted symbol restarts error recovery, but no error message 1406is printed. The TP Yacc library routine yyerrok may be used to reset the 1407parser to its normal mode of operation explicitly. 1408 1409For a simple example, consider the rule 1410 1411stmt : error ';' { yyerrok; } 1412 1413and assume a syntax error occurs while a statement (nonterminal stmt) is 1414parsed. The parser prints an error message, then pops its stack until it 1415can shift the token error of the error rule. Proceeding in error mode, it 1416will skip symbols until it finds a semicolon, then reduces by the error 1417rule. The call to yyerrok tells the parser that we have recovered from 1418the error and that it should proceed with the normal parse. This kind of 1419"panic mode" error recovery scheme works well when statements are always 1420terminated with a semicolon. The parser simply skips the "bad" statement 1421and then resumes the parse. 1422 1423Implementing a good error recovery scheme can be a difficult task; see 1424Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic. 1425Schreiner and Friedman have developed a systematic technique to implement 1426error recovery with Yacc which I found quite useful (I used it myself 1427to implement error recovery in the TP Yacc parser); see Schreiner/Friedman 1428(1985). 1429 1430 1431Yacc Library 1432------------ 1433 1434The TP Yacc library (YaccLib) unit provides some global declarations used 1435by the parser routine yyparse, and some variables and utility routines 1436which may be used to control the actions of the parser and to implement 1437error recovery. See the file yacclib.pas for a description of these 1438variables and routines. 1439 1440You can also modify the Yacc library unit (and/or the code template in the 1441yyparse.cod file) to customize TP Yacc to your target applications. 1442 1443 1444Other Features 1445-------------- 1446 1447TP Yacc supports all additional language elements entitled as "Old Features 1448Supported But not Encouraged" in the UNIX manual, which are provided for 1449backward compatibility with older versions of (UNIX) Yacc: 1450 1451- literals delimited by double quotes. 1452 1453- multiple-character literals. Note that these are not treated as character 1454 sequences but represent single tokens which are given a symbolic integer 1455 code just like any other token identifier. However, they will not be 1456 declared in the output file, so you have to make sure yourself that 1457 the lexical analyzer returns the correct codes for these symbols. E.g., 1458 you might explicitly assign token numbers by using a definition like 1459 1460 %token ':=' 257 1461 1462 at the beginning of the Yacc grammar. 1463 1464- \ may be used instead of %, i.e. \\ means %%, \left is the same as %left, 1465 etc. 1466 1467- other synonyms: 1468 %< for %left 1469 %> for %right 1470 %binary or %2 for %nonassoc 1471 %term or %0 for %token 1472 %= for %prec 1473 1474- actions may also be written as = { ... } or = single-statement; 1475 1476- Turbo Pascal declarations (%{ ... %}) may be put at the beginning of the 1477 rules section. They will be treated as local declarations of the actions 1478 routine. 1479 1480 1481Implementation Restrictions 1482--------------------------- 1483 1484As with TP Lex, internal table sizes and the main memory available limit the 1485complexity of source grammars that TP Yacc can handle. However, the maximum 1486table sizes provided by TP Yacc are large enough to handle quite complex 1487grammars (such as the Pascal grammar in the TP Yacc distribution). The actual 1488table sizes are shown in the statistics printed by TP Yacc when a compilation 1489is finished. The given figures are "s" (states), "i" (LR0 kernel items), "t" 1490(shift and goto transitions) and "r" (reductions). 1491 1492The default stack size of the generated parsers is yymaxdepth = 1024, as 1493declared in the TP Yacc library unit. This should be sufficient for any 1494average application, but you can change the stack size by including a 1495corresponding declaration in the definitions part of the Yacc grammar 1496(or change the value in the YaccLib unit). Note that right-recursive 1497grammar rules may increase stack space requirements, so it is a good 1498idea to use left-recursive rules wherever possible. 1499 1500 1501Differences from UNIX Yacc 1502-------------------------- 1503 1504Major differences between TP Yacc and UNIX Yacc are listed below. 1505 1506- TP Yacc produces output code for Turbo Pascal, rather than for C. 1507 1508- TP Yacc does not support %union definitions. Instead, a value type is 1509 declared by specifying the type identifier itself as the tag of a %token 1510 or %type definition. TP Yacc will automatically generate an appropriate 1511 variant record type (YYSType) which is capable of holding values of any 1512 of the types used in %token and %type. 1513 1514 Type checking is very strict. If you use type definitions, then 1515 any symbol referred to in an action must have a type introduced 1516 in a type definition. Either the symbol must have been assigned a 1517 type in the definitions section, or the $<type-identifier> notation 1518 must be used. The syntax of the %type definition has been changed 1519 slightly to allow definitions of the form 1520 %type <type-identifier> 1521 (omitting the nonterminals) which may be used to declare types which 1522 are not assigned to any grammar symbol, but are used with the 1523 $<...> construct. 1524 1525- The parse tables constructed by this Yacc version are slightly greater 1526 than those constructed by UNIX Yacc, since a reduce action will only be 1527 chosen as the default action if it is the only action in the state. 1528 In difference, UNIX Yacc chooses a reduce action as the default action 1529 whenever it is the only reduce action of the state (even if there are 1530 other shift actions). 1531 1532 This solves a bug in UNIX Yacc that makes the generated parser start 1533 error recovery too late with certain types of error productions (see 1534 also Schreiner/Friedman, "Introduction to compiler construction with 1535 UNIX," 1985). Also, errors will be caught sooner in most cases where 1536 UNIX Yacc would carry out an additional (default) reduction before 1537 detecting the error. 1538 1539- Library routines are named differently from the UNIX version (e.g., 1540 the `yyerrlab' routine takes the place of the `YYERROR' macro of UNIX 1541 Yacc), and, of course, all macros of UNIX Yacc (YYERROR, YYACCEPT, etc.) 1542 had to be implemented as procedures. 1543 1544 1545