1
2
3      TP Lex and Yacc - The Compiler Writer's Tools for Turbo Pascal
4      == === === ==== = === ======== ======== ===== === ===== ======
5
6                     Version 4.1 User Manual
7                     ======= === ==== ======
8
9                         Albert Graef
10                 Department of Musicinformatics
11               Johannes Gutenberg-University Mainz
12
13               ag@muwiinfa.geschichte.uni-mainz.de
14
15                          April 1998
16
17
18Introduction
19============
20
21This document describes the TP Lex and Yacc compiler generator toolset. These
22tools are designed especially to help you prepare compilers and similar
23programs like text processing utilities and command language interpreters with
24the Turbo Pascal (TM) programming language.
25
26TP Lex and Yacc are Turbo Pascal adaptions of the well-known UNIX (TM)
27utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson at
28Bell Laboratories, and are used with the C programming language. TP Lex and
29Yacc are intended to be approximately "compatible" with these programs.
30However, they are an independent development of the author, based on the
31techniques described in the famous "dragon book" of Aho, Sethi and Ullman
32(Aho, Sethi, Ullman: "Compilers : principles, techniques and tools," Reading
33(Mass.), Addison-Wesley, 1986).
34
35Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland
36Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo
37Pascal-compatible compiler which currently runs on DOS and Linux (other ports
38are under development). Recent information about TP Lex/Yacc, and the sources
39are available from the TPLY homepage:
40
41   http://www.musikwissenschaft.uni-mainz.de/~ag/tply
42
43For information about the Free Pascal Compiler, please refer to:
44
45   http://www.freepascal.org
46
47TP Lex and Yacc, like any other tools of this kind, are not intended for
48novices or casual programmers; they require extensive programming experience
49as well as a thorough understanding of the principles of parser design and
50implementation to be put to work successfully. But if you are a seasoned Turbo
51Pascal programmer with some background in compiler design and formal language
52theory, you will almost certainly find TP Lex and Yacc to be a powerful
53extension of your Turbo Pascal toolset.
54
55This manual tells you how to get started with the TP Lex and Yacc programs and
56provides a short description of these programs. Some knowledge about the C
57versions of Lex and Yacc will be useful, although not strictly necessary. For
58further reading, you may also refer to:
59
60- Aho, Sethi and Ullman: "Compilers : principles, techniques and tools."
61  Reading (Mass.), Addison-Wesley, 1986.
62
63- Johnson, S.C.: "Yacc - yet another compiler-compiler." CSTR-32, Bell
64  Telephone Laboratories, 1974.
65
66- Lesk, M.E.: "Lex - a lexical analyser generator." CSTR-39, Bell Telephone
67  Laboratories, 1975.
68
69- Schreiner, Friedman: "Introduction to compiler construction with UNIX."
70  Prentice-Hall, 1985.
71
72- The Unix Programmer's Manual, Sections `Lex' and `Yacc'.
73
74
75Credits
76-------
77
78I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex
79and Yacc to take advantage of the large memory models in Borland Pascal 7.0
80and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be),
81the maintainer of the Linux version of the Free Pascal compiler, who is
82responsible for the Free Pascal port. And of course thanks are due to the many
83TP Lex/Yacc users all over the world for their support and comments which
84helped to improve these programs.
85
86
87Getting Started
88---------------
89
90Instructions on how to compile and install TP Lex and Yacc on all supported
91platforms can be found in the README file contained in the distribution.
92
93Once you have installed TP Lex and Yacc on your system, you can compile your
94first TP Lex and Yacc program expr. Expr is a simple desktop calculator
95program contained in the distribution, which consists of a lexical analyzer in
96the TP Lex source file exprlex.l and the parser and main program in the TP
97Yacc source file expr.y. To compile these programs, issue the commands
98
99   lex exprlex
100   yacc expr
101
102That's it! You now have the Turbo Pascal sources (exprlex.pas and expr.pas)
103for the expr program. Use the Turbo Pascal compiler to compile these programs
104as usual:
105
106   tpc expr
107
108(Of course, the precise compilation command depends on the type of compiler
109you are using. Thus you may have to replace tpc with bpc, dcc or dcc32,
110depending on the version of the Turbo/Borland/Delphi compiler you have, and
111with ppc386 for the Free Pascal compiler. If you are using TP Lex and Yacc
112with Free Pascal under Linux, the corresponding commands are:
113
114   plex exprlex
115   pyacc expr
116   ppc386 expr
117
118Note that in the Linux version, the programs are named plex and pyacc to
119avoid name clashes with the corresponding UNIX utilities.)
120
121Having compiled expr.pas, you can execute the expr program and type some
122expressions to see it work (terminate the program with an empty line). There
123is a number of other sample TP Lex and Yacc programs (.l and .y files) in the
124distribution, including a TP Yacc cross reference utility and a complete
125parser for Standard Pascal.
126
127The TP Lex and Yacc programs recognize some options which may be specified
128anywhere on the command line. E.g.,
129
130   lex -o exprlex
131
132runs TP Lex with "DFA optimization" and
133
134   yacc -v expr
135
136runs TP Yacc in "verbose" mode (TP Yacc generates a readable description of
137the generated parser).
138
139The TP Lex and Yacc programs use the following default filename extensions:
140- .l:   TP Lex input files
141- .y:   TP Yacc input files
142- .pas: TP Lex and Yacc output files
143
144As usual, you may overwrite default filename extensions by explicitly
145specifying suffixes.
146
147If you ever forget how to run TP Lex and Yacc, you can issue the command lex
148or yacc (resp. plex or pyacc) without arguments to get a short summary of the
149command line syntax.
150
151
152
153TP Lex
154======
155
156This section describes the TP Lex lexical analyzer generator.
157
158
159Usage
160-----
161
162lex [options] lex-file[.l] [output-file[.pas]]
163
164
165Options
166-------
167
168-v  "Verbose:" Lex generates a readable description of the generated
169    lexical analyzer, written to lex-file with new extension `.lst'.
170
171-o  "Optimize:" Lex optimizes DFA tables to produce a minimal DFA.
172
173
174Description
175-----------
176
177TP Lex is a program generator that is used to generate the Turbo Pascal source
178code for a lexical analyzer subroutine from the specification of an input
179language by a regular expression grammar.
180
181TP Lex parses the source grammar contained in lex-file (with default suffix
182.l) and writes the constructed lexical analyzer subroutine to the specified
183output-file (with default suffix .pas); if no output file is specified, output
184goes to lex-file with new suffix .pas. If any errors are found during
185compilation, error messages are written to the list file (lex-file with new
186suffix .lst).
187
188The generated output file contains a lexical analyzer routine, yylex,
189implemented as:
190
191  function yylex : Integer;
192
193This routine has to be called by your main program to execute the lexical
194analyzer. The return value of the yylex routine usually denotes the number
195of a token recognized by the lexical analyzer (see the return routine in the
196LexLib unit). At end-of-file the yylex routine normally returns 0.
197
198The code template for the yylex routine may be found in the yylex.cod
199file. This file is needed by TP Lex when it constructs the output file. It
200must be present either in the current directory or in the directory from which
201TP Lex was executed (TP Lex searches these directories in the indicated
202order). (NB: For the Linux/Free Pascal version, the code template is searched
203in some directory defined at compile-time instead of the execution path,
204usually /usr/lib/fpc/lexyacc.)
205
206The TP Lex library (LexLib) unit is required by programs using Lex-generated
207lexical analyzers; you will therefore have to put an appropriate uses clause
208into your program or unit that contains the lexical analyzer routine. The
209LexLib unit also provides various useful utility routines; see the file
210lexlib.pas for further information.
211
212
213Lex Source
214----------
215
216A TP Lex program consists of three sections separated with the %% delimiter:
217
218definitions
219%%
220rules
221%%
222auxiliary procedures
223
224All sections may be empty. The TP Lex language is line-oriented; definitions
225and rules are separated by line breaks. There is no special notation for
226comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
227fragments (see below).
228
229The definitions section may contain the following elements:
230
231- regular definitions in the format:
232
233     name   substitution
234
235  which serve to abbreviate common subexpressions. The {name} notation
236  causes the corresponding substitution from the definitions section to
237  be inserted into a regular expression. The name must be a legal
238  identifier (letter followed by a sequence of letters and digits;
239  the underscore counts as a letter; upper- and lowercase are distinct).
240  Regular definitions must be non-recursive.
241
242- start state definitions in the format:
243
244     %start name ...
245
246  which are used in specifying start conditions on rules (described
247  below). The %start keyword may also be abbreviated as %s or %S.
248
249- Turbo Pascal declarations enclosed between %{ and %}. These will be
250  inserted into the output file (at global scope). Also, any line that
251  does not look like a Lex definition (e.g., starts with blank or tab)
252  will be treated as Turbo Pascal code. (In particular, this also allows
253  you to include Turbo Pascal comments in your Lex program.)
254
255The rules section of a TP Lex program contains the actual specification of
256the lexical analyzer routine. It may be thought of as a big CASE statement
257discriminating over the different patterns to be matched and listing the
258corresponding statements (actions) to be executed. Each rule consists of a
259regular expression describing the strings to be matched in the input, and a
260corresponding action, a Turbo Pascal statement to be executed when the
261expression matches. Expression and statement are delimited with whitespace
262(blanks and/or tabs). Thus the format of a Lex grammar rule is:
263
264   expression      statement;
265
266Note that the action must be a single Turbo Pascal statement terminated
267with a semicolon (use begin ... end for compound statements). The statement
268may span multiple lines if the successor lines are indented with at least
269one blank or tab. The action may also be replaced by the | character,
270indicating that the action for this rule is the same as that for the next
271one.
272
273The TP Lex library unit provides various variables and routines which are
274useful in the programming of actions. In particular, the yytext string
275variable holds the text of the matched string, and the yyleng Byte variable
276its length.
277
278Regular expressions are used to describe the strings to be matched in a
279grammar rule. They are built from the usual constructs describing character
280classes and sequences, and operators specifying repetitions and alternatives.
281The precise format of regular expressions is described in the next section.
282
283The rules section may also start with some Turbo Pascal declarations
284(enclosed in %{ %}) which are treated as local declarations of the
285actions routine.
286
287Finally, the auxiliary procedures section may contain arbitrary Turbo
288Pascal code (such as supporting routines or a main program) which is
289simply tacked on to the end of the output file. The auxiliary procedures
290section is optional.
291
292
293Regular Expressions
294-------------------
295
296The following table summarizes the format of the regular expressions
297recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
298c stands for a single character, s for a string, r for a regular expression,
299and n,m for nonnegative integers.
300
301expression   matches                        example
302----------   ----------------------------   -------
303c            any non-operator character c   a
304\c           character c literally          \*
305"s"          string s literally             "**"
306.            any character but newline      a.*b
307^            beginning of line              ^abc
308$            end of line                    abc$
309[s]          any character in s             [abc]
310[^s]         any character not in s         [^abc]
311r*           zero or more r's               a*
312r+           one or more r's                a+
313r?           zero or one r                  a?
314r{m,n}       m to n occurrences of r        a{1,5}
315r{m}         m occurrences of r             a{5}
316r1r2         r1 then r2                     ab
317r1|r2        r1 or r2                       a|b
318(r)          r                              (a|b)
319r1/r2        r1 when followed by r2         a/b
320<x>r         r when in start condition x    <x>abc
321---------------------------------------------------
322
323The operators *, +, ? and {} have highest precedence, followed by
324concatenation. The | operator has lowest precedence. Parentheses ()
325may be used to group expressions and overwrite default precedences.
326The <> and / operators may only occur once in an expression.
327
328The usual C-like escapes are recognized:
329
330\n     denotes newline
331\r     denotes carriage return
332\t     denotes tab
333\b     denotes backspace
334\f     denotes form feed
335\NNN   denotes character no. NNN in octal base
336
337You can also use the \ character to quote characters which would otherwise
338be interpreted as operator symbols. In character classes, you may use
339the - character to denote ranges of characters. For instance, [a-z]
340denotes the class of all lowercase letters.
341
342The expressions in a TP Lex program may be ambigious, i.e. there may be inputs
343which match more than one rule. In such a case, the lexical analyzer prefers
344the longest match and, if it still has the choice between different rules,
345it picks the first of these. If no rule matches, the lexical analyzer
346executes a default action which consists of copying the input character
347to the output unchanged. Thus, if the purpose of a lexical analyzer is
348to translate some parts of the input, and leave the rest unchanged, you
349only have to specify the patterns which have to be treated specially. If,
350however, the lexical analyzer has to absorb its whole input, you will have
351to provide rules that match everything. E.g., you might use the rules
352
353   .   |
354   \n  ;
355
356which match "any other character" (and ignore it).
357
358Sometimes certain patterns have to be analyzed differently depending on some
359amount of context in which the pattern appears. In such a case the / operator
360is useful. For instance, the expression a/b matches a, but only if followed
361by b. Note that the b does not belong to the match; rather, the lexical
362analyzer, when matching an a, will look ahead in the input to see whether
363it is followed by a b, before it declares that it has matched an a. Such
364lookahead may be arbitrarily complex (up to the size of the LexLib input
365buffer). E.g., the pattern a/.*b matches an a which is followed by a b
366somewhere on the same input line. TP Lex also has a means to specify left
367context which is described in the next section.
368
369
370Start Conditions
371----------------
372
373TP Lex provides some features which make it possible to handle left context.
374The ^ character at the beginning of a regular expression may be used to
375denote the beginning of the line. More distant left context can be described
376conveniently by using start conditions on rules.
377
378Any rule which is prefixed with the <> construct is only valid if the lexical
379analyzer is in the denoted start state. For instance, the expression <x>a
380can only be matched if the lexical analyzer is in start state x. You can have
381multiple start states in a rule; e.g., <x,y>a can be matched in start states
382x or y.
383
384Start states have to be declared in the definitions section by means of
385one or more start state definitions (see above). The lexical analyzer enters
386a start state through a call to the LexLib routine start. E.g., you may
387write:
388
389%start x y
390%%
391<x>a    start(y);
392<y>b    start(x);
393%%
394begin
395  start(x); if yylex=0 then ;
396end.
397
398Upon initialization, the lexical analyzer is put into state x. It then
399proceeds in state x until it matches an a which puts it into state y.
400In state y it may match a b which puts it into state x again, etc.
401
402Start conditions are useful when certain constructs have to be analyzed
403differently depending on some left context (such as a special character
404at the beginning of the line), and if multiple lexical analyzers have to
405work in concert. If a rule is not prefixed with a start condition, it is
406valid in all user-defined start states, as well as in the lexical analyzer's
407default start state.
408
409
410Lex Library
411-----------
412
413The TP Lex library (LexLib) unit provides various variables and routines
414which are used by Lex-generated lexical analyzers and application programs.
415It provides the input and output streams and other internal data structures
416used by the lexical analyzer routine, and supplies some variables and utility
417routines which may be used by actions and application programs. Refer to
418the file lexlib.pas for a closer description.
419
420You can also modify the Lex library unit (and/or the code template in the
421yylex.cod file) to customize TP Lex to your target applications. E.g.,
422you might wish to optimize the code of the lexical analyzer for some
423special application, make the analyzer read from/write to memory instead
424of files, etc.
425
426
427Implementation Restrictions
428---------------------------
429
430Internal table sizes and the main memory available limit the complexity of
431source grammars that TP Lex can handle. There is currently no possibility to
432change internal table sizes (apart from modifying the sources of TP Lex
433itself), but the maximum table sizes provided by TP Lex seem to be large
434enough to handle most realistic applications. The actual table sizes depend on
435the particular implementation (they are much larger than the defaults if TP
436Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
437Free Pascal), and are shown in the statistics printed by TP Lex when a
438compilation is finished. The units given there are "p" (positions, i.e. items
439in the position table used to construct the DFA), "s" (DFA states) and "t"
440(transitions of the generated DFA).
441
442As implemented, the generated DFA table is stored as a typed array constant
443which is inserted into the yylex.cod code template. The transitions in each
444state are stored in order. Of course it would have been more efficient to
445generate a big CASE statement instead, but I found that this may cause
446problems with the encoding of large DFA tables because Turbo Pascal has
447a quite rigid limit on the code size of individual procedures. I decided to
448use a scheme in which transitions on different symbols to the same state are
449merged into one single transition (specifying a character set and the
450corresponding next state). This keeps the number of transitions in each state
451quite small and still allows a fairly efficient access to the transition
452table.
453
454The TP Lex program has an option (-o) to optimize DFA tables. This causes a
455minimal DFA to be generated, using the algorithm described in Aho, Sethi,
456Ullman (1986). Although the absolute limit on the number of DFA states that TP
457Lex can handle is at least 300, TP Lex poses an additional restriction (100)
458on the number of states in the initial partition of the DFA optimization
459algorithm. Thus, you may get a fatal `integer set overflow' message when using
460the -o option even when TP Lex is able to generate an unoptimized DFA. In such
461cases you will just have to be content with the unoptimized DFA. (Hopefully,
462this will be fixed in a future version. Anyhow, using the merged transitions
463scheme described above, TP Lex usually constructs unoptimized DFA's which are
464not far from being optimal, and thus in most cases DFA optimization won't have
465a great impact on DFA table sizes.)
466
467
468Differences from UNIX Lex
469-------------------------
470
471Major differences between TP Lex and UNIX Lex are listed below.
472
473- TP Lex produces output code for Turbo Pascal, rather than for C.
474
475- Character tables (%T) are not supported; neither are any directives
476  to determine internal table sizes (%p, %n, etc.).
477
478- Library routines are named differently from the UNIX version (e.g.,
479  the `start' routine takes the place of the `BEGIN' macro of UNIX
480  Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
481  to be implemented as procedures.
482
483- The TP Lex library unit starts counting line numbers at 0, incrementing
484  the count BEFORE a line is read (in contrast, UNIX Lex initializes
485  yylineno to 1 and increments it AFTER the line end has been read). This
486  is motivated by the way in which TP Lex maintains the current line,
487  and will not affect your programs unless you explicitly reset the
488  yylineno value (e.g., when opening a new input file). In such a case
489  you should set yylineno to 0 rather than 1.
490
491
492
493
494TP Yacc
495=======
496
497This section describes the TP Yacc compiler compiler.
498
499
500Usage
501-----
502
503yacc [options] yacc-file[.y] [output-file[.pas]]
504
505
506Options
507-------
508
509-v  "Verbose:" TP Yacc generates a readable description of the generated
510    parser, written to yacc-file with new extension .lst.
511
512-d  "Debug:" TP Yacc generates parser with debugging output.
513
514
515Description
516-----------
517
518TP Yacc is a program that lets you prepare parsers from the description
519of input languages by BNF-like grammars. You simply specify the grammar
520for your target language, augmented with the Turbo Pascal code necessary
521to process the syntactic constructs, and TP Yacc translates your grammar
522into the Turbo Pascal code for a corresponding parser subroutine named
523yyparse.
524
525TP Yacc parses the source grammar contained in yacc-file (with default
526suffix .y) and writes the constructed parser subroutine to the specified
527output-file (with default suffix .pas); if no output file is specified,
528output goes to yacc-file with new suffix .pas. If any errors are found
529during compilation, error messages are written to the list file (yacc-file
530with new suffix .lst).
531
532The generated parser routine, yyparse, is declared as:
533
534   function yyparse : Integer;
535
536This routine may be called by your main program to execute the parser.
537The return value of the yyparse routine denotes success or failure of
538the parser (possible return values: 0 = success, 1 = unrecoverable syntax
539error or parse stack overflow).
540
541Similar to TP Lex, the code template for the yyparse routine may be found in
542the yyparse.cod file. The rules for locating this file are analogous to those
543of TP Lex (see Section `TP Lex').
544
545The TP Yacc library (YaccLib) unit is required by programs using Yacc-
546generated parsers; you will therefore have to put an appropriate uses clause
547into your program or unit that contains the parser routine. The YaccLib unit
548also provides some routines which may be used to control the actions of the
549parser. See the file yacclib.pas for further information.
550
551
552Yacc Source
553-----------
554
555A TP Yacc program consists of three sections separated with the %% delimiter:
556
557definitions
558%%
559rules
560%%
561auxiliary procedures
562
563
564The TP Yacc language is free-format: whitespace (blanks, tabs and newlines)
565is ignored, except if it serves as a delimiter. Comments have the C-like
566format /* ... */. They are treated as whitespace. Grammar symbols are denoted
567by identifiers which have the usual form (letter, including underscore,
568followed by a sequence of letters and digits; upper- and lowercase is
569distinct). The TP Yacc language also has some keywords which always start
570with the % character. Literals are denoted by characters enclosed in single
571quotes. The usual C-like escapes are recognized:
572
573\n     denotes newline
574\r     denotes carriage return
575\t     denotes tab
576\b     denotes backspace
577\f     denotes form feed
578\NNN   denotes character no. NNN in octal base
579
580
581Definitions
582-----------
583
584The first section of a TP Yacc grammar serves to define the symbols used in
585the grammar. It may contain the following types of definitions:
586
587- start symbol definition: A definition of the form
588
589     %start symbol
590
591  declares the start nonterminal of the grammar (if this definition is
592  omitted, TP Yacc assumes the left-hand side nonterminal of the first
593  grammar rule as the start symbol of the grammar).
594
595- terminal definitions: Definitions of the form
596
597     %token symbol ...
598
599  are used to declare the terminal symbols ("tokens") of the target
600  language. Any identifier not introduced in a %token definition will
601  be treated as a nonterminal symbol.
602
603  As far as TP Yacc is concerned, tokens are atomic symbols which do not
604  have an innert structure. A lexical analyzer must be provided which
605  takes on the task of tokenizing the input stream and return the
606  individual tokens and literals to the parser (see Section `Lexical
607  Analysis').
608
609- precedence definitions: Operator symbols (terminals) may be associated
610  with a precedence by means of a precedence definition which may have
611  one of the following forms
612
613     %left symbol ...
614     %right symbol ...
615     %nonassoc symbol ...
616
617  which are used to declare left-, right- and nonassociative operators,
618  respectively. Each precedence definition introduces a new precedence
619  level, lowest precedence first. E.g., you may write:
620
621     %nonassoc '<' '>' '=' GEQ LEQ NEQ  /* relational operators */
622     %left     '+' '-'  OR              /* addition operators */
623     %left     '*' '/' AND              /* multiplication operators */
624     %right    NOT UMINUS               /* unary operators */
625
626  A terminal identifier introduced in a precedence definition may, but
627  need not, appear in a %token definition as well.
628
629- type definitions: Any (terminal or nonterminal) grammar symbol may be
630  associated with a type identifier which is used in the processing of
631  semantic values. Type tags of the form <name> may be used in token and
632  precedence definitions to declare the type of a terminal symbol, e.g.:
633
634     %token <Real>  NUM
635     %left  <AddOp> '+' '-'
636
637  To declare the type of a nonterminal symbol, use a type definition of
638  the form:
639
640     %type <name> symbol ...
641
642  e.g.:
643
644     %type <Real> expr
645
646  In a %type definition, you may also omit the nonterminals, i.e. you
647  may write:
648
649     %type <name>
650
651  This is useful when a given type is only used with type casts (see
652  Section `Grammar Rules and Actions'), and is not associated with a
653  specific nonterminal.
654
655- Turbo Pascal declarations: You may also include arbitrary Turbo Pascal
656  code in the definitions section, enclosed in %{ %}. This code will be
657  inserted as global declarations into the output file, unchanged.
658
659
660Grammar Rules and Actions
661-------------------------
662
663The second part of a TP Yacc grammar contains the grammar rules for the
664target language. Grammar rules have the format
665
666   name : symbol ... ;
667
668The left-hand side of a rule must be an identifier (which denotes a
669nonterminal symbol). The right-hand side may be an arbitrary (possibly
670empty) sequence of nonterminal and terminal symbols (including literals
671enclosed in single quotes). The terminating semicolon may also be omitted.
672Different rules for the same left-hand side symbols may be written using
673the | character to separate the different alternatives:
674
675   name : symbol ...
676        | symbol ...
677        ...
678        ;
679
680For instance, to specify a simple grammar for arithmetic expressions, you
681may write:
682
683%left '+' '-'
684%left '*' '/'
685%token NUM
686%%
687expr : expr '+' expr
688     | expr '-' expr
689     | expr '*' expr
690     | expr '/' expr
691     | '(' expr ')'
692     | NUM
693     ;
694
695(The %left definitions at the beginning of the grammar are needed to specify
696the precedence and associativity of the operator symbols. This will be
697discussed in more detail in Section `Ambigious Grammars'.)
698
699Grammar rules may contain actions - Turbo Pascal statements enclosed in
700{ } - to be executed as the corresponding rules are recognized. Furthermore,
701rules may return values, and access values returned by other rules. These
702"semantic" values are written as $$ (value of the left-hand side nonterminal)
703and $i (value of the ith right-hand side symbol). They are kept on a special
704value stack which is maintained automatically by the parser.
705
706Values associated with terminal symbols must be set by the lexical analyzer
707(more about this in Section `Lexical Analysis'). Actions of the form $$ := $1
708can frequently be omitted, since it is the default action assumed by TP Yacc
709for any rule that does not have an explicit action.
710
711By default, the semantic value type provided by Yacc is Integer. You can
712also put a declaration like
713
714   %{
715   type YYSType = Real;
716   %}
717
718into the definitions section of your Yacc grammar to change the default value
719type. However, if you have different value types, the preferred method is to
720use type definitions as discussed in Section `Definitions'. When such type
721definitions are given, TP Yacc handles all the necessary details of the
722YYSType definition and also provides a fair amount of type checking which
723makes it easier to find type errors in the grammar.
724
725For instance, we may declare the symbols NUM and expr in the example above
726to be of type Real, and then use these values to evaluate an expression as
727it is parsed.
728
729%left '+' '-'
730%left '*' '/'
731%token <Real> NUM
732%type  <Real> expr
733%%
734expr : expr '+' expr   { $$ := $1+$3; }
735     | expr '-' expr   { $$ := $1-$3; }
736     | expr '*' expr   { $$ := $1*$3; }
737     | expr '/' expr   { $$ := $1/$3; }
738     | '(' expr ')'    { $$ := $2;    }
739     | NUM
740     ;
741
742(Note that we omitted the action of the last rule. The "copy action"
743$$ := $1 required by this rule is automatically added by TP Yacc.)
744
745Actions may not only appear at the end, but also in the middle of a rule
746which is useful to perform some processing before a rule is fully parsed.
747Such actions inside a rule are treated as special nonterminals which are
748associated with an empty right-hand side. Thus, a rule like
749
750   x : y { action; } z
751
752will be treated as:
753
754  x : y $act z
755  $act : { action; }
756
757Actions inside a rule may also access values to the left of the action,
758and may return values by assigning to the $$ value. The value returned
759by such an action can then be accessed by other actions using the usual $i
760notation. E.g., we may write:
761
762   x : y { $$ := 2*$1; } z { $$ := $2+$3; }
763
764which has the effect of setting the value of x to
765
766   2*(the value of y)+(the value of z).
767
768Sometimes it is desirable to access values in enclosing rules. This can be
769done using the notation $i with i<=0. $0 refers to the first value "to the
770left" of the current rule, $-1 to the second, and so on. Note that in this
771case the referenced value depends on the actual contents of the parse stack,
772so you have to make sure that the requested values are always where you
773expect them.
774
775There are some situations in which TP Yacc cannot easily determine the
776type of values (when a typed parser is used). This is true, in particular,
777for values in enclosing rules and for the $$ value in an action inside a
778rule. In such cases you may use a type cast to explicitly specify the type
779of a value. The format for such type casts is $<name>$ (for left-hand side
780values) and $<name>i (for right-hand side values) where name is a type
781identifier (which must occur in a %token, precedence or %type definition).
782
783
784Auxiliary Procedures
785--------------------
786
787The third section of a TP Yacc program is optional. If it is present, it
788may contain any Turbo Pascal code (such as supporting routines or a main
789program) which is tacked on to the end of the output file.
790
791
792Lexical Analysis
793----------------
794
795For any TP Yacc-generated parser, the programmer must supply a lexical
796analyzer routine named yylex which performs the lexical analysis for
797the parser. This routine must be declared as
798
799   function yylex : Integer;
800
801The yylex routine may either be prepared by hand, or by using the lexical
802analyzer generator TP Lex (see Section `TP Lex').
803
804The lexical analyzer must be included in your main program behind the
805parser subroutine (the yyparse code template includes a forward
806definition of the yylex routine such that the parser can access the
807lexical analyzer). For instance, you may put the lexical analyzer
808routine into the auxiliary procedures section of your TP Yacc grammar,
809either directly, or by using the the Turbo Pascal include directive
810($I).
811
812The parser repeatedly calls the yylex routine to tokenize the input
813stream and obtain the individual lexical items in the input. For any
814literal character, the yylex routine has to return the corresponding
815character code. For the other, symbolic, terminals of the input language,
816the lexical analyzer must return corresponding Integer codes. These are
817assigned automatically by TP Yacc in the order in which token definitions
818appear in the definitions section of the source grammar. The lexical
819analyzer can access these values through corresponding Integer constants
820which are declared by TP Yacc in the output file.
821
822For instance, if
823
824   %token NUM
825
826is the first definition in the Yacc grammar, then TP Yacc will create
827a corresponding constant declaration
828
829   const NUM = 257;
830
831in the output file (TP Yacc automatically assigns symbolic token numbers
832starting at 257; 1 thru 255 are reserved for character literals, 0 denotes
833end-of-file, and 256 is reserved for the special error token which will be
834discussed in Section `Error Handling'). This definition may then be used,
835e.g., in a corresponding TP Lex program as follows:
836
837   [0-9]+   return(NUM);
838
839You can also explicitly assign token numbers in the grammar. For this
840purpose, the first occurrence of a token identifier in the definitions
841section may be followed by an unsigned integer. E.g. you may write:
842
843   %token NUM 299
844
845Besides the return value of yylex, the lexical analyzer routine may also
846return an additional semantic value for the recognized token. This value
847is assigned to a variable named "yylval" and may then be accessed in actions
848through the $i notation (see above, Section `Grammar Rules and Actions').
849The yylval variable is of type YYSType (the semantic value type, Integer
850by default); its declaration may be found in the yyparse.cod file.
851
852For instance, to assign an Integer value to a NUM token in the above
853example, we may write:
854
855   [0-9]+   begin
856              val(yytext, yylval, code);
857              return(NUM);
858            end;
859
860This assigns yylval the value of the NUM token (using the Turbo Pascal
861standard procedure val).
862
863If a parser uses tokens of different types (via a %token <name> definition),
864then the yylval variable will not be of type Integer, but instead of a
865corresponding variant record type which is capable of holding all the
866different value types declared in the TP Yacc grammar. In this case, the
867lexical analyzer must assign a semantic value to the corresponding record
868component which is named yy<name> (where <name> stands for the corresponding
869type identifier).
870
871E.g., if token NUM is declared Real:
872
873   %token <Real> NUM
874
875then the value for token NUM must be assigned to yylval.yyReal.
876
877
878How The Parser Works
879--------------------
880
881TP Yacc uses the LALR(1) technique developed by Donald E. Knuth and F.
882DeRemer to construct a simple, efficient, non-backtracking bottom-up
883parser for the source grammar. The LALR parsing technique is described
884in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a
885look at the parser description TP Yacc generates from a small sample
886grammar, to get an idea of how the LALR parsing algorithm works. We
887consider the following simplified version of the arithmetic expression
888grammar:
889
890%token NUM
891%left '+'
892%left '*'
893%%
894expr : expr '+' expr
895     | expr '*' expr
896     | '(' expr ')'
897     | NUM
898     ;
899
900When run with the -v option on the above grammar, TP Yacc generates the
901parser description listed below.
902
903state 0:
904
905	$accept : _ expr $end
906
907	'('	shift 2
908	NUM	shift 3
909	.	error
910
911	expr	goto 1
912
913state 1:
914
915	$accept : expr _ $end
916	expr : expr _ '+' expr
917	expr : expr _ '*' expr
918
919	$end	accept
920	'*'	shift 4
921	'+'	shift 5
922	.	error
923
924state 2:
925
926	expr : '(' _ expr ')'
927
928	'('	shift 2
929	NUM	shift 3
930	.	error
931
932	expr	goto 6
933
934state 3:
935
936	expr : NUM _	(4)
937
938	.	reduce 4
939
940state 4:
941
942	expr : expr '*' _ expr
943
944	'('	shift 2
945	NUM	shift 3
946	.	error
947
948	expr	goto 7
949
950state 5:
951
952	expr : expr '+' _ expr
953
954	'('	shift 2
955	NUM	shift 3
956	.	error
957
958	expr	goto 8
959
960state 6:
961
962	expr : '(' expr _ ')'
963	expr : expr _ '+' expr
964	expr : expr _ '*' expr
965
966	')'	shift 9
967	'*'	shift 4
968	'+'	shift 5
969	.	error
970
971state 7:
972
973	expr : expr '*' expr _	(2)
974	expr : expr _ '+' expr
975	expr : expr _ '*' expr
976
977	.	reduce 2
978
979state 8:
980
981	expr : expr '+' expr _	(1)
982	expr : expr _ '+' expr
983	expr : expr _ '*' expr
984
985	'*'	shift 4
986	$end	reduce 1
987	')'	reduce 1
988	'+'	reduce 1
989	.	error
990
991state 9:
992
993	expr : '(' expr ')' _	(3)
994
995	.	reduce 3
996
997
998Each state of the parser corresponds to a certain prefix of the input
999which has already been seen. The parser description lists the grammar
1000rules wich are parsed in each state, and indicates the portion of each
1001rule which has already been parsed by an underscore. In state 0, the
1002start state of the parser, the parsed rule is
1003
1004	$accept : expr $end
1005
1006This is not an actual grammar rule, but a starting rule automatically
1007added by TP Yacc. In general, it has the format
1008
1009	$accept : X $end
1010
1011where X is the start nonterminal of the grammar, and $end is a pseudo
1012token denoting end-of-input (the $end symbol is used by the parser to
1013determine when it has successfully parsed the input).
1014
1015The description of the start rule in state 0,
1016
1017	$accept : _ expr $end
1018
1019with the underscore positioned before the expr symbol, indicates that
1020we are at the beginning of the parse and are ready to parse an expression
1021(nonterminal expr).
1022
1023The parser maintains a stack to keep track of states visited during the
1024parse. There are two basic kinds of actions in each state: "shift", which
1025reads an input symbol and pushes the corresponding next state on top of
1026the stack, and "reduce" which pops a number of states from the stack
1027(corresponding to the number of right-hand side symbols of the rule used
1028in the reduction) and consults the "goto" entries of the uncovered state
1029to find the transition corresponding to the left-hand side symbol of the
1030reduced rule.
1031
1032In each step of the parse, the parser is in a given state (the state on
1033top of its stack) and may consult the current "lookahead symbol", the
1034next symbol in the input, to determine the parse action - shift or reduce -
1035to perform. The parser terminates as soon as it reaches state 1 and reads
1036in the endmarker, indicated by the "accept" action on $end in state 1.
1037
1038Sometimes the parser may also carry out an action without inspecting the
1039current lookahead token. This is the case, e.g., in state 3 where the
1040only action is reduction by rule 4:
1041
1042	.	reduce 4
1043
1044The default action in a state can also be "error" indicating that any
1045other input represents a syntax error. (In case of such an error the
1046parser will start syntactic error recovery, as described in Section
1047`Error Handling'.)
1048
1049Now let us see how the parser responds to a given input. We consider the
1050input string 2+5*3 which is presented to the parser as the token sequence:
1051
1052   NUM + NUM * NUM
1053
1054The following table traces the corresponding actions of the parser. We also
1055show the current state in each move, and the remaining states on the stack.
1056
1057State  Stack         Lookahead  Action
1058-----  ------------  ---------  --------------------------------------------
1059
10600                    NUM        shift state 3
1061
10623      0                        reduce rule 4 (pop 1 state, uncovering state
1063                                0, then goto state 1 on symbol expr)
1064
10651      0             +          shift state 5
1066
10675      1 0           NUM        shift state 3
1068
10693      5 1 0                    reduce rule 4 (pop 1 state, uncovering state
1070                                5, then goto state 8 on symbol expr)
1071
10728      5 1 0         *          shift 4
1073
10744      8 5 1 0       NUM        shift 3
1075
10763      4 8 5 1 0                reduce rule 4 (pop 1 state, uncovering state
1077                                4, then goto state 7 on symbol expr)
1078
10797      4 8 5 1 0                reduce rule 2 (pop 3 states, uncovering state
1080                                5, then goto state 8 on symbol expr)
1081
10828      5 1 0         $end       reduce rule 1 (pop 3 states, uncovering state
1083                                0, then goto state 1 on symbol expr)
1084
10851      0             $end       accept
1086
1087It is also instructive to see how the parser responds to illegal inputs.
1088E.g., you may try to figure out what the parser does when confronted with:
1089
1090   NUM + )
1091
1092or:
1093
1094   ( NUM * NUM
1095
1096You will find that the parser, sooner or later, will always run into an
1097error action when confronted with errorneous inputs. An LALR parser will
1098never shift an invalid symbol and thus will always find syntax errors as
1099soon as it is possible during a left-to-right scan of the input.
1100
1101TP Yacc provides a debugging option (-d) that may be used to trace the
1102actions performed by the parser. When a grammar is compiled with the
1103-d option, the generated parser will print out the actions as it parses
1104its input.
1105
1106
1107Ambigious Grammars
1108------------------
1109
1110There are situations in which TP Yacc will not produce a valid parser for
1111a given input language. LALR(1) parsers are restricted to one-symbol
1112lookahead on which they have to base their parsing decisions. If a
1113grammar is ambigious, or cannot be parsed unambigiously using one-symbol
1114lookahead, TP Yacc will generate parsing conflicts when constructing the
1115parse table. There are two types of such conflicts: shift/reduce conflicts
1116(when there is both a shift and a reduce action for a given input symbol
1117in a given state), and reduce/reduce conflicts (if there is more than
1118one reduce action for a given input symbol in a given state). Note that
1119there never will be a shift/shift conflict.
1120
1121When a grammar generates parsing conflicts, TP Yacc prints out the number
1122of shift/reduce and reduce/reduce conflicts it encountered when constructing
1123the parse table. However, TP Yacc will still generate the output code for the
1124parser. To resolve parsing conflicts, TP Yacc uses the following built-in
1125disambiguating rules:
1126
1127- in a shift/reduce conflict, TP Yacc chooses the shift action.
1128
1129- in a reduce/reduce conflict, TP Yacc chooses reduction of the first
1130  grammar rule.
1131
1132The shift/reduce disambiguating rule correctly resolves a type of
1133ambiguity known as the "dangling-else ambiguity" which arises in the
1134syntax of conditional statements of many programming languages (as in
1135Pascal):
1136
1137%token IF THEN ELSE
1138%%
1139stmt : IF expr THEN stmt
1140     | IF expr THEN stmt ELSE stmt
1141     ;
1142
1143This grammar is ambigious, because a nested construct like
1144
1145   IF expr-1 THEN IF expr-2 THEN stmt-1 ELSE stmt-2
1146
1147can be parsed two ways, either as:
1148
1149   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ELSE stmt-2 )
1150
1151or as:
1152
1153   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ) ELSE stmt-2
1154
1155The first interpretation makes an ELSE belong to the last unmatched
1156IF which also is the interpretation chosen in most programming languages.
1157This is also the way that a TP Yacc-generated parser will parse the construct
1158since the shift/reduce disambiguating rule has the effect of neglecting the
1159reduction of IF expr-2 THEN stmt-1; instead, the parser will shift the ELSE
1160symbol which eventually leads to the reduction of IF expr-2 THEN stmt-1 ELSE
1161stmt-2.
1162
1163The reduce/reduce disambiguating rule is used to resolve conflicts that
1164arise when there is more than one grammar rule matching a given construct.
1165Such ambiguities are often caused by "special case constructs" which may be
1166given priority by simply listing the more specific rules ahead of the more
1167general ones.
1168
1169For instance, the following is an excerpt from the grammar describing the
1170input language of the UNIX equation formatter EQN:
1171
1172%right SUB SUP
1173%%
1174expr : expr SUB expr SUP expr
1175     | expr SUB expr
1176     | expr SUP expr
1177     ;
1178
1179Here, the SUB and SUP operator symbols denote sub- and superscript,
1180respectively. The rationale behind this example is that an expression
1181involving both sub- and superscript is often set differently from a
1182superscripted subscripted expression. This special case is therefore
1183caught by the first rule in the above example which causes a reduce/reduce
1184conflict with rule 3 in expressions like expr-1 SUB expr-2 SUP expr-3.
1185The conflict is resolved in favour of the first rule.
1186
1187In both cases discussed above, the ambiguities could also be eliminated
1188by rewriting the grammar accordingly (although this yields more complicated
1189and less readable grammars). This may not always be the case. Often
1190ambiguities are also caused by design errors in the grammar. Hence, if
1191TP Yacc reports any parsing conflicts when constructing the parser, you
1192should use the -v option to generate the parser description (.lst file)
1193and check whether TP Yacc resolved the conflicts correctly.
1194
1195There is one type of syntactic constructs for which one often deliberately
1196uses an ambigious grammar as a more concise representation for a language
1197that could also be specified unambigiously: the syntax of expressions.
1198For instance, the following is an unambigious grammar for simple arithmetic
1199expressions:
1200
1201%token NUM
1202
1203%%
1204
1205expr	: term
1206	| expr '+' term
1207        ;
1208
1209term	: factor
1210	| term '*' factor
1211        ;
1212
1213factor	: '(' expr ')'
1214	| NUM
1215        ;
1216
1217You may check yourself that this grammar gives * a higher precedence than
1218+ and makes both operators left-associative. The same effect can be achieved
1219with the following ambigious grammar using precedence definitions:
1220
1221%token NUM
1222%left '+'
1223%left '*'
1224%%
1225expr : expr '+' expr
1226     | expr '*' expr
1227     | '(' expr ')'
1228     | NUM
1229     ;
1230
1231Without the precedence definitions, this is an ambigious grammar causing
1232a number of shift/reduce conflicts. The precedence definitions are used
1233to correctly resolve these conflicts (conflicts resolved using precedence
1234will not be reported by TP Yacc).
1235
1236Each precedence definition introduces a new precedence level (lowest
1237precedence first) and specifies whether the corresponding operators
1238should be left-, right- or nonassociative (nonassociative operators
1239cannot be combined at all; example: relational operators in Pascal).
1240
1241TP Yacc uses precedence information to resolve shift/reduce conflicts as
1242follows. Precedences are associated with each terminal occuring in a
1243precedence definition. Furthermore, each grammar rule is given the
1244precedence of its rightmost terminal (this default choice can be
1245overwritten using a %prec tag; see below). To resolve a shift/reduce
1246conflict using precedence, both the symbol and the rule involved must
1247have been assigned precedences. TP Yacc then chooses the parse action
1248as follows:
1249
1250- If the symbol has higher precedence than the rule: shift.
1251
1252- If the rule has higher precedence than the symbol: reduce.
1253
1254- If symbol and rule have the same precedence, the associativity of the
1255  symbol determines the parse action: if the symbol is left-associative:
1256  reduce; if the symbol is right-associative: shift; if the symbol is
1257  non-associative: error.
1258
1259To give you an idea of how this works, let us consider our ambigious
1260arithmetic expression grammar (without precedences):
1261
1262%token NUM
1263%%
1264expr : expr '+' expr
1265     | expr '*' expr
1266     | '(' expr ')'
1267     | NUM
1268     ;
1269
1270This grammar generates four shift/reduce conflicts. The description
1271of state 8 reads as follows:
1272
1273state 8:
1274
1275	*** conflicts:
1276
1277	shift 4, reduce 1 on '*'
1278	shift 5, reduce 1 on '+'
1279
1280	expr : expr '+' expr _	(1)
1281	expr : expr _ '+' expr
1282	expr : expr _ '*' expr
1283
1284	'*'	shift 4
1285	'+'	shift 5
1286	$end	reduce 1
1287	')'	reduce 1
1288	.	error
1289
1290In this state, we have successfully parsed a + expression (rule 1). When
1291the next symbol is + or *, we have the choice between the reduction and
1292shifting the symbol. Using the default shift/reduce disambiguating rule,
1293TP Yacc has resolved these conflicts in favour of shift.
1294
1295Now let us assume the above precedence definition:
1296
1297   %left '+'
1298   %left '*'
1299
1300which gives * higher precedence than + and makes both operators left-
1301associative. The rightmost terminal in rule 1 is +. Hence, given these
1302precedence definitions, the first conflict will be resolved in favour
1303of shift (* has higher precedence than +), while the second one is resolved
1304in favour of reduce (+ is left-associative).
1305
1306Similar conflicts arise in state 7:
1307
1308state 7:
1309
1310	*** conflicts:
1311
1312	shift 4, reduce 2 on '*'
1313	shift 5, reduce 2 on '+'
1314
1315	expr : expr '*' expr _	(2)
1316	expr : expr _ '+' expr
1317	expr : expr _ '*' expr
1318
1319	'*'	shift 4
1320	'+'	shift 5
1321	$end	reduce 2
1322	')'	reduce 2
1323	.	error
1324
1325Here, we have successfully parsed a * expression which may be followed
1326by another + or * operator. Since * is left-associative and has higher
1327precedence than +, both conflicts will be resolved in favour of reduce.
1328
1329Of course, you can also have different operators on the same precedence
1330level. For instance, consider the following extended version of the
1331arithmetic expression grammar:
1332
1333%token NUM
1334%left '+' '-'
1335%left '*' '/'
1336%%
1337expr	: expr '+' expr
1338	| expr '-' expr
1339        | expr '*' expr
1340        | expr '/' expr
1341        | '(' expr ')'
1342        | NUM
1343        ;
1344
1345This puts all "addition" operators on the first and all "multiplication"
1346operators on the second precedence level. All operators are left-associative;
1347for instance, 5+3-2 will be parsed as (5+3)-2.
1348
1349By default, TP Yacc assigns each rule the precedence of its rightmost
1350terminal. This is a sensible decision in most cases. Occasionally, it
1351may be necessary to overwrite this default choice and explicitly assign
1352a precedence to a rule. This can be done by putting a precedence tag
1353of the form
1354
1355   %prec symbol
1356
1357at the end of the corresponding rule which gives the rule the precedence
1358of the specified symbol. For instance, to extend the expression grammar
1359with a unary minus operator, giving it highest precedence, you may write:
1360
1361%token NUM
1362%left '+' '-'
1363%left '*' '/'
1364%right UMINUS
1365%%
1366expr	: expr '+' expr
1367	| expr '-' expr
1368        | expr '*' expr
1369        | expr '/' expr
1370        | '-' expr      %prec UMINUS
1371        | '(' expr ')'
1372        | NUM
1373        ;
1374
1375Note the use of the UMINUS token which is not an actual input symbol but
1376whose sole purpose it is to give unary minus its proper precedence. If
1377we omitted the precedence tag, both unary and binary minus would have the
1378same precedence because they are represented by the same input symbol.
1379
1380
1381Error Handling
1382--------------
1383
1384Syntactic error handling is a difficult area in the design of user-friendly
1385parsers. Usually, you will not like to have the parser give up upon the
1386first occurrence of an errorneous input symbol. Instead, the parser should
1387recover from a syntax error, that is, it should try to find a place in the
1388input where it can resume the parse.
1389
1390TP Yacc provides a general mechanism to implement parsers with error
1391recovery. A special predefined "error" token may be used in grammar rules
1392to indicate positions where syntax errors might occur. When the parser runs
1393into an error action (i.e., reads an errorneous input symbol) it prints out
1394an error message and starts error recovery by popping its stack until it
1395uncovers a state in which there is a shift action on the error token. If
1396there is no such state, the parser terminates with return value 1, indicating
1397an unrecoverable syntax error. If there is such a state, the parser takes the
1398shift on the error token (pretending it has seen an imaginary error token in
1399the input), and resumes parsing in a special "error mode."
1400
1401While in error mode, the parser quietly skips symbols until it can again
1402perform a legal shift action. To prevent a cascade of error messages, the
1403parser returns to its normal mode of operation only after it has seen
1404and shifted three legal input symbols. Any additional error found after
1405the first shifted symbol restarts error recovery, but no error message
1406is printed. The TP Yacc library routine yyerrok may be used to reset the
1407parser to its normal mode of operation explicitly.
1408
1409For a simple example, consider the rule
1410
1411stmt	: error ';' { yyerrok; }
1412
1413and assume a syntax error occurs while a statement (nonterminal stmt) is
1414parsed. The parser prints an error message, then pops its stack until it
1415can shift the token error of the error rule. Proceeding in error mode, it
1416will skip symbols until it finds a semicolon, then reduces by the error
1417rule. The call to yyerrok tells the parser that we have recovered from
1418the error and that it should proceed with the normal parse. This kind of
1419"panic mode" error recovery scheme works well when statements are always
1420terminated with a semicolon. The parser simply skips the "bad" statement
1421and then resumes the parse.
1422
1423Implementing a good error recovery scheme can be a difficult task; see
1424Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic.
1425Schreiner and Friedman have developed a systematic technique to implement
1426error recovery with Yacc which I found quite useful (I used it myself
1427to implement error recovery in the TP Yacc parser); see Schreiner/Friedman
1428(1985).
1429
1430
1431Yacc Library
1432------------
1433
1434The TP Yacc library (YaccLib) unit provides some global declarations used
1435by the parser routine yyparse, and some variables and utility routines
1436which may be used to control the actions of the parser and to implement
1437error recovery. See the file yacclib.pas for a description of these
1438variables and routines.
1439
1440You can also modify the Yacc library unit (and/or the code template in the
1441yyparse.cod file) to customize TP Yacc to your target applications.
1442
1443
1444Other Features
1445--------------
1446
1447TP Yacc supports all additional language elements entitled as "Old Features
1448Supported But not Encouraged" in the UNIX manual, which are provided for
1449backward compatibility with older versions of (UNIX) Yacc:
1450
1451- literals delimited by double quotes.
1452
1453- multiple-character literals. Note that these are not treated as character
1454  sequences but represent single tokens which are given a symbolic integer
1455  code just like any other token identifier. However, they will not be
1456  declared in the output file, so you have to make sure yourself that
1457  the lexical analyzer returns the correct codes for these symbols. E.g.,
1458  you might explicitly assign token numbers by using a definition like
1459
1460     %token ':=' 257
1461
1462  at the beginning of the Yacc grammar.
1463
1464- \ may be used instead of %, i.e. \\ means %%, \left is the same as %left,
1465  etc.
1466
1467- other synonyms:
1468  %<             for %left
1469  %>             for %right
1470  %binary or %2  for %nonassoc
1471  %term or %0    for %token
1472  %=             for %prec
1473
1474- actions may also be written as = { ... } or = single-statement;
1475
1476- Turbo Pascal declarations (%{ ... %}) may be put at the beginning of the
1477  rules section. They will be treated as local declarations of the actions
1478  routine.
1479
1480
1481Implementation Restrictions
1482---------------------------
1483
1484As with TP Lex, internal table sizes and the main memory available limit the
1485complexity of source grammars that TP Yacc can handle. However, the maximum
1486table sizes provided by TP Yacc are large enough to handle quite complex
1487grammars (such as the Pascal grammar in the TP Yacc distribution). The actual
1488table sizes are shown in the statistics printed by TP Yacc when a compilation
1489is finished. The given figures are "s" (states), "i" (LR0 kernel items), "t"
1490(shift and goto transitions) and "r" (reductions).
1491
1492The default stack size of the generated parsers is yymaxdepth = 1024, as
1493declared in the TP Yacc library unit. This should be sufficient for any
1494average application, but you can change the stack size by including a
1495corresponding declaration in the definitions part of the Yacc grammar
1496(or change the value in the YaccLib unit). Note that right-recursive
1497grammar rules may increase stack space requirements, so it is a good
1498idea to use left-recursive rules wherever possible.
1499
1500
1501Differences from UNIX Yacc
1502--------------------------
1503
1504Major differences between TP Yacc and UNIX Yacc are listed below.
1505
1506- TP Yacc produces output code for Turbo Pascal, rather than for C.
1507
1508- TP Yacc does not support %union definitions. Instead, a value type is
1509  declared by specifying the type identifier itself as the tag of a %token
1510  or %type definition. TP Yacc will automatically generate an appropriate
1511  variant record type (YYSType) which is capable of holding values of any
1512  of the types used in %token and %type.
1513
1514  Type checking is very strict. If you use type definitions, then
1515  any symbol referred to in an action must have a type introduced
1516  in a type definition. Either the symbol must have been assigned a
1517  type in the definitions section, or the $<type-identifier> notation
1518  must be used. The syntax of the %type definition has been changed
1519  slightly to allow definitions of the form
1520     %type <type-identifier>
1521  (omitting the nonterminals) which may be used to declare types which
1522  are not assigned to any grammar symbol, but are used with the
1523  $<...> construct.
1524
1525- The parse tables constructed by this Yacc version are slightly greater
1526  than those constructed by UNIX Yacc, since a reduce action will only be
1527  chosen as the default action if it is the only action in the state.
1528  In difference, UNIX Yacc chooses a reduce action as the default action
1529  whenever it is the only reduce action of the state (even if there are
1530  other shift actions).
1531
1532  This solves a bug in UNIX Yacc that makes the generated parser start
1533  error recovery too late with certain types of error productions (see
1534  also Schreiner/Friedman, "Introduction to compiler construction with
1535  UNIX," 1985). Also, errors will be caught sooner in most cases where
1536  UNIX Yacc would carry out an additional (default) reduction before
1537  detecting the error.
1538
1539- Library routines are named differently from the UNIX version (e.g.,
1540  the `yyerrlab' routine takes the place of the `YYERROR' macro of UNIX
1541  Yacc), and, of course, all macros of UNIX Yacc (YYERROR, YYACCEPT, etc.)
1542  had to be implemented as procedures.
1543
1544
1545