utils/tply/tply.doc

      TP Lex and Yacc - The Compiler Writer's Tools for Turbo Pascal
      == === === ==== = === ======== ======== ===== === ===== ======

                     Version 4.1 User Manual
                     ======= === ==== ======

                         Albert Graef
                 Department of Musicinformatics
               Johannes Gutenberg-University Mainz

               ag@muwiinfa.geschichte.uni-mainz.de

                          April 1998


Introduction
============

This document describes the TP Lex and Yacc compiler generator toolset. These
tools are designed especially to help you prepare compilers and similar
programs like text processing utilities and command language interpreters with
the Turbo Pascal (TM) programming language.

TP Lex and Yacc are Turbo Pascal adaptions of the well-known UNIX (TM)
utilities Lex and Yacc, which were written by M.E. Lesk and S.C. Johnson at
Bell Laboratories, and are used with the C programming language. TP Lex and
Yacc are intended to be approximately "compatible" with these programs.
However, they are an independent development of the author, based on the
techniques described in the famous "dragon book" of Aho, Sethi and Ullman
(Aho, Sethi, Ullman: "Compilers : principles, techniques and tools," Reading
(Mass.), Addison-Wesley, 1986).

Version 4.1 of TP Lex and Yacc works with all recent flavours of Turbo/Borland
Pascal, including Delphi, and with the Free Pascal Compiler, a free Turbo
Pascal-compatible compiler which currently runs on DOS and Linux (other ports
are under development). Recent information about TP Lex/Yacc, and the sources
are available from the TPLY homepage:

   http://www.musikwissenschaft.uni-mainz.de/~ag/tply

For information about the Free Pascal Compiler, please refer to:

   http://www.freepascal.org

TP Lex and Yacc, like any other tools of this kind, are not intended for
novices or casual programmers; they require extensive programming experience
as well as a thorough understanding of the principles of parser design and
implementation to be put to work successfully. But if you are a seasoned Turbo
Pascal programmer with some background in compiler design and formal language
theory, you will almost certainly find TP Lex and Yacc to be a powerful
extension of your Turbo Pascal toolset.

This manual tells you how to get started with the TP Lex and Yacc programs and
provides a short description of these programs. Some knowledge about the C
versions of Lex and Yacc will be useful, although not strictly necessary. For
further reading, you may also refer to:

- Aho, Sethi and Ullman: "Compilers : principles, techniques and tools."
  Reading (Mass.), Addison-Wesley, 1986.

- Johnson, S.C.: "Yacc - yet another compiler-compiler." CSTR-32, Bell
  Telephone Laboratories, 1974.

- Lesk, M.E.: "Lex - a lexical analyser generator." CSTR-39, Bell Telephone
  Laboratories, 1975.

- Schreiner, Friedman: "Introduction to compiler construction with UNIX."
  Prentice-Hall, 1985.

- The Unix Programmer's Manual, Sections `Lex' and `Yacc'.


Credits
-------

I would like to thank Berend de Boer (berend@pobox.com), who adapted TP Lex
and Yacc to take advantage of the large memory models in Borland Pascal 7.0
and Delphi, and Michael Van Canneyt (Michael.VanCanneyt@fys.kuleuven.ac.be),
the maintainer of the Linux version of the Free Pascal compiler, who is
responsible for the Free Pascal port. And of course thanks are due to the many
TP Lex/Yacc users all over the world for their support and comments which
helped to improve these programs.


Getting Started
---------------

Instructions on how to compile and install TP Lex and Yacc on all supported
platforms can be found in the README file contained in the distribution.

Once you have installed TP Lex and Yacc on your system, you can compile your
first TP Lex and Yacc program expr. Expr is a simple desktop calculator
program contained in the distribution, which consists of a lexical analyzer in
the TP Lex source file exprlex.l and the parser and main program in the TP
Yacc source file expr.y. To compile these programs, issue the commands

   lex exprlex
   yacc expr

That's it! You now have the Turbo Pascal sources (exprlex.pas and expr.pas)
for the expr program. Use the Turbo Pascal compiler to compile these programs
as usual:

   tpc expr

(Of course, the precise compilation command depends on the type of compiler
you are using. Thus you may have to replace tpc with bpc, dcc or dcc32,
depending on the version of the Turbo/Borland/Delphi compiler you have, and
with ppc386 for the Free Pascal compiler. If you are using TP Lex and Yacc
with Free Pascal under Linux, the corresponding commands are:

   plex exprlex
   pyacc expr
   ppc386 expr

Note that in the Linux version, the programs are named plex and pyacc to
avoid name clashes with the corresponding UNIX utilities.)

Having compiled expr.pas, you can execute the expr program and type some
expressions to see it work (terminate the program with an empty line). There
is a number of other sample TP Lex and Yacc programs (.l and .y files) in the
distribution, including a TP Yacc cross reference utility and a complete
parser for Standard Pascal.

The TP Lex and Yacc programs recognize some options which may be specified
anywhere on the command line. E.g.,

   lex -o exprlex

runs TP Lex with "DFA optimization" and

   yacc -v expr

runs TP Yacc in "verbose" mode (TP Yacc generates a readable description of
the generated parser).

The TP Lex and Yacc programs use the following default filename extensions:
- .l:   TP Lex input files
- .y:   TP Yacc input files
- .pas: TP Lex and Yacc output files

As usual, you may overwrite default filename extensions by explicitly
specifying suffixes.

If you ever forget how to run TP Lex and Yacc, you can issue the command lex
or yacc (resp. plex or pyacc) without arguments to get a short summary of the
command line syntax.


TP Lex
======

This section describes the TP Lex lexical analyzer generator.


Usage
-----

lex [options] lex-file[.l] [output-file[.pas]]


Options
-------

-v  "Verbose:" Lex generates a readable description of the generated
    lexical analyzer, written to lex-file with new extension `.lst'.

-o  "Optimize:" Lex optimizes DFA tables to produce a minimal DFA.


Description
-----------

TP Lex is a program generator that is used to generate the Turbo Pascal source
code for a lexical analyzer subroutine from the specification of an input
language by a regular expression grammar.

TP Lex parses the source grammar contained in lex-file (with default suffix
.l) and writes the constructed lexical analyzer subroutine to the specified
output-file (with default suffix .pas); if no output file is specified, output
goes to lex-file with new suffix .pas. If any errors are found during
compilation, error messages are written to the list file (lex-file with new
suffix .lst).

The generated output file contains a lexical analyzer routine, yylex,
implemented as:

  function yylex : Integer;

This routine has to be called by your main program to execute the lexical
analyzer. The return value of the yylex routine usually denotes the number
of a token recognized by the lexical analyzer (see the return routine in the
LexLib unit). At end-of-file the yylex routine normally returns 0.

The code template for the yylex routine may be found in the yylex.cod
file. This file is needed by TP Lex when it constructs the output file. It
must be present either in the current directory or in the directory from which
TP Lex was executed (TP Lex searches these directories in the indicated
order). (NB: For the Linux/Free Pascal version, the code template is searched
in some directory defined at compile-time instead of the execution path,
usually /usr/lib/fpc/lexyacc.)

The TP Lex library (LexLib) unit is required by programs using Lex-generated
lexical analyzers; you will therefore have to put an appropriate uses clause
into your program or unit that contains the lexical analyzer routine. The
LexLib unit also provides various useful utility routines; see the file
lexlib.pas for further information.


Lex Source
----------

A TP Lex program consists of three sections separated with the %% delimiter:

definitions
%%
rules
%%
auxiliary procedures

All sections may be empty. The TP Lex language is line-oriented; definitions
and rules are separated by line breaks. There is no special notation for
comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
fragments (see below).

The definitions section may contain the following elements:

- regular definitions in the format:

     name   substitution

  which serve to abbreviate common subexpressions. The {name} notation
  causes the corresponding substitution from the definitions section to
  be inserted into a regular expression. The name must be a legal
  identifier (letter followed by a sequence of letters and digits;
  the underscore counts as a letter; upper- and lowercase are distinct).
  Regular definitions must be non-recursive.

- start state definitions in the format:

     %start name ...

  which are used in specifying start conditions on rules (described
  below). The %start keyword may also be abbreviated as %s or %S.

- Turbo Pascal declarations enclosed between %{ and %}. These will be
  inserted into the output file (at global scope). Also, any line that
  does not look like a Lex definition (e.g., starts with blank or tab)
  will be treated as Turbo Pascal code. (In particular, this also allows
  you to include Turbo Pascal comments in your Lex program.)

The rules section of a TP Lex program contains the actual specification of
the lexical analyzer routine. It may be thought of as a big CASE statement
discriminating over the different patterns to be matched and listing the
corresponding statements (actions) to be executed. Each rule consists of a
regular expression describing the strings to be matched in the input, and a
corresponding action, a Turbo Pascal statement to be executed when the
expression matches. Expression and statement are delimited with whitespace
(blanks and/or tabs). Thus the format of a Lex grammar rule is:

   expression      statement;

Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use begin ... end for compound statements). The statement
may span multiple lines if the successor lines are indented with at least
one blank or tab. The action may also be replaced by the | character,
indicating that the action for this rule is the same as that for the next
one.

The TP Lex library unit provides various variables and routines which are
useful in the programming of actions. In particular, the yytext string
variable holds the text of the matched string, and the yyleng Byte variable
its length.

Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing character
classes and sequences, and operators specifying repetitions and alternatives.
The precise format of regular expressions is described in the next section.

The rules section may also start with some Turbo Pascal declarations
(enclosed in %{ %}) which are treated as local declarations of the
actions routine.

Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.


Regular Expressions
-------------------

The following table summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
c stands for a single character, s for a string, r for a regular expression,
and n,m for nonnegative integers.

expression   matches                        example
----------   ----------------------------   -------
c            any non-operator character c   a
\c           character c literally          \*
"s"          string s literally             "**"
.            any character but newline      a.*b
^            beginning of line              ^abc
$            end of line                    abc$
[s]          any character in s             [abc]
[^s]         any character not in s         [^abc]
r*           zero or more r's               a*
r+           one or more r's                a+
r?           zero or one r                  a?
r{m,n}       m to n occurrences of r        a{1,5}
r{m}         m occurrences of r             a{5}
r1r2         r1 then r2                     ab
r1|r2        r1 or r2                       a|b
(r)          r                              (a|b)
r1/r2        r1 when followed by r2         a/b
<x>r         r when in start condition x    <x>abc
---------------------------------------------------

The operators *, +, ? and {} have highest precedence, followed by
concatenation. The | operator has lowest precedence. Parentheses ()
may be used to group expressions and overwrite default precedences.
The <> and / operators may only occur once in an expression.

The usual C-like escapes are recognized:

\n     denotes newline
\r     denotes carriage return
\t     denotes tab
\b     denotes backspace
\f     denotes form feed
\NNN   denotes character no. NNN in octal base

You can also use the \ character to quote characters which would otherwise
be interpreted as operator symbols. In character classes, you may use
the - character to denote ranges of characters. For instance, [a-z]
denotes the class of all lowercase letters.

The expressions in a TP Lex program may be ambigious, i.e. there may be inputs
which match more than one rule. In such a case, the lexical analyzer prefers
the longest match and, if it still has the choice between different rules,
it picks the first of these. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules

   .   |
   \n  ;

which match "any other character" (and ignore it).

Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the / operator
is useful. For instance, the expression a/b matches a, but only if followed
by b. Note that the b does not belong to the match; rather, the lexical
analyzer, when matching an a, will look ahead in the input to see whether
it is followed by a b, before it declares that it has matched an a. Such
lookahead may be arbitrarily complex (up to the size of the LexLib input
buffer). E.g., the pattern a/.*b matches an a which is followed by a b
somewhere on the same input line. TP Lex also has a means to specify left
context which is described in the next section.


Start Conditions
----------------

TP Lex provides some features which make it possible to handle left context.
The ^ character at the beginning of a regular expression may be used to
denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.

Any rule which is prefixed with the <> construct is only valid if the lexical
analyzer is in the denoted start state. For instance, the expression <x>a
can only be matched if the lexical analyzer is in start state x. You can have
multiple start states in a rule; e.g., <x,y>a can be matched in start states
x or y.

Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the LexLib routine start. E.g., you may
write:

%start x y
%%
<x>a    start(y);
<y>b    start(x);
%%
begin
  start(x); if yylex=0 then ;
end.

Upon initialization, the lexical analyzer is put into state x. It then
proceeds in state x until it matches an a which puts it into state y.
In state y it may match a b which puts it into state x again, etc.

Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.


Lex Library
-----------

The TP Lex library (LexLib) unit provides various variables and routines
which are used by Lex-generated lexical analyzers and application programs.
It provides the input and output streams and other internal data structures
used by the lexical analyzer routine, and supplies some variables and utility
routines which may be used by actions and application programs. Refer to
the file lexlib.pas for a closer description.

You can also modify the Lex library unit (and/or the code template in the
yylex.cod file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.


Implementation Restrictions
---------------------------

Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no possibility to
change internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend on
the particular implementation (they are much larger than the defaults if TP
Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are "p" (positions, i.e. items
in the position table used to construct the DFA), "s" (DFA states) and "t"
(transitions of the generated DFA).

As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the yylex.cod code template. The transitions in each
state are stored in order. Of course it would have been more efficient to
generate a big CASE statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each state
quite small and still allows a fairly efficient access to the transition
table.

The TP Lex program has an option (-o) to optimize DFA tables. This causes a
minimal DFA to be generated, using the algorithm described in Aho, Sethi,
Ullman (1986). Although the absolute limit on the number of DFA states that TP
Lex can handle is at least 300, TP Lex poses an additional restriction (100)
on the number of states in the initial partition of the DFA optimization
algorithm. Thus, you may get a fatal `integer set overflow' message when using
the -o option even when TP Lex is able to generate an unoptimized DFA. In such
cases you will just have to be content with the unoptimized DFA. (Hopefully,
this will be fixed in a future version. Anyhow, using the merged transitions
scheme described above, TP Lex usually constructs unoptimized DFA's which are
not far from being optimal, and thus in most cases DFA optimization won't have
a great impact on DFA table sizes.)


Differences from UNIX Lex
-------------------------

Major differences between TP Lex and UNIX Lex are listed below.

- TP Lex produces output code for Turbo Pascal, rather than for C.

- Character tables (%T) are not supported; neither are any directives
  to determine internal table sizes (%p, %n, etc.).

- Library routines are named differently from the UNIX version (e.g.,
  the `start' routine takes the place of the `BEGIN' macro of UNIX
  Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
  to be implemented as procedures.

- The TP Lex library unit starts counting line numbers at 0, incrementing
  the count BEFORE a line is read (in contrast, UNIX Lex initializes
  yylineno to 1 and increments it AFTER the line end has been read). This
  is motivated by the way in which TP Lex maintains the current line,
  and will not affect your programs unless you explicitly reset the
  yylineno value (e.g., when opening a new input file). In such a case
  you should set yylineno to 0 rather than 1.


TP Yacc
=======

This section describes the TP Yacc compiler compiler.


Usage
-----

yacc [options] yacc-file[.y] [output-file[.pas]]


Options
-------

-v  "Verbose:" TP Yacc generates a readable description of the generated
    parser, written to yacc-file with new extension .lst.

-d  "Debug:" TP Yacc generates parser with debugging output.


Description
-----------

TP Yacc is a program that lets you prepare parsers from the description
of input languages by BNF-like grammars. You simply specify the grammar
for your target language, augmented with the Turbo Pascal code necessary
to process the syntactic constructs, and TP Yacc translates your grammar
into the Turbo Pascal code for a corresponding parser subroutine named
yyparse.

TP Yacc parses the source grammar contained in yacc-file (with default
suffix .y) and writes the constructed parser subroutine to the specified
output-file (with default suffix .pas); if no output file is specified,
output goes to yacc-file with new suffix .pas. If any errors are found
during compilation, error messages are written to the list file (yacc-file
with new suffix .lst).

The generated parser routine, yyparse, is declared as:

   function yyparse : Integer;

This routine may be called by your main program to execute the parser.
The return value of the yyparse routine denotes success or failure of
the parser (possible return values: 0 = success, 1 = unrecoverable syntax
error or parse stack overflow).

Similar to TP Lex, the code template for the yyparse routine may be found in
the yyparse.cod file. The rules for locating this file are analogous to those
of TP Lex (see Section `TP Lex').

The TP Yacc library (YaccLib) unit is required by programs using Yacc-
generated parsers; you will therefore have to put an appropriate uses clause
into your program or unit that contains the parser routine. The YaccLib unit
also provides some routines which may be used to control the actions of the
parser. See the file yacclib.pas for further information.


Yacc Source
-----------

A TP Yacc program consists of three sections separated with the %% delimiter:

definitions
%%
rules
%%
auxiliary procedures


The TP Yacc language is free-format: whitespace (blanks, tabs and newlines)
is ignored, except if it serves as a delimiter. Comments have the C-like
format /* ... */. They are treated as whitespace. Grammar symbols are denoted
by identifiers which have the usual form (letter, including underscore,
followed by a sequence of letters and digits; upper- and lowercase is
distinct). The TP Yacc language also has some keywords which always start
with the % character. Literals are denoted by characters enclosed in single
quotes. The usual C-like escapes are recognized:

\n     denotes newline
\r     denotes carriage return
\t     denotes tab
\b     denotes backspace
\f     denotes form feed
\NNN   denotes character no. NNN in octal base


Definitions
-----------

The first section of a TP Yacc grammar serves to define the symbols used in
the grammar. It may contain the following types of definitions:

- start symbol definition: A definition of the form

     %start symbol

  declares the start nonterminal of the grammar (if this definition is
  omitted, TP Yacc assumes the left-hand side nonterminal of the first
  grammar rule as the start symbol of the grammar).

- terminal definitions: Definitions of the form

     %token symbol ...

  are used to declare the terminal symbols ("tokens") of the target
  language. Any identifier not introduced in a %token definition will
  be treated as a nonterminal symbol.

  As far as TP Yacc is concerned, tokens are atomic symbols which do not
  have an innert structure. A lexical analyzer must be provided which
  takes on the task of tokenizing the input stream and return the
  individual tokens and literals to the parser (see Section `Lexical
  Analysis').

- precedence definitions: Operator symbols (terminals) may be associated
  with a precedence by means of a precedence definition which may have
  one of the following forms

     %left symbol ...
     %right symbol ...
     %nonassoc symbol ...

  which are used to declare left-, right- and nonassociative operators,
  respectively. Each precedence definition introduces a new precedence
  level, lowest precedence first. E.g., you may write:

     %nonassoc '<' '>' '=' GEQ LEQ NEQ  /* relational operators */
     %left     '+' '-'  OR              /* addition operators */
     %left     '*' '/' AND              /* multiplication operators */
     %right    NOT UMINUS               /* unary operators */

  A terminal identifier introduced in a precedence definition may, but
  need not, appear in a %token definition as well.

- type definitions: Any (terminal or nonterminal) grammar symbol may be
  associated with a type identifier which is used in the processing of
  semantic values. Type tags of the form <name> may be used in token and
  precedence definitions to declare the type of a terminal symbol, e.g.:

     %token <Real>  NUM
     %left  <AddOp> '+' '-'

  To declare the type of a nonterminal symbol, use a type definition of
  the form:

     %type <name> symbol ...

  e.g.:

     %type <Real> expr

  In a %type definition, you may also omit the nonterminals, i.e. you
  may write:

     %type <name>

  This is useful when a given type is only used with type casts (see
  Section `Grammar Rules and Actions'), and is not associated with a
  specific nonterminal.

- Turbo Pascal declarations: You may also include arbitrary Turbo Pascal
  code in the definitions section, enclosed in %{ %}. This code will be
  inserted as global declarations into the output file, unchanged.


Grammar Rules and Actions
-------------------------

The second part of a TP Yacc grammar contains the grammar rules for the
target language. Grammar rules have the format

   name : symbol ... ;

The left-hand side of a rule must be an identifier (which denotes a
nonterminal symbol). The right-hand side may be an arbitrary (possibly
empty) sequence of nonterminal and terminal symbols (including literals
enclosed in single quotes). The terminating semicolon may also be omitted.
Different rules for the same left-hand side symbols may be written using
the | character to separate the different alternatives:

   name : symbol ...
        | symbol ...
        ...
        ;

For instance, to specify a simple grammar for arithmetic expressions, you
may write:

%left '+' '-'
%left '*' '/'
%token NUM
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '(' expr ')'
     | NUM
     ;

(The %left definitions at the beginning of the grammar are needed to specify
the precedence and associativity of the operator symbols. This will be
discussed in more detail in Section `Ambigious Grammars'.)

Grammar rules may contain actions - Turbo Pascal statements enclosed in
{ } - to be executed as the corresponding rules are recognized. Furthermore,
rules may return values, and access values returned by other rules. These
"semantic" values are written as $$ (value of the left-hand side nonterminal)
and $i (value of the ith right-hand side symbol). They are kept on a special
value stack which is maintained automatically by the parser.

Values associated with terminal symbols must be set by the lexical analyzer
(more about this in Section `Lexical Analysis'). Actions of the form $$ := $1
can frequently be omitted, since it is the default action assumed by TP Yacc
for any rule that does not have an explicit action.

By default, the semantic value type provided by Yacc is Integer. You can
also put a declaration like

   %{
   type YYSType = Real;
   %}

into the definitions section of your Yacc grammar to change the default value
type. However, if you have different value types, the preferred method is to
use type definitions as discussed in Section `Definitions'. When such type
definitions are given, TP Yacc handles all the necessary details of the
YYSType definition and also provides a fair amount of type checking which
makes it easier to find type errors in the grammar.

For instance, we may declare the symbols NUM and expr in the example above
to be of type Real, and then use these values to evaluate an expression as
it is parsed.

%left '+' '-'
%left '*' '/'
%token <Real> NUM
%type  <Real> expr
%%
expr : expr '+' expr   { $$ := $1+$3; }
     | expr '-' expr   { $$ := $1-$3; }
     | expr '*' expr   { $$ := $1*$3; }
     | expr '/' expr   { $$ := $1/$3; }
     | '(' expr ')'    { $$ := $2;    }
     | NUM
     ;

(Note that we omitted the action of the last rule. The "copy action"
$$ := $1 required by this rule is automatically added by TP Yacc.)

Actions may not only appear at the end, but also in the middle of a rule
which is useful to perform some processing before a rule is fully parsed.
Such actions inside a rule are treated as special nonterminals which are
associated with an empty right-hand side. Thus, a rule like

   x : y { action; } z

will be treated as:

  x : y $act z
  $act : { action; }

Actions inside a rule may also access values to the left of the action,
and may return values by assigning to the $$ value. The value returned
by such an action can then be accessed by other actions using the usual $i
notation. E.g., we may write:

   x : y { $$ := 2*$1; } z { $$ := $2+$3; }

which has the effect of setting the value of x to

   2*(the value of y)+(the value of z).

Sometimes it is desirable to access values in enclosing rules. This can be
done using the notation $i with i<=0. $0 refers to the first value "to the
left" of the current rule, $-1 to the second, and so on. Note that in this
case the referenced value depends on the actual contents of the parse stack,
so you have to make sure that the requested values are always where you
expect them.

There are some situations in which TP Yacc cannot easily determine the
type of values (when a typed parser is used). This is true, in particular,
for values in enclosing rules and for the $$ value in an action inside a
rule. In such cases you may use a type cast to explicitly specify the type
of a value. The format for such type casts is $<name>$ (for left-hand side
values) and $<name>i (for right-hand side values) where name is a type
identifier (which must occur in a %token, precedence or %type definition).


Auxiliary Procedures
--------------------

The third section of a TP Yacc program is optional. If it is present, it
may contain any Turbo Pascal code (such as supporting routines or a main
program) which is tacked on to the end of the output file.


Lexical Analysis
----------------

For any TP Yacc-generated parser, the programmer must supply a lexical
analyzer routine named yylex which performs the lexical analysis for
the parser. This routine must be declared as

   function yylex : Integer;

The yylex routine may either be prepared by hand, or by using the lexical
analyzer generator TP Lex (see Section `TP Lex').

The lexical analyzer must be included in your main program behind the
parser subroutine (the yyparse code template includes a forward
definition of the yylex routine such that the parser can access the
lexical analyzer). For instance, you may put the lexical analyzer
routine into the auxiliary procedures section of your TP Yacc grammar,
either directly, or by using the the Turbo Pascal include directive
($I).

The parser repeatedly calls the yylex routine to tokenize the input
stream and obtain the individual lexical items in the input. For any
literal character, the yylex routine has to return the corresponding
character code. For the other, symbolic, terminals of the input language,
the lexical analyzer must return corresponding Integer codes. These are
assigned automatically by TP Yacc in the order in which token definitions
appear in the definitions section of the source grammar. The lexical
analyzer can access these values through corresponding Integer constants
which are declared by TP Yacc in the output file.

For instance, if

   %token NUM

is the first definition in the Yacc grammar, then TP Yacc will create
a corresponding constant declaration

   const NUM = 257;

in the output file (TP Yacc automatically assigns symbolic token numbers
starting at 257; 1 thru 255 are reserved for character literals, 0 denotes
end-of-file, and 256 is reserved for the special error token which will be
discussed in Section `Error Handling'). This definition may then be used,
e.g., in a corresponding TP Lex program as follows:

   [0-9]+   return(NUM);

You can also explicitly assign token numbers in the grammar. For this
purpose, the first occurrence of a token identifier in the definitions
section may be followed by an unsigned integer. E.g. you may write:

   %token NUM 299

Besides the return value of yylex, the lexical analyzer routine may also
return an additional semantic value for the recognized token. This value
is assigned to a variable named "yylval" and may then be accessed in actions
through the $i notation (see above, Section `Grammar Rules and Actions').
The yylval variable is of type YYSType (the semantic value type, Integer
by default); its declaration may be found in the yyparse.cod file.

For instance, to assign an Integer value to a NUM token in the above
example, we may write:

   [0-9]+   begin
              val(yytext, yylval, code);
              return(NUM);
            end;

This assigns yylval the value of the NUM token (using the Turbo Pascal
standard procedure val).

If a parser uses tokens of different types (via a %token <name> definition),
then the yylval variable will not be of type Integer, but instead of a
corresponding variant record type which is capable of holding all the
different value types declared in the TP Yacc grammar. In this case, the
lexical analyzer must assign a semantic value to the corresponding record
component which is named yy<name> (where <name> stands for the corresponding
type identifier).

E.g., if token NUM is declared Real:

   %token <Real> NUM

then the value for token NUM must be assigned to yylval.yyReal.


How The Parser Works
--------------------

TP Yacc uses the LALR(1) technique developed by Donald E. Knuth and F.
DeRemer to construct a simple, efficient, non-backtracking bottom-up
parser for the source grammar. The LALR parsing technique is described
in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a
look at the parser description TP Yacc generates from a small sample
grammar, to get an idea of how the LALR parsing algorithm works. We
consider the following simplified version of the arithmetic expression
grammar:

%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr
     | expr '*' expr
     | '(' expr ')'
     | NUM
     ;

When run with the -v option on the above grammar, TP Yacc generates the
parser description listed below.

state 0:

	$accept : _ expr $end

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 1

state 1:

	$accept : expr _ $end
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	$end	accept
	'*'	shift 4
	'+'	shift 5
	.	error

state 2:

	expr : '(' _ expr ')'

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 6

state 3:

	expr : NUM _	(4)

	.	reduce 4

state 4:

	expr : expr '*' _ expr

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 7

state 5:

	expr : expr '+' _ expr

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 8

state 6:

	expr : '(' expr _ ')'
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	')'	shift 9
	'*'	shift 4
	'+'	shift 5
	.	error

state 7:

	expr : expr '*' expr _	(2)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	.	reduce 2

state 8:

	expr : expr '+' expr _	(1)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	'*'	shift 4
	$end	reduce 1
	')'	reduce 1
	'+'	reduce 1
	.	error

state 9:

	expr : '(' expr ')' _	(3)

	.	reduce 3


Each state of the parser corresponds to a certain prefix of the input
which has already been seen. The parser description lists the grammar
rules wich are parsed in each state, and indicates the portion of each
rule which has already been parsed by an underscore. In state 0, the
start state of the parser, the parsed rule is

	$accept : expr $end

This is not an actual grammar rule, but a starting rule automatically
added by TP Yacc. In general, it has the format

	$accept : X $end

where X is the start nonterminal of the grammar, and $end is a pseudo
token denoting end-of-input (the $end symbol is used by the parser to
determine when it has successfully parsed the input).

The description of the start rule in state 0,

	$accept : _ expr $end

with the underscore positioned before the expr symbol, indicates that
we are at the beginning of the parse and are ready to parse an expression
(nonterminal expr).

The parser maintains a stack to keep track of states visited during the
parse. There are two basic kinds of actions in each state: "shift", which
reads an input symbol and pushes the corresponding next state on top of
the stack, and "reduce" which pops a number of states from the stack
(corresponding to the number of right-hand side symbols of the rule used
in the reduction) and consults the "goto" entries of the uncovered state
to find the transition corresponding to the left-hand side symbol of the
reduced rule.

In each step of the parse, the parser is in a given state (the state on
top of its stack) and may consult the current "lookahead symbol", the
next symbol in the input, to determine the parse action - shift or reduce -
to perform. The parser terminates as soon as it reaches state 1 and reads
in the endmarker, indicated by the "accept" action on $end in state 1.

Sometimes the parser may also carry out an action without inspecting the
current lookahead token. This is the case, e.g., in state 3 where the
only action is reduction by rule 4:

	.	reduce 4

The default action in a state can also be "error" indicating that any
other input represents a syntax error. (In case of such an error the
parser will start syntactic error recovery, as described in Section
`Error Handling'.)

Now let us see how the parser responds to a given input. We consider the
input string 2+5*3 which is presented to the parser as the token sequence:

   NUM + NUM * NUM

The following table traces the corresponding actions of the parser. We also
show the current state in each move, and the remaining states on the stack.

State  Stack         Lookahead  Action
-----  ------------  ---------  --------------------------------------------

0                    NUM        shift state 3

3      0                        reduce rule 4 (pop 1 state, uncovering state
                                0, then goto state 1 on symbol expr)

1      0             +          shift state 5

5      1 0           NUM        shift state 3

3      5 1 0                    reduce rule 4 (pop 1 state, uncovering state
                                5, then goto state 8 on symbol expr)

8      5 1 0         *          shift 4

4      8 5 1 0       NUM        shift 3

3      4 8 5 1 0                reduce rule 4 (pop 1 state, uncovering state
                                4, then goto state 7 on symbol expr)

7      4 8 5 1 0                reduce rule 2 (pop 3 states, uncovering state
                                5, then goto state 8 on symbol expr)

8      5 1 0         $end       reduce rule 1 (pop 3 states, uncovering state
                                0, then goto state 1 on symbol expr)

1      0             $end       accept

It is also instructive to see how the parser responds to illegal inputs.
E.g., you may try to figure out what the parser does when confronted with:

   NUM + )

or:

   ( NUM * NUM

You will find that the parser, sooner or later, will always run into an
error action when confronted with errorneous inputs. An LALR parser will
never shift an invalid symbol and thus will always find syntax errors as
soon as it is possible during a left-to-right scan of the input.

TP Yacc provides a debugging option (-d) that may be used to trace the
actions performed by the parser. When a grammar is compiled with the
-d option, the generated parser will print out the actions as it parses
its input.


Ambigious Grammars
------------------

There are situations in which TP Yacc will not produce a valid parser for
a given input language. LALR(1) parsers are restricted to one-symbol
lookahead on which they have to base their parsing decisions. If a
grammar is ambigious, or cannot be parsed unambigiously using one-symbol
lookahead, TP Yacc will generate parsing conflicts when constructing the
parse table. There are two types of such conflicts: shift/reduce conflicts
(when there is both a shift and a reduce action for a given input symbol
in a given state), and reduce/reduce conflicts (if there is more than
one reduce action for a given input symbol in a given state). Note that
there never will be a shift/shift conflict.

When a grammar generates parsing conflicts, TP Yacc prints out the number
of shift/reduce and reduce/reduce conflicts it encountered when constructing
the parse table. However, TP Yacc will still generate the output code for the
parser. To resolve parsing conflicts, TP Yacc uses the following built-in
disambiguating rules:

- in a shift/reduce conflict, TP Yacc chooses the shift action.

- in a reduce/reduce conflict, TP Yacc chooses reduction of the first
  grammar rule.

The shift/reduce disambiguating rule correctly resolves a type of
ambiguity known as the "dangling-else ambiguity" which arises in the
syntax of conditional statements of many programming languages (as in
Pascal):

%token IF THEN ELSE
%%
stmt : IF expr THEN stmt
     | IF expr THEN stmt ELSE stmt
     ;

This grammar is ambigious, because a nested construct like

   IF expr-1 THEN IF expr-2 THEN stmt-1 ELSE stmt-2

can be parsed two ways, either as:

   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ELSE stmt-2 )

or as:

   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ) ELSE stmt-2

The first interpretation makes an ELSE belong to the last unmatched
IF which also is the interpretation chosen in most programming languages.
This is also the way that a TP Yacc-generated parser will parse the construct
since the shift/reduce disambiguating rule has the effect of neglecting the
reduction of IF expr-2 THEN stmt-1; instead, the parser will shift the ELSE
symbol which eventually leads to the reduction of IF expr-2 THEN stmt-1 ELSE
stmt-2.

The reduce/reduce disambiguating rule is used to resolve conflicts that
arise when there is more than one grammar rule matching a given construct.
Such ambiguities are often caused by "special case constructs" which may be
given priority by simply listing the more specific rules ahead of the more
general ones.

For instance, the following is an excerpt from the grammar describing the
input language of the UNIX equation formatter EQN:

%right SUB SUP
%%
expr : expr SUB expr SUP expr
     | expr SUB expr
     | expr SUP expr
     ;

Here, the SUB and SUP operator symbols denote sub- and superscript,
respectively. The rationale behind this example is that an expression
involving both sub- and superscript is often set differently from a
superscripted subscripted expression. This special case is therefore
caught by the first rule in the above example which causes a reduce/reduce
conflict with rule 3 in expressions like expr-1 SUB expr-2 SUP expr-3.
The conflict is resolved in favour of the first rule.

In both cases discussed above, the ambiguities could also be eliminated
by rewriting the grammar accordingly (although this yields more complicated
and less readable grammars). This may not always be the case. Often
ambiguities are also caused by design errors in the grammar. Hence, if
TP Yacc reports any parsing conflicts when constructing the parser, you
should use the -v option to generate the parser description (.lst file)
and check whether TP Yacc resolved the conflicts correctly.

There is one type of syntactic constructs for which one often deliberately
uses an ambigious grammar as a more concise representation for a language
that could also be specified unambigiously: the syntax of expressions.
For instance, the following is an unambigious grammar for simple arithmetic
expressions:

%token NUM

%%

expr	: term
	| expr '+' term
        ;

term	: factor
	| term '*' factor
        ;

factor	: '(' expr ')'
	| NUM
        ;

You may check yourself that this grammar gives * a higher precedence than
+ and makes both operators left-associative. The same effect can be achieved
with the following ambigious grammar using precedence definitions:

%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr
     | expr '*' expr
     | '(' expr ')'
     | NUM
     ;

Without the precedence definitions, this is an ambigious grammar causing
a number of shift/reduce conflicts. The precedence definitions are used
to correctly resolve these conflicts (conflicts resolved using precedence
will not be reported by TP Yacc).

Each precedence definition introduces a new precedence level (lowest
precedence first) and specifies whether the corresponding operators
should be left-, right- or nonassociative (nonassociative operators
cannot be combined at all; example: relational operators in Pascal).

TP Yacc uses precedence information to resolve shift/reduce conflicts as
follows. Precedences are associated with each terminal occuring in a
precedence definition. Furthermore, each grammar rule is given the
precedence of its rightmost terminal (this default choice can be
overwritten using a %prec tag; see below). To resolve a shift/reduce
conflict using precedence, both the symbol and the rule involved must
have been assigned precedences. TP Yacc then chooses the parse action
as follows:

- If the symbol has higher precedence than the rule: shift.

- If the rule has higher precedence than the symbol: reduce.

- If symbol and rule have the same precedence, the associativity of the
  symbol determines the parse action: if the symbol is left-associative:
  reduce; if the symbol is right-associative: shift; if the symbol is
  non-associative: error.

To give you an idea of how this works, let us consider our ambigious
arithmetic expression grammar (without precedences):

%token NUM
%%
expr : expr '+' expr
     | expr '*' expr
     | '(' expr ')'
     | NUM
     ;

This grammar generates four shift/reduce conflicts. The description
of state 8 reads as follows:

state 8:

	*** conflicts:

	shift 4, reduce 1 on '*'
	shift 5, reduce 1 on '+'

	expr : expr '+' expr _	(1)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	'*'	shift 4
	'+'	shift 5
	$end	reduce 1
	')'	reduce 1
	.	error

In this state, we have successfully parsed a + expression (rule 1). When
the next symbol is + or *, we have the choice between the reduction and
shifting the symbol. Using the default shift/reduce disambiguating rule,
TP Yacc has resolved these conflicts in favour of shift.

Now let us assume the above precedence definition:

   %left '+'
   %left '*'

which gives * higher precedence than + and makes both operators left-
associative. The rightmost terminal in rule 1 is +. Hence, given these
precedence definitions, the first conflict will be resolved in favour
of shift (* has higher precedence than +), while the second one is resolved
in favour of reduce (+ is left-associative).

Similar conflicts arise in state 7:

state 7:

	*** conflicts:

	shift 4, reduce 2 on '*'
	shift 5, reduce 2 on '+'

	expr : expr '*' expr _	(2)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	'*'	shift 4
	'+'	shift 5
	$end	reduce 2
	')'	reduce 2
	.	error

Here, we have successfully parsed a * expression which may be followed
by another + or * operator. Since * is left-associative and has higher
precedence than +, both conflicts will be resolved in favour of reduce.

Of course, you can also have different operators on the same precedence
level. For instance, consider the following extended version of the
arithmetic expression grammar:

%token NUM
%left '+' '-'
%left '*' '/'
%%
expr	: expr '+' expr
	| expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | '(' expr ')'
        | NUM
        ;

This puts all "addition" operators on the first and all "multiplication"
operators on the second precedence level. All operators are left-associative;
for instance, 5+3-2 will be parsed as (5+3)-2.

By default, TP Yacc assigns each rule the precedence of its rightmost
terminal. This is a sensible decision in most cases. Occasionally, it
may be necessary to overwrite this default choice and explicitly assign
a precedence to a rule. This can be done by putting a precedence tag
of the form

   %prec symbol

at the end of the corresponding rule which gives the rule the precedence
of the specified symbol. For instance, to extend the expression grammar
with a unary minus operator, giving it highest precedence, you may write:

%token NUM
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
expr	: expr '+' expr
	| expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | '-' expr      %prec UMINUS
        | '(' expr ')'
        | NUM
        ;

Note the use of the UMINUS token which is not an actual input symbol but
whose sole purpose it is to give unary minus its proper precedence. If
we omitted the precedence tag, both unary and binary minus would have the
same precedence because they are represented by the same input symbol.


Error Handling
--------------

Syntactic error handling is a difficult area in the design of user-friendly
parsers. Usually, you will not like to have the parser give up upon the
first occurrence of an errorneous input symbol. Instead, the parser should
recover from a syntax error, that is, it should try to find a place in the
input where it can resume the parse.

TP Yacc provides a general mechanism to implement parsers with error
recovery. A special predefined "error" token may be used in grammar rules
to indicate positions where syntax errors might occur. When the parser runs
into an error action (i.e., reads an errorneous input symbol) it prints out
an error message and starts error recovery by popping its stack until it
uncovers a state in which there is a shift action on the error token. If
there is no such state, the parser terminates with return value 1, indicating
an unrecoverable syntax error. If there is such a state, the parser takes the
shift on the error token (pretending it has seen an imaginary error token in
the input), and resumes parsing in a special "error mode."

While in error mode, the parser quietly skips symbols until it can again
perform a legal shift action. To prevent a cascade of error messages, the
parser returns to its normal mode of operation only after it has seen
and shifted three legal input symbols. Any additional error found after
the first shifted symbol restarts error recovery, but no error message
is printed. The TP Yacc library routine yyerrok may be used to reset the
parser to its normal mode of operation explicitly.

For a simple example, consider the rule

stmt	: error ';' { yyerrok; }

and assume a syntax error occurs while a statement (nonterminal stmt) is
parsed. The parser prints an error message, then pops its stack until it
can shift the token error of the error rule. Proceeding in error mode, it
will skip symbols until it finds a semicolon, then reduces by the error
rule. The call to yyerrok tells the parser that we have recovered from
the error and that it should proceed with the normal parse. This kind of
"panic mode" error recovery scheme works well when statements are always
terminated with a semicolon. The parser simply skips the "bad" statement
and then resumes the parse.

Implementing a good error recovery scheme can be a difficult task; see
Aho/Sethi/Ullman (1986) for a more comprehensive treatment of this topic.
Schreiner and Friedman have developed a systematic technique to implement
error recovery with Yacc which I found quite useful (I used it myself
to implement error recovery in the TP Yacc parser); see Schreiner/Friedman
(1985).


Yacc Library
------------

The TP Yacc library (YaccLib) unit provides some global declarations used
by the parser routine yyparse, and some variables and utility routines
which may be used to control the actions of the parser and to implement
error recovery. See the file yacclib.pas for a description of these
variables and routines.

You can also modify the Yacc library unit (and/or the code template in the
yyparse.cod file) to customize TP Yacc to your target applications.


Other Features
--------------

TP Yacc supports all additional language elements entitled as "Old Features
Supported But not Encouraged" in the UNIX manual, which are provided for
backward compatibility with older versions of (UNIX) Yacc:

- literals delimited by double quotes.

- multiple-character literals. Note that these are not treated as character
  sequences but represent single tokens which are given a symbolic integer
  code just like any other token identifier. However, they will not be
  declared in the output file, so you have to make sure yourself that
  the lexical analyzer returns the correct codes for these symbols. E.g.,
  you might explicitly assign token numbers by using a definition like

     %token ':=' 257

  at the beginning of the Yacc grammar.

- \ may be used instead of %, i.e. \\ means %%, \left is the same as %left,
  etc.

- other synonyms:
  %<             for %left
  %>             for %right
  %binary or %2  for %nonassoc
  %term or %0    for %token
  %=             for %prec

- actions may also be written as = { ... } or = single-statement;

- Turbo Pascal declarations (%{ ... %}) may be put at the beginning of the
  rules section. They will be treated as local declarations of the actions
  routine.


Implementation Restrictions
---------------------------

As with TP Lex, internal table sizes and the main memory available limit the
complexity of source grammars that TP Yacc can handle. However, the maximum
table sizes provided by TP Yacc are large enough to handle quite complex
grammars (such as the Pascal grammar in the TP Yacc distribution). The actual
table sizes are shown in the statistics printed by TP Yacc when a compilation
is finished. The given figures are "s" (states), "i" (LR0 kernel items), "t"
(shift and goto transitions) and "r" (reductions).

The default stack size of the generated parsers is yymaxdepth = 1024, as
declared in the TP Yacc library unit. This should be sufficient for any
average application, but you can change the stack size by including a
corresponding declaration in the definitions part of the Yacc grammar
(or change the value in the YaccLib unit). Note that right-recursive
grammar rules may increase stack space requirements, so it is a good
idea to use left-recursive rules wherever possible.


Differences from UNIX Yacc
--------------------------

Major differences between TP Yacc and UNIX Yacc are listed below.

- TP Yacc produces output code for Turbo Pascal, rather than for C.

- TP Yacc does not support %union definitions. Instead, a value type is
  declared by specifying the type identifier itself as the tag of a %token
  or %type definition. TP Yacc will automatically generate an appropriate
  variant record type (YYSType) which is capable of holding values of any
  of the types used in %token and %type.

  Type checking is very strict. If you use type definitions, then
  any symbol referred to in an action must have a type introduced
  in a type definition. Either the symbol must have been assigned a
  type in the definitions section, or the $<type-identifier> notation
  must be used. The syntax of the %type definition has been changed
  slightly to allow definitions of the form
     %type <type-identifier>
  (omitting the nonterminals) which may be used to declare types which
  are not assigned to any grammar symbol, but are used with the
  $<...> construct.

- The parse tables constructed by this Yacc version are slightly greater
  than those constructed by UNIX Yacc, since a reduce action will only be
  chosen as the default action if it is the only action in the state.
  In difference, UNIX Yacc chooses a reduce action as the default action
  whenever it is the only reduce action of the state (even if there are
  other shift actions).

  This solves a bug in UNIX Yacc that makes the generated parser start
  error recovery too late with certain types of error productions (see
  also Schreiner/Friedman, "Introduction to compiler construction with
  UNIX," 1985). Also, errors will be caught sooner in most cases where
  UNIX Yacc would carry out an additional (default) reduction before
  detecting the error.

- Library routines are named differently from the UNIX version (e.g.,
  the `yyerrlab' routine takes the place of the `YYERROR' macro of UNIX
  Yacc), and, of course, all macros of UNIX Yacc (YYERROR, YYACCEPT, etc.)
  had to be implemented as procedures.