tutorial_episode_7.pod - OpenGrok cross reference for /dports/lang/parrot/parrot-8.1.0/examples/languages/squaak/doc/tutorial_episode_7.pod

# Copyright (C) 2008-2014, Parrot Foundation.
=encoding utf8

=head1 DESCRIPTION

This is the seventh episode in a tutorial series on building a compiler with
the Parrot Compiler Tools.

=head1 Episode 7: Operators and Precedence

Up till now, we've implemented a great deal of the Squaak language. We've seen
assignments, control-flow statements, variable declarations and scope,
subroutines and invocation. Our expressions have been limited so far to singular
values, such as string literals and integer constants. In this episode, we'll
enhance Squaak so it can handle operators, so you can construct more complex
expressions.

=head2 Operators, precedence and parse trees

We will first briefly introduce the problem with recursive-descent parsers
(which parsers generated with NQP are) when parsing expressions. Consider
the following mini-grammar, which is a very basic calculator.

 rule TOP {
     <expression>*
 }

 rule expression {
     <term>
 }

 rule term {
     <factor> [ <addop> <factor> ]*
 }

 token addop { '+' | '-' }

 rule factor {
     <value> [ <mulop> <value> ]*
 }

 token mulop { '*' | '/' | '%' }

 rule value{
     | <number>
     | '(' <expression> ')'
 }

This basic expression grammar implements operator precedence by taking
advantage of the nature of a recursive-descent parser (if you haven't seen
the word, google it). However, the big disadvantage of parsing expressions
this way, is that the parse trees can become quite large. Perhaps more
importantly, the parsing process is not very efficient. Let's take a look at
some sample input. We won't show the parse trees as shown in Episode 2, but
we'll just show an outline.

 input: 42 results in this parse tree:

 TOP
   expression
     term
       factor
         value
           number
             42

As you can see, the input of this single number will invoke 6 grammar rules

before parsing the actual digits. Not that bad, you might think.

 input: "1 + 2" results in this parse tree (we ignore the operator for now):

 TOP
   expression
     term
       factor
       | value
       |   number
       |     1
       factor
         value
           number
             2

Only a few more grammar rules are invoked, not really a problem either.

 input: "(1 + 2) * 3" results in this parse tree:

 TOP
   expression
     term
       factor
         value
         | expression
         |   term
         |   | factor
         |   |   value
         |   |     number
         |   |       1
         |   term
         |     factor
         |       value
         |         number
         |           2
         value
           number
             3

Right; 16 grammar rules just to parse this simple input. I'd call this slightly
inefficient. The point is, implementing operator precedence using a
recursive-descent parser is somewhat problematic, and given the fact there
are better methods to parse expressions like these, not the way to go. Check
out this nice explanation or google it.

=head2 Bottom-up parsing and stacks: operator tables

I would like to explain to you how bottom-up parsing works for expressions
(or bottom-up parsers in general; Yacc/Bison are parser generators that generate
bottom-up parsers for your grammar specification), taking operator precedence
into account. However, it's been about 6 years that I did this in a CS class,
and I don't remember the particular details. If you really want to know, check
out the links at the end of the previous section. It's actually worth checking
out. For now, I'll just assume you know what the problem is, so that I'll
introduce the solution for NQP-based compilers immediately.
At some point when parsing your input, you might encounter an expression. At
this point, we'd like the parser to switch from top-down to bottom-up parsing.
NQP-rx supports this, and is used as follows:

 <EXPR>

Of course, the optable must be populated with some operators that
we need to be able to parse and it might be told what precedence and associativity they have. The
easiest way to do this is by setting up precedence levels in an C<INIT> block:

    INIT {
        Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive');
        Squaak::Grammar.O(':prec<u>, :assoc<lefT>', '%multiplicative');
    }

In this C<INIT> block, we use the C<O> method of the compiler to set up two precedence levels: one
for operators like addition (named C<%additive>), and one for operators like multiplication (named
C<%multiplicative>). Each of them has a ":prec" value and an ":assoc" value. ":prec" determines the
precedence. Lexicographically greater values indicate higher precedence, so C<%additive> operators,
with a precedence value of "t", have lower precedence than C<%multiplicative> operators with a
precedence value of "u".":assoc" defines the associativity of the operators. If C<@> is a left
associative operator, then 1 @ 2 @ 3 is equivalent to (1 @ 2) @ 3. However, if C<@> is right
associative, then 1 @ 2 @ 3 is equivalent to 1 @ (2 @ 3). There are other options for the
associativity, but we'll discuss them as we come to them.

    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }

This defines the operator C<*> (the C<infix:> is a prefix that tells the
operator parser that this operator is an infix operator; there are other types,
such as prefix, postfix and others). As you can see, it uses the O rule to specify that it is part
of the C<%multiplicative> group of operators. The ":pirop" value specifies that the operator should
compile to the C<mul> PIR opcode.

Of course, the expression parser does not just parse operators, it must also
parse the operands. So, how do we declare the most basic entity that represents
an operand? It can be anything, from a basic integer-constant, a function call,
or even a function definition (but adding two function definition doesn't
really make sense, does it?). The operands are parsed in a recursive-descent
fashion, so somewhere the parser must switch back from bottom-up
(expression parsing) to top-down. This "switch-back" point is the proto token C<term>. This is the
reason why integer constants are parsed by the rule term:sym<integer_constant>, for example, in our
grammar.

The C<term> proto token is
invoked every time a new operand is needed

=head2 Squaak Operators

We have defined the entry and exit point of the expression (bottom-up) parser,
now it's time to add the operators. Let's have a look at Squaak's operators and
their precedence. The operators are listed with decreasing precedence (so that
high-precedence operators are listed at the top). (I'm not sure if this
precedence table is common compared to other languages; some operators may have
a different precedence w.r.t. other operators than you're used to. At least the
mathematical operators are organized according to standard math rules).

 unary "-"
 unary "not"
 * / %
 + - ..
 < <= >= > != ==
 and
 or

(".." is the string concatenation operator). Besides defining an entry and exit
point for the expression parser, you need to define precedence levels for your operators. Find the
C<INIT> block in Grammar.pm below the "## Operators" comment, and replace it with this:

    INIT {
        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
    }

Now, we need to define the actual operators:

    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }
    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
    token infix:sym«<» { <sym> <O('%relational, :pirop<islt>')> }
    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }
    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }
    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }
    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }

Note that some operators are missing. See the exercises section for this.

=head2 Short-circuiting logical operators

Squaak has two logical operators: C<and> and C<or>; and results true if and
only if both operands evaluate to true, while or results true if at least one
of its operands evaluates to true. Both operands are short-circuited, which
means that they don't evaluate both operands if that's unnecessary. For
instance, if the first operand of the and operator evaluates to false, then
there's no need to evaluate the second operand, as the final result of the
and-expression cannot become true anymore (remember: both operands must
evaluate to true).Let's think about how to implement this. When evaluating an
and-expression, we first evaluate the first operand, and if it's true, only
then does it make sense to evaluate the second operand. This behavior looks
very much the same as an if-statement, doesn't it? In an if-statement, the
first child is always evaluated, and if true, the second child
(the C<then> block) is evaluated (remember, the third child -- the C<else>
clause -- is optional). It would be great to be able to implement the and
operator using a C<PAST::Op( :pasttype('if') )> node. Well, you can, using
the ":pasttype" option! Here's how:

    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }

So what about the or operator? When evaluating an or-expression, the first
operand is evaluated. If it evaluates to true, then there's no need to evaluate
the second operand, as the result of the or-expression is already true! Only if
the first operand evaluates to false, is it necessary to evaluate the second
child. Mmmmm.... what we're saying here is, unless the first operand evaluates
to true, evaluate the second child. Guess what pasttype you'd need for that!

=head2 Operators PAST types and PIR instructions

In the previous section, we introduced the C<pasttype> clause that you can
specify. This means that for that operator (for instance, the C<and> operator
we discussed), a C<PAST::Op( :pasttype('if') )> node is created. What happens
if you don't specify a pasttype? In that case, the corresponding action method is called. Obviously,
some languages have very exotic semantics for the C<+> operator,
but many languages just want to use Parrot's built-in C<add> instruction. How
do we achieve that?

Instead of adding a C<pasttype> clause, specify a C<pirop> clause. The
C<pirop>, or I<PIR operator>, clause tells the code generator what operator
should be generated. Instead of generating a subroutine invocation with the
operands as arguments, it will generate the specified instruction with the
operator's operands as arguments. Neat huh? Let's look at an example:

    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }

This specifies to use the C<add> instruction, which tells Parrot to create a
new result object instead of changing one of the operands. PCT
just emits the following for this:

 add $P12, $P10, $P11

which means that the PMCs in registers C<$P10> and C<$P11> are added, and
assigned to a newly created PMC which is stored in register C<$P12>.

=head2 To circumfix or not to circumfix

Squaak supports parenthesized expressions. Parentheses can be used to change
the order of evaluation in an expression, just as you probably have seen in other languages. Besides
infix, prefix and postfix operators, you can define circumfix operators, which is specified with the
left and right delimiter. This is an ideal way to implement parenthesized expressions:

    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }

    # with the action method:
    method circumfix:sym<( )> { make $<EXPR>.ast; }

This rule and action method were generated for us when we ran mk_language_shell.pl; you don't need
to add them to the grammar and actions yourself. Circumfix operators are treated as terms by the
operator-precedence parser, so it will parse as we want it to automatically.

=head2 Expression parser's action method

For all grammar rules we introduced, we also introduced an action method that
is invoked after the grammar rule was done matching. What about the action
method for EXPR? Our Squaak::Actions class inherits that from HLL::Actions. We don't have to write
one.

=head2 What's Next?

This episode covered the implementation of operators, which allows us to write
complex expressions. By now, most of our language is implemented, except for
one thing: aggregate data structures. This will be the topic of Episode 8. We
will introduce the two aggregate data types: array and hashtables, and see how
we can implement these. We'll also discuss what happens when we pass such
aggregates as subroutine arguments, and the difference with the basic data
types.

=head2 Exercises

=over 4

=item *

Currently, Squaak only has grammar rules for integer and string constants, not
floating point constants. Implement this grammar rule. A floating-point number
consists of zero or more digits, followed by a dot and at least one digit, or,
at least one digit followed by a dot and any number of digits. Examples are:

 42.0, 1., .0001.

There may be no whitespace between the individual digits and the dot. Make sure
you understand the difference between a "rule" and a "token".

Hint: a floating-point constant should produce a value of type 'Float'.

Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which
matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As
a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match.
Since the part of a floating-point constant before the decimal place is the same as an integer
constant, unless the token for floating-point constants has a longer name than the token for
integer-constants, the latter will match and a syntax error will result.

=item *

Implement the missing operators: (binary) "-", "<=", ">=", "==", "!=", "/",
"%", "or"

=back

=head2 References

docs/pct/pct_optable_guide.pod

=head2 Solution to the exercises

=over 4

=item 1

A floating-point number consists of zero or more digits, followed by a dot
and at least one digit, or, at least one digit followed by a dot and any
number of digits. Examples are: 42.0, 1., .0001. There may be no whitespace
between the individual digits and the dot. Make sure you understand the
difference between a C<rule> and a C<token>.

Hint: a floating-point constant should produce a value of type 'Float'.

Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which
matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As
a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match.
Since the part of a floating-point constant before the decimal place is the same as an integer
constant, unless the token for floating-point constants has a longer name than the token for
integer-constants, the latter will match and a syntax error will result.

    token term:sym<float_constant_long> { # longer to work around lack of LTM
        [
        | \d+ '.' \d*
        | \d* '.' \d+
        ]
    }

    # with action method:
    method term:sym<float_constant_long>($/) { # name worksaround lack of LTM
        make PAST::Val.new(:value(+$/), :returns<Float>);
    }

=item 2

For sake of completeness (and easy copy-paste for you), here's the list of
operator declarations as I wrote them for Squaak:

    INIT {
        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
    }

    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }

    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }
    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }

    token infix:sym<*>  { <sym> <O('%multiplicative, :pirop<mul>')> }
    token infix:sym<%>  { <sym> <O('%multiplicative, :pirop<mod>')> }
    token infix:sym</>  { <sym> <O('%multiplicative, :pirop<div>')> }

    token infix:sym<+>  { <sym> <O('%additive, :pirop<add>')> }
    token infix:sym<->  { <sym> <O('%additive, :pirop<sub>')> }
    token infix:sym<..> { <sym> <O('%additive, :pirop<concat>')> }

    token infix:sym«<» { <sym> <O('%relational, :pirop<isle iPP>')> }
    token infix:sym«<=» { <sym> <O('%relational, :pirop<islt iPP>')> }
    token infix:sym«>» { <sym> <O('%relational, :pirop<isgt iPP>')> }
    token infix:sym«>=» { <sym> <O('%relational, :pirop<isge iPP>')> }
    token infix:sym«==» { <sym> <O('%relational, :pirop<iseq iPP>')> }
    token infix:sym«!=» { <sym> <O('%relational, :pirop<isne iPP>')> }

    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }

=back

=cut