1# Copyright (C) 2008-2014, Parrot Foundation.
2=encoding utf8
3
4=head1 DESCRIPTION
5
6This is the seventh episode in a tutorial series on building a compiler with
7the Parrot Compiler Tools.
8
9=head1 Episode 7: Operators and Precedence
10
11Up till now, we've implemented a great deal of the Squaak language. We've seen
12assignments, control-flow statements, variable declarations and scope,
13subroutines and invocation. Our expressions have been limited so far to singular
14values, such as string literals and integer constants. In this episode, we'll
15enhance Squaak so it can handle operators, so you can construct more complex
16expressions.
17
18=head2 Operators, precedence and parse trees
19
20We will first briefly introduce the problem with recursive-descent parsers
21(which parsers generated with NQP are) when parsing expressions. Consider
22the following mini-grammar, which is a very basic calculator.
23
24 rule TOP {
25     <expression>*
26 }
27
28 rule expression {
29     <term>
30 }
31
32 rule term {
33     <factor> [ <addop> <factor> ]*
34 }
35
36 token addop { '+' | '-' }
37
38 rule factor {
39     <value> [ <mulop> <value> ]*
40 }
41
42 token mulop { '*' | '/' | '%' }
43
44 rule value{
45     | <number>
46     | '(' <expression> ')'
47 }
48
49This basic expression grammar implements operator precedence by taking
50advantage of the nature of a recursive-descent parser (if you haven't seen
51the word, google it). However, the big disadvantage of parsing expressions
52this way, is that the parse trees can become quite large. Perhaps more
53importantly, the parsing process is not very efficient. Let's take a look at
54some sample input. We won't show the parse trees as shown in Episode 2, but
55we'll just show an outline.
56
57 input: 42 results in this parse tree:
58
59 TOP
60   expression
61     term
62       factor
63         value
64           number
65             42
66
67As you can see, the input of this single number will invoke 6 grammar rules
68
69before parsing the actual digits. Not that bad, you might think.
70
71 input: "1 + 2" results in this parse tree (we ignore the operator for now):
72
73 TOP
74   expression
75     term
76       factor
77       | value
78       |   number
79       |     1
80       factor
81         value
82           number
83             2
84
85Only a few more grammar rules are invoked, not really a problem either.
86
87 input: "(1 + 2) * 3" results in this parse tree:
88
89 TOP
90   expression
91     term
92       factor
93         value
94         | expression
95         |   term
96         |   | factor
97         |   |   value
98         |   |     number
99         |   |       1
100         |   term
101         |     factor
102         |       value
103         |         number
104         |           2
105         value
106           number
107             3
108
109Right; 16 grammar rules just to parse this simple input. I'd call this slightly
110inefficient. The point is, implementing operator precedence using a
111recursive-descent parser is somewhat problematic, and given the fact there
112are better methods to parse expressions like these, not the way to go. Check
113out this nice explanation or google it.
114
115=head2 Bottom-up parsing and stacks: operator tables
116
117I would like to explain to you how bottom-up parsing works for expressions
118(or bottom-up parsers in general; Yacc/Bison are parser generators that generate
119bottom-up parsers for your grammar specification), taking operator precedence
120into account. However, it's been about 6 years that I did this in a CS class,
121and I don't remember the particular details. If you really want to know, check
122out the links at the end of the previous section. It's actually worth checking
123out. For now, I'll just assume you know what the problem is, so that I'll
124introduce the solution for NQP-based compilers immediately.
125At some point when parsing your input, you might encounter an expression. At
126this point, we'd like the parser to switch from top-down to bottom-up parsing.
127NQP-rx supports this, and is used as follows:
128
129 <EXPR>
130
131Of course, the optable must be populated with some operators that
132we need to be able to parse and it might be told what precedence and associativity they have. The
133easiest way to do this is by setting up precedence levels in an C<INIT> block:
134
135    INIT {
136        Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive');
137        Squaak::Grammar.O(':prec<u>, :assoc<lefT>', '%multiplicative');
138    }
139
140In this C<INIT> block, we use the C<O> method of the compiler to set up two precedence levels: one
141for operators like addition (named C<%additive>), and one for operators like multiplication (named
142C<%multiplicative>). Each of them has a ":prec" value and an ":assoc" value. ":prec" determines the
143precedence. Lexicographically greater values indicate higher precedence, so C<%additive> operators,
144with a precedence value of "t", have lower precedence than C<%multiplicative> operators with a
145precedence value of "u".":assoc" defines the associativity of the operators. If C<@> is a left
146associative operator, then 1 @ 2 @ 3 is equivalent to (1 @ 2) @ 3. However, if C<@> is right
147associative, then 1 @ 2 @ 3 is equivalent to 1 @ (2 @ 3). There are other options for the
148associativity, but we'll discuss them as we come to them.
149
150    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }
151
152This defines the operator C<*> (the C<infix:> is a prefix that tells the
153operator parser that this operator is an infix operator; there are other types,
154such as prefix, postfix and others). As you can see, it uses the O rule to specify that it is part
155of the C<%multiplicative> group of operators. The ":pirop" value specifies that the operator should
156compile to the C<mul> PIR opcode.
157
158Of course, the expression parser does not just parse operators, it must also
159parse the operands. So, how do we declare the most basic entity that represents
160an operand? It can be anything, from a basic integer-constant, a function call,
161or even a function definition (but adding two function definition doesn't
162really make sense, does it?). The operands are parsed in a recursive-descent
163fashion, so somewhere the parser must switch back from bottom-up
164(expression parsing) to top-down. This "switch-back" point is the proto token C<term>. This is the
165reason why integer constants are parsed by the rule term:sym<integer_constant>, for example, in our
166grammar.
167
168The C<term> proto token is
169invoked every time a new operand is needed
170
171=head2 Squaak Operators
172
173We have defined the entry and exit point of the expression (bottom-up) parser,
174now it's time to add the operators. Let's have a look at Squaak's operators and
175their precedence. The operators are listed with decreasing precedence (so that
176high-precedence operators are listed at the top). (I'm not sure if this
177precedence table is common compared to other languages; some operators may have
178a different precedence w.r.t. other operators than you're used to. At least the
179mathematical operators are organized according to standard math rules).
180
181 unary "-"
182 unary "not"
183 * / %
184 + - ..
185 < <= >= > != ==
186 and
187 or
188
189(".." is the string concatenation operator). Besides defining an entry and exit
190point for the expression parser, you need to define precedence levels for your operators. Find the
191C<INIT> block in Grammar.pm below the "## Operators" comment, and replace it with this:
192
193    INIT {
194        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
195        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
196        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
197        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
198        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
199        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
200        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
201    }
202
203Now, we need to define the actual operators:
204
205    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }
206    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
207    token infix:sym«<» { <sym> <O('%relational, :pirop<islt>')> }
208    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }
209    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }
210    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }
211    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }
212
213Note that some operators are missing. See the exercises section for this.
214
215=head2 Short-circuiting logical operators
216
217Squaak has two logical operators: C<and> and C<or>; and results true if and
218only if both operands evaluate to true, while or results true if at least one
219of its operands evaluates to true. Both operands are short-circuited, which
220means that they don't evaluate both operands if that's unnecessary. For
221instance, if the first operand of the and operator evaluates to false, then
222there's no need to evaluate the second operand, as the final result of the
223and-expression cannot become true anymore (remember: both operands must
224evaluate to true).Let's think about how to implement this. When evaluating an
225and-expression, we first evaluate the first operand, and if it's true, only
226then does it make sense to evaluate the second operand. This behavior looks
227very much the same as an if-statement, doesn't it? In an if-statement, the
228first child is always evaluated, and if true, the second child
229(the C<then> block) is evaluated (remember, the third child -- the C<else>
230clause -- is optional). It would be great to be able to implement the and
231operator using a C<PAST::Op( :pasttype('if') )> node. Well, you can, using
232the ":pasttype" option! Here's how:
233
234    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
235
236So what about the or operator? When evaluating an or-expression, the first
237operand is evaluated. If it evaluates to true, then there's no need to evaluate
238the second operand, as the result of the or-expression is already true! Only if
239the first operand evaluates to false, is it necessary to evaluate the second
240child. Mmmmm.... what we're saying here is, unless the first operand evaluates
241to true, evaluate the second child. Guess what pasttype you'd need for that!
242
243=head2 Operators PAST types and PIR instructions
244
245In the previous section, we introduced the C<pasttype> clause that you can
246specify. This means that for that operator (for instance, the C<and> operator
247we discussed), a C<PAST::Op( :pasttype('if') )> node is created. What happens
248if you don't specify a pasttype? In that case, the corresponding action method is called. Obviously,
249some languages have very exotic semantics for the C<+> operator,
250but many languages just want to use Parrot's built-in C<add> instruction. How
251do we achieve that?
252
253Instead of adding a C<pasttype> clause, specify a C<pirop> clause. The
254C<pirop>, or I<PIR operator>, clause tells the code generator what operator
255should be generated. Instead of generating a subroutine invocation with the
256operands as arguments, it will generate the specified instruction with the
257operator's operands as arguments. Neat huh? Let's look at an example:
258
259    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }
260
261This specifies to use the C<add> instruction, which tells Parrot to create a
262new result object instead of changing one of the operands. PCT
263just emits the following for this:
264
265 add $P12, $P10, $P11
266
267which means that the PMCs in registers C<$P10> and C<$P11> are added, and
268assigned to a newly created PMC which is stored in register C<$P12>.
269
270=head2 To circumfix or not to circumfix
271
272Squaak supports parenthesized expressions. Parentheses can be used to change
273the order of evaluation in an expression, just as you probably have seen in other languages. Besides
274infix, prefix and postfix operators, you can define circumfix operators, which is specified with the
275left and right delimiter. This is an ideal way to implement parenthesized expressions:
276
277    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }
278
279    # with the action method:
280    method circumfix:sym<( )> { make $<EXPR>.ast; }
281
282This rule and action method were generated for us when we ran mk_language_shell.pl; you don't need
283to add them to the grammar and actions yourself. Circumfix operators are treated as terms by the
284operator-precedence parser, so it will parse as we want it to automatically.
285
286=head2 Expression parser's action method
287
288For all grammar rules we introduced, we also introduced an action method that
289is invoked after the grammar rule was done matching. What about the action
290method for EXPR? Our Squaak::Actions class inherits that from HLL::Actions. We don't have to write
291one.
292
293=head2 What's Next?
294
295This episode covered the implementation of operators, which allows us to write
296complex expressions. By now, most of our language is implemented, except for
297one thing: aggregate data structures. This will be the topic of Episode 8. We
298will introduce the two aggregate data types: array and hashtables, and see how
299we can implement these. We'll also discuss what happens when we pass such
300aggregates as subroutine arguments, and the difference with the basic data
301types.
302
303=head2 Exercises
304
305=over 4
306
307=item *
308
309Currently, Squaak only has grammar rules for integer and string constants, not
310floating point constants. Implement this grammar rule. A floating-point number
311consists of zero or more digits, followed by a dot and at least one digit, or,
312at least one digit followed by a dot and any number of digits. Examples are:
313
314 42.0, 1., .0001.
315
316There may be no whitespace between the individual digits and the dot. Make sure
317you understand the difference between a "rule" and a "token".
318
319Hint: a floating-point constant should produce a value of type 'Float'.
320
321Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which
322matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As
323a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match.
324Since the part of a floating-point constant before the decimal place is the same as an integer
325constant, unless the token for floating-point constants has a longer name than the token for
326integer-constants, the latter will match and a syntax error will result.
327
328=item *
329
330Implement the missing operators: (binary) "-", "<=", ">=", "==", "!=", "/",
331"%", "or"
332
333=back
334
335=head2 References
336
337docs/pct/pct_optable_guide.pod
338
339=head2 Solution to the exercises
340
341=over 4
342
343=item 1
344
345A floating-point number consists of zero or more digits, followed by a dot
346and at least one digit, or, at least one digit followed by a dot and any
347number of digits. Examples are: 42.0, 1., .0001. There may be no whitespace
348between the individual digits and the dot. Make sure you understand the
349difference between a C<rule> and a C<token>.
350
351Hint: a floating-point constant should produce a value of type 'Float'.
352
353Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which
354matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As
355a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match.
356Since the part of a floating-point constant before the decimal place is the same as an integer
357constant, unless the token for floating-point constants has a longer name than the token for
358integer-constants, the latter will match and a syntax error will result.
359
360    token term:sym<float_constant_long> { # longer to work around lack of LTM
361        [
362        | \d+ '.' \d*
363        | \d* '.' \d+
364        ]
365    }
366
367    # with action method:
368    method term:sym<float_constant_long>($/) { # name worksaround lack of LTM
369        make PAST::Val.new(:value(+$/), :returns<Float>);
370    }
371
372=item 2
373
374For sake of completeness (and easy copy-paste for you), here's the list of
375operator declarations as I wrote them for Squaak:
376
377    INIT {
378        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
379        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
380        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
381        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
382        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
383        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
384        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
385    }
386
387    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }
388
389    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }
390    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }
391
392    token infix:sym<*>  { <sym> <O('%multiplicative, :pirop<mul>')> }
393    token infix:sym<%>  { <sym> <O('%multiplicative, :pirop<mod>')> }
394    token infix:sym</>  { <sym> <O('%multiplicative, :pirop<div>')> }
395
396    token infix:sym<+>  { <sym> <O('%additive, :pirop<add>')> }
397    token infix:sym<->  { <sym> <O('%additive, :pirop<sub>')> }
398    token infix:sym<..> { <sym> <O('%additive, :pirop<concat>')> }
399
400    token infix:sym«<» { <sym> <O('%relational, :pirop<isle iPP>')> }
401    token infix:sym«<=» { <sym> <O('%relational, :pirop<islt iPP>')> }
402    token infix:sym«>» { <sym> <O('%relational, :pirop<isgt iPP>')> }
403    token infix:sym«>=» { <sym> <O('%relational, :pirop<isge iPP>')> }
404    token infix:sym«==» { <sym> <O('%relational, :pirop<iseq iPP>')> }
405    token infix:sym«!=» { <sym> <O('%relational, :pirop<isne iPP>')> }
406
407    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
408    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }
409
410=back
411
412=cut
413