1# Copyright (C) 2008-2014, Parrot Foundation. 2=encoding utf8 3 4=head1 DESCRIPTION 5 6This is the seventh episode in a tutorial series on building a compiler with 7the Parrot Compiler Tools. 8 9=head1 Episode 7: Operators and Precedence 10 11Up till now, we've implemented a great deal of the Squaak language. We've seen 12assignments, control-flow statements, variable declarations and scope, 13subroutines and invocation. Our expressions have been limited so far to singular 14values, such as string literals and integer constants. In this episode, we'll 15enhance Squaak so it can handle operators, so you can construct more complex 16expressions. 17 18=head2 Operators, precedence and parse trees 19 20We will first briefly introduce the problem with recursive-descent parsers 21(which parsers generated with NQP are) when parsing expressions. Consider 22the following mini-grammar, which is a very basic calculator. 23 24 rule TOP { 25 <expression>* 26 } 27 28 rule expression { 29 <term> 30 } 31 32 rule term { 33 <factor> [ <addop> <factor> ]* 34 } 35 36 token addop { '+' | '-' } 37 38 rule factor { 39 <value> [ <mulop> <value> ]* 40 } 41 42 token mulop { '*' | '/' | '%' } 43 44 rule value{ 45 | <number> 46 | '(' <expression> ')' 47 } 48 49This basic expression grammar implements operator precedence by taking 50advantage of the nature of a recursive-descent parser (if you haven't seen 51the word, google it). However, the big disadvantage of parsing expressions 52this way, is that the parse trees can become quite large. Perhaps more 53importantly, the parsing process is not very efficient. Let's take a look at 54some sample input. We won't show the parse trees as shown in Episode 2, but 55we'll just show an outline. 56 57 input: 42 results in this parse tree: 58 59 TOP 60 expression 61 term 62 factor 63 value 64 number 65 42 66 67As you can see, the input of this single number will invoke 6 grammar rules 68 69before parsing the actual digits. Not that bad, you might think. 70 71 input: "1 + 2" results in this parse tree (we ignore the operator for now): 72 73 TOP 74 expression 75 term 76 factor 77 | value 78 | number 79 | 1 80 factor 81 value 82 number 83 2 84 85Only a few more grammar rules are invoked, not really a problem either. 86 87 input: "(1 + 2) * 3" results in this parse tree: 88 89 TOP 90 expression 91 term 92 factor 93 value 94 | expression 95 | term 96 | | factor 97 | | value 98 | | number 99 | | 1 100 | term 101 | factor 102 | value 103 | number 104 | 2 105 value 106 number 107 3 108 109Right; 16 grammar rules just to parse this simple input. I'd call this slightly 110inefficient. The point is, implementing operator precedence using a 111recursive-descent parser is somewhat problematic, and given the fact there 112are better methods to parse expressions like these, not the way to go. Check 113out this nice explanation or google it. 114 115=head2 Bottom-up parsing and stacks: operator tables 116 117I would like to explain to you how bottom-up parsing works for expressions 118(or bottom-up parsers in general; Yacc/Bison are parser generators that generate 119bottom-up parsers for your grammar specification), taking operator precedence 120into account. However, it's been about 6 years that I did this in a CS class, 121and I don't remember the particular details. If you really want to know, check 122out the links at the end of the previous section. It's actually worth checking 123out. For now, I'll just assume you know what the problem is, so that I'll 124introduce the solution for NQP-based compilers immediately. 125At some point when parsing your input, you might encounter an expression. At 126this point, we'd like the parser to switch from top-down to bottom-up parsing. 127NQP-rx supports this, and is used as follows: 128 129 <EXPR> 130 131Of course, the optable must be populated with some operators that 132we need to be able to parse and it might be told what precedence and associativity they have. The 133easiest way to do this is by setting up precedence levels in an C<INIT> block: 134 135 INIT { 136 Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive'); 137 Squaak::Grammar.O(':prec<u>, :assoc<lefT>', '%multiplicative'); 138 } 139 140In this C<INIT> block, we use the C<O> method of the compiler to set up two precedence levels: one 141for operators like addition (named C<%additive>), and one for operators like multiplication (named 142C<%multiplicative>). Each of them has a ":prec" value and an ":assoc" value. ":prec" determines the 143precedence. Lexicographically greater values indicate higher precedence, so C<%additive> operators, 144with a precedence value of "t", have lower precedence than C<%multiplicative> operators with a 145precedence value of "u".":assoc" defines the associativity of the operators. If C<@> is a left 146associative operator, then 1 @ 2 @ 3 is equivalent to (1 @ 2) @ 3. However, if C<@> is right 147associative, then 1 @ 2 @ 3 is equivalent to 1 @ (2 @ 3). There are other options for the 148associativity, but we'll discuss them as we come to them. 149 150 token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> } 151 152This defines the operator C<*> (the C<infix:> is a prefix that tells the 153operator parser that this operator is an infix operator; there are other types, 154such as prefix, postfix and others). As you can see, it uses the O rule to specify that it is part 155of the C<%multiplicative> group of operators. The ":pirop" value specifies that the operator should 156compile to the C<mul> PIR opcode. 157 158Of course, the expression parser does not just parse operators, it must also 159parse the operands. So, how do we declare the most basic entity that represents 160an operand? It can be anything, from a basic integer-constant, a function call, 161or even a function definition (but adding two function definition doesn't 162really make sense, does it?). The operands are parsed in a recursive-descent 163fashion, so somewhere the parser must switch back from bottom-up 164(expression parsing) to top-down. This "switch-back" point is the proto token C<term>. This is the 165reason why integer constants are parsed by the rule term:sym<integer_constant>, for example, in our 166grammar. 167 168The C<term> proto token is 169invoked every time a new operand is needed 170 171=head2 Squaak Operators 172 173We have defined the entry and exit point of the expression (bottom-up) parser, 174now it's time to add the operators. Let's have a look at Squaak's operators and 175their precedence. The operators are listed with decreasing precedence (so that 176high-precedence operators are listed at the top). (I'm not sure if this 177precedence table is common compared to other languages; some operators may have 178a different precedence w.r.t. other operators than you're used to. At least the 179mathematical operators are organized according to standard math rules). 180 181 unary "-" 182 unary "not" 183 * / % 184 + - .. 185 < <= >= > != == 186 and 187 or 188 189(".." is the string concatenation operator). Besides defining an entry and exit 190point for the expression parser, you need to define precedence levels for your operators. Find the 191C<INIT> block in Grammar.pm below the "## Operators" comment, and replace it with this: 192 193 INIT { 194 Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate'); 195 Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not'); 196 Squaak::Grammar.O(':prec<u>, :assoc<left>', '%multiplicative'); 197 Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive'); 198 Squaak::Grammar.O(':prec<s>, :assoc<left>', '%relational'); 199 Squaak::Grammar.O(':prec<r>, :assoc<left>', '%conjunction'); 200 Squaak::Grammar.O(':prec<q>, :assoc<left>', '%disjunction'); 201 } 202 203Now, we need to define the actual operators: 204 205 token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> } 206 token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> } 207 token infix:sym«<» { <sym> <O('%relational, :pirop<islt>')> } 208 token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> } 209 token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> } 210 token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> } 211 token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> } 212 213Note that some operators are missing. See the exercises section for this. 214 215=head2 Short-circuiting logical operators 216 217Squaak has two logical operators: C<and> and C<or>; and results true if and 218only if both operands evaluate to true, while or results true if at least one 219of its operands evaluates to true. Both operands are short-circuited, which 220means that they don't evaluate both operands if that's unnecessary. For 221instance, if the first operand of the and operator evaluates to false, then 222there's no need to evaluate the second operand, as the final result of the 223and-expression cannot become true anymore (remember: both operands must 224evaluate to true).Let's think about how to implement this. When evaluating an 225and-expression, we first evaluate the first operand, and if it's true, only 226then does it make sense to evaluate the second operand. This behavior looks 227very much the same as an if-statement, doesn't it? In an if-statement, the 228first child is always evaluated, and if true, the second child 229(the C<then> block) is evaluated (remember, the third child -- the C<else> 230clause -- is optional). It would be great to be able to implement the and 231operator using a C<PAST::Op( :pasttype('if') )> node. Well, you can, using 232the ":pasttype" option! Here's how: 233 234 token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> } 235 236So what about the or operator? When evaluating an or-expression, the first 237operand is evaluated. If it evaluates to true, then there's no need to evaluate 238the second operand, as the result of the or-expression is already true! Only if 239the first operand evaluates to false, is it necessary to evaluate the second 240child. Mmmmm.... what we're saying here is, unless the first operand evaluates 241to true, evaluate the second child. Guess what pasttype you'd need for that! 242 243=head2 Operators PAST types and PIR instructions 244 245In the previous section, we introduced the C<pasttype> clause that you can 246specify. This means that for that operator (for instance, the C<and> operator 247we discussed), a C<PAST::Op( :pasttype('if') )> node is created. What happens 248if you don't specify a pasttype? In that case, the corresponding action method is called. Obviously, 249some languages have very exotic semantics for the C<+> operator, 250but many languages just want to use Parrot's built-in C<add> instruction. How 251do we achieve that? 252 253Instead of adding a C<pasttype> clause, specify a C<pirop> clause. The 254C<pirop>, or I<PIR operator>, clause tells the code generator what operator 255should be generated. Instead of generating a subroutine invocation with the 256operands as arguments, it will generate the specified instruction with the 257operator's operands as arguments. Neat huh? Let's look at an example: 258 259 token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> } 260 261This specifies to use the C<add> instruction, which tells Parrot to create a 262new result object instead of changing one of the operands. PCT 263just emits the following for this: 264 265 add $P12, $P10, $P11 266 267which means that the PMCs in registers C<$P10> and C<$P11> are added, and 268assigned to a newly created PMC which is stored in register C<$P12>. 269 270=head2 To circumfix or not to circumfix 271 272Squaak supports parenthesized expressions. Parentheses can be used to change 273the order of evaluation in an expression, just as you probably have seen in other languages. Besides 274infix, prefix and postfix operators, you can define circumfix operators, which is specified with the 275left and right delimiter. This is an ideal way to implement parenthesized expressions: 276 277 token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' } 278 279 # with the action method: 280 method circumfix:sym<( )> { make $<EXPR>.ast; } 281 282This rule and action method were generated for us when we ran mk_language_shell.pl; you don't need 283to add them to the grammar and actions yourself. Circumfix operators are treated as terms by the 284operator-precedence parser, so it will parse as we want it to automatically. 285 286=head2 Expression parser's action method 287 288For all grammar rules we introduced, we also introduced an action method that 289is invoked after the grammar rule was done matching. What about the action 290method for EXPR? Our Squaak::Actions class inherits that from HLL::Actions. We don't have to write 291one. 292 293=head2 What's Next? 294 295This episode covered the implementation of operators, which allows us to write 296complex expressions. By now, most of our language is implemented, except for 297one thing: aggregate data structures. This will be the topic of Episode 8. We 298will introduce the two aggregate data types: array and hashtables, and see how 299we can implement these. We'll also discuss what happens when we pass such 300aggregates as subroutine arguments, and the difference with the basic data 301types. 302 303=head2 Exercises 304 305=over 4 306 307=item * 308 309Currently, Squaak only has grammar rules for integer and string constants, not 310floating point constants. Implement this grammar rule. A floating-point number 311consists of zero or more digits, followed by a dot and at least one digit, or, 312at least one digit followed by a dot and any number of digits. Examples are: 313 314 42.0, 1., .0001. 315 316There may be no whitespace between the individual digits and the dot. Make sure 317you understand the difference between a "rule" and a "token". 318 319Hint: a floating-point constant should produce a value of type 'Float'. 320 321Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which 322matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As 323a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match. 324Since the part of a floating-point constant before the decimal place is the same as an integer 325constant, unless the token for floating-point constants has a longer name than the token for 326integer-constants, the latter will match and a syntax error will result. 327 328=item * 329 330Implement the missing operators: (binary) "-", "<=", ">=", "==", "!=", "/", 331"%", "or" 332 333=back 334 335=head2 References 336 337docs/pct/pct_optable_guide.pod 338 339=head2 Solution to the exercises 340 341=over 4 342 343=item 1 344 345A floating-point number consists of zero or more digits, followed by a dot 346and at least one digit, or, at least one digit followed by a dot and any 347number of digits. Examples are: 42.0, 1., .0001. There may be no whitespace 348between the individual digits and the dot. Make sure you understand the 349difference between a C<rule> and a C<token>. 350 351Hint: a floating-point constant should produce a value of type 'Float'. 352 353Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which 354matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As 355a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match. 356Since the part of a floating-point constant before the decimal place is the same as an integer 357constant, unless the token for floating-point constants has a longer name than the token for 358integer-constants, the latter will match and a syntax error will result. 359 360 token term:sym<float_constant_long> { # longer to work around lack of LTM 361 [ 362 | \d+ '.' \d* 363 | \d* '.' \d+ 364 ] 365 } 366 367 # with action method: 368 method term:sym<float_constant_long>($/) { # name worksaround lack of LTM 369 make PAST::Val.new(:value(+$/), :returns<Float>); 370 } 371 372=item 2 373 374For sake of completeness (and easy copy-paste for you), here's the list of 375operator declarations as I wrote them for Squaak: 376 377 INIT { 378 Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate'); 379 Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not'); 380 Squaak::Grammar.O(':prec<u>, :assoc<left>', '%multiplicative'); 381 Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive'); 382 Squaak::Grammar.O(':prec<s>, :assoc<left>', '%relational'); 383 Squaak::Grammar.O(':prec<r>, :assoc<left>', '%conjunction'); 384 Squaak::Grammar.O(':prec<q>, :assoc<left>', '%disjunction'); 385 } 386 387 token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' } 388 389 token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> } 390 token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> } 391 392 token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> } 393 token infix:sym<%> { <sym> <O('%multiplicative, :pirop<mod>')> } 394 token infix:sym</> { <sym> <O('%multiplicative, :pirop<div>')> } 395 396 token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> } 397 token infix:sym<-> { <sym> <O('%additive, :pirop<sub>')> } 398 token infix:sym<..> { <sym> <O('%additive, :pirop<concat>')> } 399 400 token infix:sym«<» { <sym> <O('%relational, :pirop<isle iPP>')> } 401 token infix:sym«<=» { <sym> <O('%relational, :pirop<islt iPP>')> } 402 token infix:sym«>» { <sym> <O('%relational, :pirop<isgt iPP>')> } 403 token infix:sym«>=» { <sym> <O('%relational, :pirop<isge iPP>')> } 404 token infix:sym«==» { <sym> <O('%relational, :pirop<iseq iPP>')> } 405 token infix:sym«!=» { <sym> <O('%relational, :pirop<isne iPP>')> } 406 407 token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> } 408 token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> } 409 410=back 411 412=cut 413