<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
   "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<book id="happy">
  <bookinfo>
    <date>2001-4-27</date>
    <title>Happy User Guide</title>
    <author>
      <firstname>Simon</firstname>
      <surname>Marlow</surname>
    </author>
    <author>
      <firstname>Andy</firstname>
      <surname>Gill</surname>
    </author>
    <address><email>simonmar@microsoft.com</email></address>
    <copyright>
      <year>1997-2009</year>
      <holder>Simon Marlow</holder>
    </copyright>
    <abstract>
      <para>This document describes Happy, the Haskell Parser
      Generator, version 1.18.</para>
    </abstract>
  </bookinfo>

  <!-- Table of contents -->
  <toc></toc>

<!-- Introduction ========================================================= -->

  <chapter id="happy-introduction">
    <title>Introduction</title>

    <para> <application>Happy</application> is a parser generator
    system for Haskell, similar to the tool
    <application>yacc</application> for C.  Like
    <application>yacc</application>, it takes a file containing an
    annotated BNF specification of a grammar and produces a Haskell
    module containing a parser for the grammar. </para>

    <indexterm><primary>yacc</primary></indexterm>

    <para> <application>Happy</application> is flexible: you can have several
    <application>Happy</application> parsers in the same program, and
    each parser may have multiple entry points.
    <application>Happy</application> can work in conjunction with a
    lexical analyser supplied by the user (either hand-written or
    generated by another program), or it can parse a stream of
    characters directly (but this isn't practical in most cases).  In
    a future version we hope to include a lexical analyser generator
    with <application>Happy</application> as a single package.
    </para>

    <para> Parsers generated by <application>Happy</application> are
    fast, and generally faster than an equivalent parser written using
    parsing combinators or similar tools.  Furthermore, any future
    improvements made to <application>Happy</application> will benefit
    an existing grammar, without need for a rewrite. </para>

    <para> <application>Happy</application> is sufficiently powerful
    to parse full Haskell:
    <ulink url="http://www.haskell.org/ghc">GHC</ulink> itself uses
    a Happy parser.</para>

    <indexterm><primary><literal>hsparser</literal></primary></indexterm>
    <indexterm>
      <primary>Haskell parser</primary>
      <see><literal>hsparser</literal></see>
    </indexterm>

    <para> <application>Happy</application> can currently generate
    four types of parser from a given grammar, the intention being
    that we can experiment with different kinds of functional code to
    see which is the best, and compiler writers can use the different
    types of parser to tune their compilers.  The types of parser
    supported are: </para>

    <orderedlist>

      <listitem id="item-default-backend">
        <para><quote>standard</quote> Haskell 98 (should work with any compiler
        that compiles Haskell 98).</para>
      </listitem>

      <listitem>
        <para>standard Haskell using arrays
        <indexterm scope="all"><primary>arrays</primary></indexterm>
        <indexterm scope="all"><primary>back-ends</primary><secondary>arrays</secondary></indexterm>
        (this is not the default
        because we have found this generates slower parsers than <xref
        linkend="item-default-backend"/>).</para>
      </listitem>

      <listitem>
        <para>Haskell with GHC
        <indexterm><primary>GHC</primary></indexterm>
        <indexterm><primary>back-ends</primary><secondary>GHC</secondary></indexterm>
        (Glasgow Haskell) extensions.
        This is a
        slightly faster option than <xref
        linkend="item-default-backend"/> for Glasgow Haskell
        users.</para>
      </listitem>

      <listitem>
        <para>GHC Haskell with string-encoded arrays.  This is the
        fastest/smallest option for GHC users.  If you're using GHC,
        the optimum flag settings are <literal>-agc</literal> (see
        <xref linkend="sec-invoking"/>).</para>
      </listitem>

    </orderedlist>

    <para>Happy can also generate parsers which will dump debugging
    information at run time, showing state transitions and the input
    tokens to the parser.</para>

    <sect1 id="sec-compatibility">
      <title>Compatibility</title>

      <para> <application>Happy</application> is written in Glasgow Haskell.  This
      means that (for the time being), you need GHC to compile it.
      Any version of GHC >= 6.2 should work.</para>

      <para> Remember: parsers produced using
      <application>Happy</application> should compile without
      difficulty under any Haskell 98 compiler or interpreter.<footnote><para>With one
      exception: if you have a production with a polymorphic type signature,
      then a compiler that supports local universal quantification is
      required.  See <xref linkend="sec-type-signatures" />.</para>
      </footnote></para>
    </sect1>

    <sect1 id="sec-reporting-bugs">
      <title>Reporting Bugs</title>

      <indexterm>
        <primary>bugs, reporting</primary>
      </indexterm>

      <para> Any bugs found in <application>Happy</application> should
      be reported to me: Simon Marlow
      <email>marlowsd@gmail.com</email> including all the relevant
      information: the compiler used to compile
      <application>Happy</application>, the command-line options used,
      your grammar file or preferably a cut-down example showing the
      problem, and a description of what goes wrong.  A patch to fix
      the problem would also be greatly appreciated.
      </para>

      <para> Requests for new features should also be sent to the
      above address, especially if accompanied by patches :-).</para>

    </sect1>

    <sect1 id="sec-license">
      <title>License</title>

      <indexterm>
        <primary>License</primary>
      </indexterm>

      <para> Previous versions of <application>Happy</application>
      were covered by the GNU General Public License.  We're now
      distributing <application>Happy</application> with a less
      restrictive BSD-style license.  If this license doesn't work for
      you, please get in touch.</para>

      <blockquote>
        <para> Copyright 2009, Simon Marlow and Andy Gill.  All rights
        reserved. </para>

        <para> Redistribution and use in source and binary forms, with
        or without modification, are permitted provided that the
        following conditions are met: </para>

        <itemizedlist>
          <listitem>
            <para>Redistributions of source code must retain the above
            copyright notice, this list of conditions and the
            following disclaimer.</para>
          </listitem>

          <listitem>
            <para> Redistributions in binary form must reproduce the
            above copyright notice, this list of conditions and the
            following disclaimer in the documentation and/or other
            materials provided with the distribution.</para>
          </listitem>
        </itemizedlist>

        <para>THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS
        IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
        LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
        FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
        IN NO EVENT
        SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY DIRECT,
        INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
        SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
        OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
        LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
        (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
        THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
        OF SUCH DAMAGE.</para>
      </blockquote>
    </sect1>

    <sect1 id="sec-obtaining">
      <title>Obtaining <application>Happy</application></title>

      <para> <application>Happy</application>'s web page can be found at <ulink
      url="http://www.haskell.org/happy/">http://www.haskell.org/happy/</ulink>.
      <application>Happy</application> source and binaries can be downloaded from
      there.</para>

    </sect1>

  </chapter>

<!-- Using Happy =========================================================== -->

  <chapter id="sec-using">
    <title>Using <application>Happy</application></title>

    <para> Users of <application>Yacc</application> will find
    <application>Happy</application> quite familiar.  The basic idea is
    as follows: </para>

    <itemizedlist>
      <listitem>
        <para>Define the grammar you want to parse in a
        <application>Happy</application> grammar file. </para>
      </listitem>

      <listitem>
        <para> Run the grammar through <application>Happy</application>, to generate
        a compilable Haskell module.</para>
      </listitem>

      <listitem>
        <para> Use this module as part of your Haskell program, usually
        in conjunction with a lexical analyser (a function that splits
        the input into <quote>tokens</quote>, the basic unit of parsing).</para>
      </listitem>
    </itemizedlist>

    <para> Let's run through an example.
    We'll implement a parser for a
    simple expression syntax, consisting of integers, variables, the
    operators <literal>+</literal>, <literal>-</literal>, <literal>*</literal>,
    <literal>/</literal>, and the form <literal>let var = exp in exp</literal>.
    The grammar file starts off like this:</para>

<programlisting>
{
module Main where
-- Character predicates used by the lexer at the end of the file.
import Data.Char (isSpace, isAlpha, isDigit)
}
</programlisting>

    <para>At the top of the file is an optional <firstterm>module
    header</firstterm>,
      <indexterm>
        <primary>module</primary>
        <secondary>header</secondary>
      </indexterm>
    which is just a Haskell module header enclosed in braces.  This
    code is emitted verbatim into the generated module, so you can put
    any Haskell code here at all, such as the import needed by the
    lexer.  In a grammar file, Haskell code is
    always contained between curly braces to distinguish it from the
    grammar.</para>

    <para>In this case, the parser will be a standalone program so
    we'll call the module <literal>Main</literal>.</para>

    <para>Next come a couple of declarations:</para>

<programlisting>
%name calc
%tokentype { Token }
%error { parseError }
</programlisting>

    <indexterm>
      <primary><literal>%name</literal></primary>
    </indexterm>
    <indexterm>
      <primary><literal>%tokentype</literal></primary>
    </indexterm>
    <indexterm>
      <primary><literal>%error</literal></primary>
    </indexterm>

    <para>The first line declares the name of the parsing function
    that <application>Happy</application> will generate, in this case
    <literal>calc</literal>.  In many cases, this is the only symbol you need
    to export from the module.</para>

    <para>The second line declares the type of tokens that the parser
    will accept.  The parser (i.e.
    the function
    <function>calc</function>) will be of type <literal>[Token] ->
    T</literal>, where <literal>T</literal> is the return type of the
    parser, determined by the production rules below.</para>

    <para>The <literal>%error</literal> directive tells Happy the name
    of a function it should call in the event of a parse error.  More
    about this later.</para>

    <para>Now we declare all the possible tokens:</para>

<programlisting>
%token
      let             { TokenLet }
      in              { TokenIn }
      int             { TokenInt $$ }
      var             { TokenVar $$ }
      '='             { TokenEq }
      '+'             { TokenPlus }
      '-'             { TokenMinus }
      '*'             { TokenTimes }
      '/'             { TokenDiv }
      '('             { TokenOB }
      ')'             { TokenCB }
</programlisting>

    <indexterm>
      <primary><literal>%token</literal></primary>
    </indexterm>

    <para>The symbols on the left are the tokens as they will be
    referred to in the rest of the grammar, and to the right of each
    token enclosed in braces is a Haskell pattern that matches the
    token.  The parser will expect to receive a stream of tokens, each
    of which will match one of the given patterns (the definition of
    the <literal>Token</literal> datatype is given later).</para>

    <para>The <literal>$$</literal> symbol is a placeholder that
    represents the <emphasis>value</emphasis> of this token.  Normally the value
    of a token is the token itself, but by using the
    <literal>$$</literal> symbol you can specify some component
    of the token object to be the value.
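    As a rough sketch of the idea (plain Haskell for illustration
    only: <literal>Value</literal> and <literal>tokenValue</literal>
    are hypothetical names, and <application>Happy</application> does
    not generate such a function), the effect of <literal>$$</literal>
    can be pictured as a projection from a token to its value:
<programlisting>
-- A cut-down version of the Token type defined later in this chapter.
data Token
      = TokenLet
      | TokenInt Int
      | TokenVar String
      deriving (Show, Eq)

-- Hypothetical type, for illustration only: the different kinds of
-- value that a production can see for a token.
data Value
      = IntValue Int        -- int { TokenInt $$ }: the Int component
      | VarValue String     -- var { TokenVar $$ }: the String component
      | TokenValue Token    -- no $$ in the pattern: the token itself
      deriving (Show, Eq)

-- Hypothetical helper mirroring the token declarations above.
tokenValue :: Token -> Value
tokenValue (TokenInt n) = IntValue n
tokenValue (TokenVar s) = VarValue s
tokenValue t            = TokenValue t
</programlisting>
    So for an <literal>int</literal> token a production sees the
    <literal>Int</literal> carried by the token, whereas for
    <literal>let</literal> it sees the whole <literal>TokenLet</literal>
    token.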
    </para>

    <indexterm>
      <primary><literal>$$</literal></primary>
    </indexterm>

    <para>Like yacc, we include <literal>%%</literal> here, for no real
    reason.</para>

<programlisting>
%%
</programlisting>

    <para>Now we have the production rules for the grammar.</para>

<programlisting>
Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
      | Exp1                    { Exp1 $1 }

Exp1  : Exp1 '+' Term           { Plus $1 $3 }
      | Exp1 '-' Term           { Minus $1 $3 }
      | Term                    { Term $1 }

Term  : Term '*' Factor         { Times $1 $3 }
      | Term '/' Factor         { Div $1 $3 }
      | Factor                  { Factor $1 }

Factor
      : int                     { Int $1 }
      | var                     { Var $1 }
      | '(' Exp ')'             { Brack $2 }
</programlisting>

    <indexterm>
      <primary>non-terminal</primary>
    </indexterm>
    <para>Each production consists of a <firstterm>non-terminal</firstterm>
    symbol on the left, followed by a colon, followed by one or more
    expansions on the right, separated by <literal>|</literal>.  Each expansion
    has some Haskell code associated with it, enclosed in braces as
    usual.</para>

    <para>The way to think about a parser is with each symbol having a
    <quote>value</quote>: we defined the values of the tokens above, and the
    grammar defines the values of non-terminal symbols in terms of
    sequences of other symbols (either tokens or non-terminals).  In a
    production like this:</para>

<programlisting>
n   : t_1 ...
t_n   { E }
</programlisting>

    <para>whenever the parser finds the symbols <literal>t_1...t_n</literal> in
    the token stream, it constructs the symbol <literal>n</literal> and gives
    it the value <literal>E</literal>, which may refer to the values of
    <literal>t_1...t_n</literal> using the symbols
    <literal>$1...$n</literal>.</para>

    <para>The parser reduces the input using the rules in the grammar
    until just one symbol remains: the first symbol defined in the
    grammar (namely <literal>Exp</literal> in our example).  The value of this
    symbol is the return value from the parser.</para>

    <para>To complete the program, we need some extra code.  The
    grammar file may optionally contain a final code section, enclosed
    in curly braces.</para>

<programlisting>{</programlisting>

    <para>All parsers must include a function to be called in the
    event of a parse error.  In the <literal>%error</literal>
    directive earlier, we specified that the function to be called on
    a parse error is <literal>parseError</literal>:</para>

<programlisting>
parseError :: [Token] -> a
parseError _ = error "Parse error"
</programlisting>

    <para>Note that <literal>parseError</literal> must be polymorphic
    in its return type <literal>a</literal>, which usually means it
    must be a call to <literal>error</literal>.  We'll see in <xref
    linkend="sec-monads"/> how to wrap the parser in a monad so that we
    can do something more sensible with errors.
    It's also possible to
    keep track of line numbers in the parser for use in error
    messages; this is described in <xref
    linkend="sec-line-numbers"/>.</para>

    <para>Next we can declare the data type that represents the parsed
    expression:</para>

<programlisting>
data Exp
      = Let String Exp Exp
      | Exp1 Exp1
      deriving Show

data Exp1
      = Plus Exp1 Term
      | Minus Exp1 Term
      | Term Term
      deriving Show

data Term
      = Times Term Factor
      | Div Term Factor
      | Factor Factor
      deriving Show

data Factor
      = Int Int
      | Var String
      | Brack Exp
      deriving Show
</programlisting>

    <para>And the data structure for the tokens...</para>

<programlisting>
data Token
      = TokenLet
      | TokenIn
      | TokenInt Int
      | TokenVar String
      | TokenEq
      | TokenPlus
      | TokenMinus
      | TokenTimes
      | TokenDiv
      | TokenOB
      | TokenCB
      deriving Show
</programlisting>

    <para>... and a simple lexer that returns this data
    structure.</para>

<programlisting>
-- isSpace, isAlpha and isDigit come from Data.Char, which must be
-- imported in the module header at the top of the grammar file.
lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
      | isSpace c = lexer cs            -- skip whitespace
      | isAlpha c = lexVar (c:cs)       -- keyword or variable
      | isDigit c = lexNum (c:cs)       -- integer literal
-- If no guard matches, fall through to the single-character tokens.
lexer ('=':cs) = TokenEq : lexer cs
lexer ('+':cs) = TokenPlus : lexer cs
lexer ('-':cs) = TokenMinus : lexer cs
lexer ('*':cs) = TokenTimes : lexer cs
lexer ('/':cs) = TokenDiv : lexer cs
lexer ('(':cs) = TokenOB : lexer cs
lexer (')':cs) = TokenCB : lexer cs

lexNum cs = TokenInt (read num) : lexer rest
      where (num,rest) = span isDigit cs

lexVar cs =
   case span isAlpha cs of
      ("let",rest) -> TokenLet : lexer rest
      ("in",rest)  -> TokenIn : lexer rest
      (var,rest)   -> TokenVar var : lexer rest
</programlisting>

    <para>And finally a top-level function to take some input, parse
    it, and print out the result.</para>

<programlisting>
main = getContents >>= print . calc .
lexer
}
</programlisting>

    <para>And that's it! A whole lexer, parser and grammar in a few
    dozen lines.  Another good example is <application>Happy</application>'s own
    parser. Several features in <application>Happy</application> were developed
    using this as an example.</para>

    <indexterm>
      <primary>info file</primary>
    </indexterm>

    <para>To generate the Haskell module for this parser, type the
    command <command>happy example.y</command> (where
    <filename>example.y</filename> is the name of the grammar file).
    The Haskell module will be placed in a file named
    <filename>example.hs</filename>.  Additionally, invoking the
    command <command>happy example.y -i</command> will produce the
    file <filename>example.info</filename> which contains detailed information
    about the parser, including states and reduction rules (see <xref
    linkend="sec-info-files"/>).  This can be invaluable for debugging
    parsers, but requires some knowledge of the operation of a
    shift-reduce parser. </para>

    <sect1 id="sec-other-datatypes">
      <title>Returning other datatypes</title>

      <para>In the above example, we used a data type to represent the
      syntax being parsed.  However, there's no reason why it has to
      be this way: you could calculate the value of the expression on
      the fly, using productions like this:</para>

<programlisting>
Term  : Term '*' Factor         { $1 * $3 }
      | Term '/' Factor         { $1 `div` $3 }
      | Factor                  { $1 }
</programlisting>

      <para>The value of a <literal>Term</literal> would be the value of the
      expression itself, and the parser could return an integer.  </para>

      <para>This works for simple expression types, but our grammar
      includes variables and the <literal>let</literal> syntax.  How do we know
      the value of a variable while we're parsing it?
      We don't, but
      since the Haskell code for a production can be anything at all,
      we could make it a function that takes an environment of
      variable values, and returns the computed value of the
      expression:</para>

<programlisting>
Exp   : let var '=' Exp in Exp  { \p -> $6 (($2,$4 p):p) }
      | Exp1                    { $1 }

Exp1  : Exp1 '+' Term           { \p -> $1 p + $3 p }
      | Exp1 '-' Term           { \p -> $1 p - $3 p }
      | Term                    { $1 }

Term  : Term '*' Factor         { \p -> $1 p * $3 p }
      | Term '/' Factor         { \p -> $1 p `div` $3 p }
      | Factor                  { $1 }

Factor
      : int                     { \p -> $1 }
      | var                     { \p -> case lookup $1 p of
                                            Nothing -> error "no var"
                                            Just i  -> i }
      | '(' Exp ')'             { $2 }
</programlisting>

      <para>The value of each production is a function from an
      environment <emphasis>p</emphasis> to a value.  When parsing a
      <literal>let</literal> construct, we extend the environment with the new
      binding to find the value of the body, and the rule for
      <literal>var</literal> looks up its value in the environment.  There's
      something you can't do in <literal>yacc</literal> :-)</para>

    </sect1>

    <sect1 id="sec-sequences">
      <title>Parsing sequences</title>

      <para>A common feature in grammars is a <emphasis>sequence</emphasis> of a
      particular syntactic element.  In EBNF, we'd write something
      like <literal>n+</literal> to represent a sequence of one or more
      <literal>n</literal>s, and <literal>n*</literal> for zero or more.
      <application>Happy</application> doesn't support this syntax explicitly, but
      you can define the equivalent sequences using simple
      productions.</para>

      <para>For example, the grammar for <application>Happy</application> itself
      contains a rule like this:</para>

<programlisting>
prods : prod                   { [$1] }
      | prods prod             { $2 : $1 }
</programlisting>

      <para>In other words, a sequence of productions is either a
      single production, or a sequence of productions followed by a
      single production.  This recursive rule defines a sequence of
      one or more productions.</para>

      <para>One thing to note about this rule is that we used
      <emphasis>left recursion</emphasis> to define it; we could equally have
      written it using right recursion, like this:</para>

      <indexterm>
        <primary>recursion, left vs. right</primary>
      </indexterm>

<programlisting>
prods : prod                  { [$1] }
      | prod prods            { $1 : $2 }
</programlisting>

      <para>The only reason we used left recursion is that
      <application>Happy</application> is more efficient at parsing left-recursive
      rules; they result in a constant stack-space parser, whereas
      right-recursive rules require stack space proportional to the
      length of the list being parsed.  This can be extremely
      important where long sequences are involved, for instance in
      automatically generated output.  For example, the parser in GHC
      used to use right-recursion to parse lists, and as a result it
      failed to parse some <application>Happy</application>-generated modules due
      to running out of stack space!</para>

      <para>One implication of using left recursion is that the resulting
      list comes out reversed, and you have to reverse it again to get
      it in the original order.
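      One way to do this, sketched here with a hypothetical wrapper
      non-terminal <literal>prodList</literal> that is not part of the
      grammar shown above, is to apply <literal>reverse</literal> once,
      at the point where the complete sequence is used:
<programlisting>
prodList : prods               { reverse $1 }

prods : prod                   { [$1] }
      | prods prod             { $2 : $1 }
</programlisting>
      The rest of the grammar would then refer to
      <literal>prodList</literal> rather than to <literal>prods</literal>
      directly, so the list is reversed once rather than at every step.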
      Take a look at the
      <application>Happy</application> grammar for Haskell for many examples of
      this.</para>

      <para>Parsing sequences of zero or more elements requires a
      trivial change to the above pattern:</para>

<programlisting>
prods : {- empty -}           { [] }
      | prods prod            { $2 : $1 }
</programlisting>

      <para>Yes, empty productions are allowed.  The normal
      convention is to include the comment <literal>{- empty -}</literal> to
      make it more obvious to a reader of the code what's going
      on.</para>

      <sect2 id="sec-separators">
        <title>Sequences with separators</title>

        <para>A common type of sequence is one with a
        <emphasis>separator</emphasis>: for instance function bodies in C
        consist of statements separated by semicolons.  To parse this
        kind of sequence we use a production like this:</para>

<programlisting>
stmts : stmt                   { [$1] }
      | stmts ';' stmt         { $3 : $1 }
</programlisting>

        <para>If the <literal>;</literal> is to be a <emphasis>terminator</emphasis>
        rather than a separator (i.e. there should be one following
        each statement), we can remove the semicolon from the above
        rule and redefine <literal>stmt</literal> as</para>

<programlisting>
stmt : stmt1 ';'              { $1 }
</programlisting>

        <para>where <literal>stmt1</literal> is the real definition of statements.</para>

        <para>We might like to allow extra semicolons between
        statements, to be a bit more liberal in what we allow as legal
        syntax.  We probably just want the parser to ignore these
        extra semicolons, and not generate a <quote>null statement</quote> value
        or something.
        The following rule parses a sequence of zero or
        more statements separated by semicolons, in which the
        statements may be empty:</para>

<programlisting>
stmts : stmts ';' stmt          { $3 : $1 }
      | stmts ';'               { $1 }
      | stmt                    { [$1] }
      | {- empty -}             { [] }
</programlisting>

        <para>Parsing sequences of <emphasis>one</emphasis> or more possibly
        null statements is left as an exercise for the reader...</para>

      </sect2>
    </sect1>

<!--
    <sect1 id="sec-ambiguities">
      <title>Ambiguities</title>

      <para>(section under construction)</para>

    </sect1>
-->

    <sect1 id="sec-Precedences">
      <title>Using Precedences</title>
      <indexterm><primary>precedences</primary></indexterm>
      <indexterm><primary>associativity</primary></indexterm>

      <para>Going back to our earlier expression-parsing example,
      wouldn't it be nicer if we didn't have to explicitly separate
      the expressions into terms and factors, merely to make it
      clear that <literal>'*'</literal> and <literal>'/'</literal>
      operators bind more tightly than <literal>'+'</literal> and
      <literal>'-'</literal>?</para>

      <para>We could just change the grammar as follows (making the
      appropriate changes to the expression datatype too):</para>

<programlisting>
Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
      | Exp '+' Exp             { Plus $1 $3 }
      | Exp '-' Exp             { Minus $1 $3 }
      | Exp '*' Exp             { Times $1 $3 }
      | Exp '/' Exp             { Div $1 $3 }
      | '(' Exp ')'             { Brack $2 }
      | int                     { Int $1 }
      | var                     { Var $1 }
</programlisting>

      <para>but now Happy will complain that there are shift/reduce
      conflicts because the grammar is ambiguous: we haven't
      specified whether e.g. <literal>1 + 2 * 3</literal> is to be
      parsed as <literal>1 + (2 * 3)</literal> or <literal>(1 + 2) * 3</literal>.
      Happy allows these ambiguities to be resolved by
      specifying the <firstterm>precedences</firstterm> of the
      operators involved using directives in the
      header<footnote><para>Users of <literal>yacc</literal> will find
      this familiar: Happy's precedence scheme works in exactly the
      same way.</para></footnote>:</para>

<programlisting>
...
%right in
%left '+' '-'
%left '*' '/'
%%
...
</programlisting>
<indexterm><primary><literal>%left</literal> directive</primary></indexterm>
<indexterm><primary><literal>%right</literal> directive</primary></indexterm>
<indexterm><primary><literal>%nonassoc</literal> directive</primary></indexterm>

      <para>The <literal>%left</literal> or <literal>%right</literal>
      directive is followed by a list of terminals, and declares all
      these tokens to be left- or right-associative respectively.  The
      precedence of these tokens with respect to other tokens is
      established by the order of the <literal>%left</literal> and
      <literal>%right</literal> directives: earlier means lower
      precedence.  A higher precedence causes an operator to bind more
      tightly; in our example above, because <literal>'*'</literal>
      has a higher precedence than <literal>'+'</literal>, the
      expression <literal>1 + 2 * 3</literal> will parse as <literal>1
      + (2 * 3)</literal>.</para>

      <para>What happens when two operators have the same precedence?
      This is when the <firstterm>associativity</firstterm> comes into
      play.  Operators specified as left associative will cause
      expressions like <literal>1 + 2 - 3</literal> to parse as
      <literal>(1 + 2) - 3</literal>, whereas right-associative
      operators would parse as <literal>1 + (2 - 3)</literal>.  There
      is also a <literal>%nonassoc</literal> directive which indicates
      that the specified operators may not be used together.
      For
      example, if we add the comparison operators
      <literal>'&gt;'</literal> and <literal>'&lt;'</literal> to our
      grammar, then we would probably give their precedence as:</para>

<programlisting>...
%right in
%nonassoc '&gt;' '&lt;'
%left '+' '-'
%left '*' '/'
%%
...</programlisting>

      <para>which indicates that <literal>'&gt;'</literal> and
      <literal>'&lt;'</literal> bind less tightly than the other
      operators, and the non-associativity causes expressions such as
      <literal>1 &gt; 2 &gt; 3</literal> to be disallowed.</para>

      <sect2 id="how-precedence-works">
        <title>How precedence works</title>

        <para>The precedence directives, <literal>%left</literal>,
        <literal>%right</literal> and <literal>%nonassoc</literal>,
        assign precedence levels to the tokens in the declaration.  A
        rule in the grammar may also have a precedence: if the last
        terminal in the right hand side of the rule has a precedence,
        then this is the precedence of the whole rule.</para>

        <para>The precedences are used to resolve ambiguities in the
        grammar.
        If there is a shift/reduce conflict, then the
        precedences of the rule and the lookahead token are examined in
        order to resolve the conflict:</para>

        <itemizedlist>
          <listitem>
            <para>If the precedence of the rule is higher, then the
            conflict is resolved as a reduce.</para>
          </listitem>
          <listitem>
            <para>If the precedence of the lookahead token is higher,
            then the conflict is resolved as a shift.</para>
          </listitem>
          <listitem>
            <para>If the precedences are equal, then</para>
            <itemizedlist>
              <listitem>
                <para>If the token is left-associative, then reduce</para>
              </listitem>
              <listitem>
                <para>If the token is right-associative, then shift</para>
              </listitem>
              <listitem>
                <para>If the token is non-associative, then fail</para>
              </listitem>
            </itemizedlist>
          </listitem>
          <listitem>
            <para>If either the rule or the token has no precedence,
            then the default is to shift (these conflicts are reported
            by Happy, whereas ones that are automatically resolved by
            the precedence rules are not).</para>
          </listitem>
        </itemizedlist>
      </sect2>

      <sect2 id="context-precedence">
        <title>Context-dependent Precedence</title>

        <para>The precedence of an individual rule can be overridden,
        using <firstterm>context precedence</firstterm>.  This is
        useful when, for example, a particular token has a different
        precedence depending on the context.
        A common example is the
        minus sign: it has high precedence when used as prefix
        negation, but a lower precedence when used as binary
        subtraction.</para>

        <para>We can implement this in Happy as follows:</para>

<programlisting>%right in
%nonassoc '&gt;' '&lt;'
%left '+' '-'
%left '*' '/'
%left NEG
%%

Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
      | Exp '+' Exp             { Plus $1 $3 }
      | Exp '-' Exp             { Minus $1 $3 }
      | Exp '*' Exp             { Times $1 $3 }
      | Exp '/' Exp             { Div $1 $3 }
      | '(' Exp ')'             { Brack $2 }
      | '-' Exp %prec NEG       { Negate $2 }
      | int                     { Int $1 }
      | var                     { Var $1 }</programlisting>
<indexterm><primary><literal>%prec</literal> directive</primary></indexterm>

        <para>We invent a new token <literal>NEG</literal> as a
        placeholder for the precedence of our prefix negation rule.
        The <literal>NEG</literal> token doesn't need to appear in
        a <literal>%token</literal> directive.  The prefix negation
        rule has a <literal>%prec NEG</literal> directive attached,
        which overrides the default precedence for the rule (which
        would normally be the precedence of <literal>'-'</literal>) with the precedence
        of <literal>NEG</literal>.</para>
      </sect2>

      <sect2 id="shift-directive">
        <title>The %shift directive for lowest precedence rules</title>
        <para>
        Rules annotated with the <literal>%shift</literal> directive
        have the lowest possible precedence and are non-associative.
        A shift/reduce conflict that involves such a rule is resolved as a shift.

        One can think of <literal>%shift</literal> as
        <literal>%prec SHIFT</literal> such that <literal>SHIFT</literal>
        has lower precedence than any other token.
        </para>
        <para>
        This is useful in conjunction with
        <literal>%expect 0</literal> to explicitly point out all rules in the grammar that
        result in conflicts, and thereby resolve such conflicts.
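        As a hedged sketch, assuming a version of
        <application>Happy</application> that supports
        <literal>%shift</literal>, consider the classic
        dangling-<literal>else</literal> conflict.  The non-terminal
        <literal>Stmt</literal>, the tokens <literal>if</literal>,
        <literal>then</literal> and <literal>else</literal>, and the
        constructors <literal>If</literal> and <literal>IfElse</literal>
        are hypothetical and assumed to be declared elsewhere in the
        grammar file:
<programlisting>
%expect 0
%%

Stmt  : if Exp then Stmt %shift         { If $2 $4 }
      | if Exp then Stmt else Stmt      { IfElse $2 $4 $6 }
</programlisting>
        The annotation on the first rule records that, when the
        lookahead token is <literal>else</literal>, the parser should
        shift rather than reduce, so that an <literal>else</literal>
        attaches to the nearest <literal>if</literal>; with
        <literal>%expect 0</literal> in place, any conflict that has not
        been annotated in this way is still brought to your attention.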
895 </para> 896 </sect2> 897 898 </sect1> 899 900 <sect1 id="sec-type-signatures"> 901 <title>Type Signatures</title> 902 903 <indexterm> 904 <primary>type</primary> 905 <secondary>signatures in grammar</secondary> 906 </indexterm> 907 908 <para><application>Happy</application> allows you to include type signatures 909 in the grammar file itself, to indicate the type of each 910 production. This has several benefits:</para> 911 912 <itemizedlist> 913 <listitem> 914 <para> Documentation: including types in the grammar helps 915 to document the grammar for someone else (and indeed 916 yourself) reading the code.</para> 917 </listitem> 918 919 <listitem> 920 <para> Fixing type errors in the generated module can become 921 slightly easier if <application>Happy</application> has inserted type 922 signatures for you. This is a slightly dubious benefit, 923 since type errors in the generated module are still somewhat 924 difficult to find. </para> 925 </listitem> 926 927 <listitem> 928 <para> Type signatures generally help the Haskell compiler 929 to compile the parser faster. This is important when really 930 large grammar files are being used.</para> 931 </listitem> 932 </itemizedlist> 933 934 <para>The syntax for type signatures in the grammar file is as 935 follows:</para> 936 937<programlisting> 938stmts :: { [ Stmt ] } 939stmts : stmts stmt { $2 : $1 } 940 | stmt { [$1] } 941</programlisting> 942 943 <para>In fact, you can leave out the superfluous occurrence of 944 <literal>stmts</literal>:</para> 945 946<programlisting> 947stmts :: { [ Stmt ] } 948 : stmts stmt { $2 : $1 } 949 | stmt { [$1] } 950</programlisting> 951 952 <para>Note that currently, you have to include type signatures 953 for <emphasis>all</emphasis> the productions in the grammar to benefit 954 from the second and third points above. 
This is due to boring
      technical reasons, but it is hoped that this restriction can be
      removed in the future.</para>

      <para>It is possible to have productions with polymorphic or overloaded
      types. However, because the type of each production becomes the
      argument type of a constructor in an algebraic datatype in the
      generated source file, compiling the generated file requires a compiler
      that supports local universal quantification. GHC (with the
      <option>-fglasgow-exts</option> option) and Hugs are known to support
      this.</para>
    </sect1>

    <sect1 id="sec-monads">
      <title>Monadic Parsers</title>

      <indexterm>
        <primary>monadic</primary>
        <secondary>parsers</secondary>
      </indexterm>

      <para><application>Happy</application> has support for threading a monad
      through the generated parser. This might be useful for several
      reasons:</para>

      <itemizedlist>

        <listitem>
          <para> Handling parse errors
          <indexterm>
            <primary>parse errors</primary>
            <secondary>handling</secondary>
          </indexterm>
<!--      <indexterm>
            <primary>error</primary>
            <secondary>parse</secondary>
            <see>parse errors</see>
          </indexterm>
-->
          by using an exception monad
          (see <xref linkend="sec-exception"/>).</para>
        </listitem>

        <listitem>
          <para> Keeping track of line numbers
          <indexterm>
            <primary>line numbers</primary>
          </indexterm>
          in the input file, for
          example for use in error messages (see <xref
          linkend="sec-line-numbers"/>).</para>
        </listitem>

        <listitem>
          <para> Performing IO operations during parsing.</para>
        </listitem>

        <listitem>
          <para> Parsing languages with context dependencies (such as
          C), which requires some state in the parser.</para>
        </listitem>

</itemizedlist>

      <para>Adding monadic support to your parser couldn't be simpler.
      Just add the following directive to the declaration section of
      the grammar file:</para>

<programlisting>
%monad { &lt;type&gt; } [ { &lt;then&gt; } { &lt;return&gt; } ]
</programlisting>

      <indexterm>
        <primary><literal>%monad</literal></primary>
      </indexterm>

      <para>where <literal>&lt;type&gt;</literal> is the type constructor for
      the monad, <literal>&lt;then&gt;</literal> is the bind operation of the
      monad, and <literal>&lt;return&gt;</literal> is the return operation. If
      you leave out the names for the bind and return operations,
      <application>Happy</application> assumes that <literal>&lt;type&gt;</literal> is an
      instance of the standard Haskell type class <literal>Monad</literal> and
      uses the overloaded names for the bind and return
      operations.</para>

      <para>When this declaration is included in the grammar,
      <application>Happy</application> makes a couple of changes to the generated
      parser: the types of the main parser function and
      <literal>parseError</literal> (the function named in
      <literal>%error</literal>) become <literal>[Token] -> P a</literal> where
      <literal>P</literal> is the monad type constructor, and the function must
      be polymorphic in <literal>a</literal>. In other words,
      <application>Happy</application> adds an application of the
      <literal>&lt;return&gt;</literal> operation defined in the declaration
      above, around the result of the parser (<literal>parseError</literal> is
      affected because it must have the same return type as the
      parser). And that's all it does.</para>

      <para>This still isn't very useful: all you can do is return
      something of monadic type from <literal>parseError</literal>. How do you
      specify that the productions can also have type <literal>P a</literal>?
      Most of the time, you don't want a production to have this type:
      you'd have to write explicit <literal>returnP</literal>s everywhere.
1057 However, there may be a few rules in a grammar that need to get 1058 at the monad, so <application>Happy</application> has a special syntax for 1059 monadic actions:</para> 1060 1061<programlisting> 1062n : t_1 ... t_n {% <expr> } 1063</programlisting> 1064 1065 <indexterm> 1066 <primary>monadic</primary> 1067 <secondary>actions</secondary> 1068 </indexterm> 1069 <para>The <literal>%</literal> in the action indicates that this is a 1070 monadic action, with type <literal>P a</literal>, where <literal>a</literal> is 1071 the real return type of the production. When 1072 <application>Happy</application> reduces one of these rules, it evaluates the 1073 expression </para> 1074 1075<programlisting> 1076<expr> `then` \result -> <continue parsing> 1077</programlisting> 1078 1079 <para><application>Happy</application> uses <literal>result</literal> as the real 1080 semantic value of the production. During parsing, several 1081 monadic actions might be reduced, resulting in a sequence 1082 like</para> 1083 1084<programlisting> 1085<expr1> `then` \r1 -> 1086<expr2> `then` \r2 -> 1087... 1088return <expr3> 1089</programlisting> 1090 1091 <para>The monadic actions are performed in the order that they 1092 are <emphasis>reduced</emphasis>. If we consider the parse as a tree, 1093 then reductions happen in a depth-first left-to-right manner. 
1094 The great thing about adding a monad to your parser is that it 1095 doesn't impose any performance overhead for normal reductions - 1096 only the monadic ones are translated like this.</para> 1097 1098 <para>Take a look at the Haskell parser for a good illustration 1099 of how to use a monad in your parser: it contains examples of 1100 all the principles discussed in this section, namely parse 1101 errors, a threaded lexer, line/column numbers, and state 1102 communication between the parser and lexer.</para> 1103 1104 <para>The following sections consider a couple of uses for 1105 monadic parsers, and describe how to also thread the monad 1106 through the lexical analyser.</para> 1107 1108 <sect2 id="sec-exception"> 1109 <title>Handling Parse Errors</title> 1110 <indexterm> 1111 <primary>parse errors</primary> 1112 <secondary>handling</secondary> 1113 </indexterm> 1114 1115 <para>It's not very convenient to just call <literal>error</literal> when 1116 a parse error is detected: in a robust setting, you'd like the 1117 program to recover gracefully and report a useful error message 1118 to the user. Exceptions (of which errors are a special case) 1119 are normally implemented in Haskell by using an exception monad, 1120 something like:</para> 1121 1122<programlisting> 1123data E a = Ok a | Failed String 1124 1125thenE :: E a -> (a -> E b) -> E b 1126m `thenE` k = 1127 case m of 1128 Ok a -> k a 1129 Failed e -> Failed e 1130 1131returnE :: a -> E a 1132returnE a = Ok a 1133 1134failE :: String -> E a 1135failE err = Failed err 1136 1137catchE :: E a -> (String -> E a) -> E a 1138catchE m k = 1139 case m of 1140 Ok a -> Ok a 1141 Failed e -> k e 1142</programlisting> 1143 1144 <para>This monad just uses a string as the error type. 
The 1145 functions <literal>thenE</literal> and <literal>returnE</literal> are the usual 1146 bind and return operations of the monad, <literal>failE</literal> 1147 raises an error, and <literal>catchE</literal> is a combinator for 1148 handling exceptions.</para> 1149 1150 <para>We can add this monad to the parser with the declaration</para> 1151 1152<programlisting> 1153%monad { E } { thenE } { returnE } 1154</programlisting> 1155 1156 <para>Now, without changing the grammar, we can change the 1157 definition of <literal>parseError</literal> and have something sensible 1158 happen for a parse error:</para> 1159 1160<programlisting> 1161parseError tokens = failE "Parse error" 1162</programlisting> 1163 1164 <para>The parser now raises an exception in the monad instead 1165 of bombing out on a parse error.</para> 1166 1167 <para>We can also generate errors during parsing. There are 1168 times when it is more convenient to parse a more general 1169 language than that which is actually intended, and check it 1170 later. An example comes from Haskell, where the precedence 1171 values in infix declarations must be between 0 and 9:</para> 1172 1173<programlisting>prec :: { Int } 1174 : int {% if $1 < 0 || $1 > 9 1175 then failE "Precedence out of range" 1176 else returnE $1 1177 }</programlisting> 1178 1179 <para>The monadic action allows the check to be placed in the 1180 parser itself, where it belongs.</para> 1181 1182 </sect2> 1183 1184 <sect2 id="sec-lexers"> 1185 <title>Threaded Lexers</title> 1186 <indexterm> 1187 <primary>lexer, threaded</primary> 1188 </indexterm> 1189 <indexterm> 1190 <primary>monadic</primary> 1191 <secondary>lexer</secondary> 1192 </indexterm> 1193 1194 <para><application>Happy</application> allows the monad concept to be 1195 extended to the lexical analyser, too. 
This has several 1196 useful consequences:</para> 1197 1198 <itemizedlist> 1199 <listitem> 1200 <para> Lexical errors can be treated in the same way as 1201 parse errors, using an exception monad.</para> 1202 <indexterm> 1203 <primary>parse errors</primary> 1204 <secondary>lexical</secondary> 1205 </indexterm> 1206 </listitem> 1207 <listitem> 1208 <para> Information such as the current file and line 1209 number can be communicated between the lexer and 1210 parser. </para> 1211 </listitem> 1212 <listitem> 1213 <para> General state communication between the parser and 1214 lexer - for example, implementation of the Haskell layout 1215 rule requires this kind of interaction. 1216 </para> 1217 </listitem> 1218 <listitem> 1219 <para> IO operations can be performed in the lexer - this 1220 could be useful for following import/include declarations 1221 for instance.</para> 1222 </listitem> 1223 </itemizedlist> 1224 1225 <para>A monadic lexer is requested by adding the following 1226 declaration to the grammar file:</para> 1227 1228<programlisting> 1229%lexer { <lexer> } { <eof> } 1230</programlisting> 1231 1232 <indexterm> 1233 <primary><literal>%lexer</literal></primary> 1234 </indexterm> 1235 1236 <para>where <literal><lexer></literal> is the name of the lexical 1237 analyser function, and <literal><eof></literal> is a token that 1238 is to be treated as the end of file.</para> 1239 1240 <para>When using a monadic lexer, the parser no longer reads a 1241 list of tokens. Instead, it calls the lexical analysis 1242 function for each new token to be read. 
This has the side
        effect of eliminating the intermediate list of tokens, which
        is a slight performance win.</para>

        <para>The type of the main parser function is now just
        <literal>P a</literal>: the input is handled completely
        within the monad.</para>

        <para>The type of <literal>parseError</literal> becomes
        <literal>Token -> P a</literal>; that is, it takes Happy's
        current lookahead token as input. This is useful because
        the error function probably wants to report the token at which
        the parse error occurred, and otherwise the lexer would have
        to store this token in the monad.</para>

        <para>The lexical analysis function must have the following
        type:</para>

<programlisting>
lexer :: (Token -> P a) -> P a
</programlisting>

        <para>where <literal>P</literal> is the monad type constructor declared
        with <literal>%monad</literal>, and <literal>a</literal> can be replaced by the
        parser return type if desired.</para>

        <para>You can see from this type that the lexer takes a
        <emphasis>continuation</emphasis> as an argument. The lexer is to find
        the next token, and pass it to this continuation to carry on
        with the parse.
Obviously, we need to keep track of the input 1272 in the monad somehow, so that the lexer can do something 1273 different each time it's called!</para> 1274 1275 <para>Let's take the exception monad above, and extend it to 1276 add the input string so that we can use it with a threaded 1277 lexer.</para> 1278 1279<programlisting> 1280data ParseResult a = Ok a | Failed String 1281type P a = String -> ParseResult a 1282 1283thenP :: P a -> (a -> P b) -> P b 1284m `thenP` k = \s -> 1285 case m s of 1286 Ok a -> k a s 1287 Failed e -> Failed e 1288 1289returnP :: a -> P a 1290returnP a = \s -> Ok a 1291 1292failP :: String -> P a 1293failP err = \s -> Failed err 1294 1295catchP :: P a -> (String -> P a) -> P a 1296catchP m k = \s -> 1297 case m s of 1298 Ok a -> Ok a 1299 Failed e -> k e s 1300</programlisting> 1301 1302 <para>Notice that this isn't a real state monad - the input 1303 string just gets passed around, not returned. Our lexer will 1304 now look something like this:</para> 1305 1306<programlisting> 1307lexer :: (Token -> P a) -> P a 1308lexer cont s = 1309 ... lexical analysis code ... 1310 cont token s' 1311</programlisting> 1312 1313 <para>the lexer grabs the continuation and the input string, 1314 finds the next token <literal>token</literal>, and passes it together 1315 with the remaining input string <literal>s'</literal> to the 1316 continuation.</para> 1317 1318 <para>We can now indicate lexical errors by ignoring the 1319 continuation and calling <literal>failP "error message" s</literal> 1320 within the lexer (don't forget to pass the input string to 1321 make the types work out).</para> 1322 1323 <para>This may all seem a bit weird. Why, you ask, doesn't 1324 the lexer just have type <literal>P Token</literal>? It was 1325 done this way for performance reasons - this formulation 1326 sometimes means that you can use a reader monad instead of a 1327 state monad for <literal>P</literal>, and the reader monad 1328 might be faster. 
It's not at all clear that this reasoning 1329 still holds (or indeed ever held), and it's entirely possible 1330 that the use of a continuation here is just a 1331 misfeature.</para> 1332 1333 <para>If you want a lexer of type <literal>P Token</literal>, 1334 then just define a wrapper to deal with the 1335 continuation:</para> 1336 1337<programlisting> 1338lexwrap :: (Token -> P a) -> P a 1339lexwrap cont = real_lexer `thenP` \token -> cont token 1340</programlisting> 1341 1342 <sect3> 1343 <title>Monadic productions with %lexer</title> 1344 1345 <para>The <literal>{% ... }</literal> actions work fine with 1346 <literal>%lexer</literal>, but additionally there are two more 1347 forms which are useful in certain cases. Firstly:</para> 1348 1349<programlisting> 1350n : t_1 ... t_n {%^ <expr> } 1351</programlisting> 1352 1353 <para>In this case, <literal><expr></literal> has type 1354 <literal>Token -> P a</literal>. That is, Happy passes the 1355 current lookahead token to the monadic action 1356 <literal><expr></literal>. This is a useful way to get 1357 hold of Happy's current lookahead token without having to 1358 store it in the monad.</para> 1359 1360<programlisting> 1361n : t_1 ... t_n {%% <expr> } 1362</programlisting> 1363 1364 <para>This is a slight variant on the previous form. The type 1365 of <literal><expr></literal> is the same, but in this 1366 case the lookahead token is actually discarded and a new token 1367 is read from the input. 
This can be useful when you want to 1368 change the next token and continue parsing.</para> 1369 </sect3> 1370 </sect2> 1371 1372 <sect2 id="sec-line-numbers"> 1373 <title>Line Numbers</title> 1374 1375 <indexterm> 1376 <primary>line numbers</primary> 1377 </indexterm> 1378 1379 <indexterm> 1380 <primary><literal>%newline</literal></primary> 1381 </indexterm> 1382 <para>Previous versions of <application>Happy</application> had a 1383 <literal>%newline</literal> directive that enabled simple line numbers 1384 to be counted by the parser and referenced in the actions. We 1385 warned you that this facility may go away and be replaced by 1386 something more general, well guess what? :-)</para> 1387 1388 <para>Line numbers can now be dealt with quite 1389 straightforwardly using a monadic parser/lexer combination. 1390 Ok, we have to extend the monad a bit more:</para> 1391 1392<programlisting> 1393type LineNumber = Int 1394type P a = String -> LineNumber -> ParseResult a 1395 1396getLineNo :: P LineNumber 1397getLineNo = \s l -> Ok l 1398</programlisting> 1399 1400 <para>(the rest of the functions in the monad follow by just 1401 adding the extra line number argument in the same way as the 1402 input string). Again, the line number is just passed down, 1403 not returned: this is OK because of the continuation-based 1404 lexer that can change the line number and pass the new one to 1405 the continuation.</para> 1406 1407 <para>The lexer can now update the line number as follows:</para> 1408 1409<programlisting> 1410lexer cont s = 1411 case s of 1412 '\n':s -> \line -> lexer cont s (line + 1) 1413 ... rest of lexical analysis ... 1414</programlisting> 1415 1416 <para>It's as simple as that. 
Take a look at 1417 <application>Happy</application>'s own parser if you have the sources lying 1418 around, it uses a monad just like the one above.</para> 1419 1420 <para>Reporting the line number of a parse error is achieved 1421 by changing <literal>parseError</literal> to look something like 1422 this:</para> 1423 1424<programlisting> 1425parseError :: Token -> P a 1426parseError = getLineNo `thenP` \line -> 1427 failP (show line ++ ": parse error") 1428</programlisting> 1429 1430 <para>We can also get hold of the line number during parsing, 1431 to put it in the parsed data structure for future reference. 1432 A good way to do this is to have a production in the grammar 1433 that returns the current line number: </para> 1434 1435<programlisting>lineno :: { LineNumber } 1436 : {- empty -} {% getLineNo }</programlisting> 1437 1438 <para>The semantic value of <literal>lineno</literal> is the line 1439 number of the last token read - this will always be the token 1440 directly following the <literal>lineno</literal> symbol in the grammar, 1441 since <application>Happy</application> always keeps one lookahead token in 1442 reserve.</para> 1443 1444 </sect2> 1445 1446 <sect2 id="sec-monad-summary"> 1447 <title>Summary</title> 1448 1449 <para>The types of various functions related to the parser are 1450 dependent on what combination of <literal>%monad</literal> and 1451 <literal>%lexer</literal> directives are present in the grammar. For 1452 reference, we list those types here. In the following types, 1453 <emphasis>t</emphasis> is the return type of the 1454 parser. 
A type containing a type variable indicates that the 1455 specified function must be polymorphic.</para> 1456 1457 <indexterm> 1458 <primary>type</primary> 1459 <secondary>of <function>parseError</function></secondary> 1460 </indexterm> 1461 <indexterm> 1462 <primary>type</primary> 1463 <secondary>of parser</secondary> 1464 </indexterm> 1465 <indexterm> 1466 <primary>type</primary> 1467 <secondary>of lexer</secondary> 1468 </indexterm> 1469 1470 <itemizedlist> 1471 <listitem> 1472 <formalpara> 1473 <title> No <literal>%monad</literal> or 1474 <literal>%lexer</literal> </title> 1475 <para> 1476<programlisting> 1477parse :: [Token] -> <emphasis>t</emphasis> 1478parseError :: [Token] -> a 1479</programlisting> 1480</para> 1481 </formalpara> 1482 </listitem> 1483 1484 <listitem> 1485 <formalpara> 1486 <title> with <literal>%monad</literal> </title> 1487 <para> 1488<programlisting> 1489parse :: [Token] -> P <emphasis>t</emphasis> 1490parseError :: [Token] -> P a 1491</programlisting> 1492</para> 1493 </formalpara> 1494 </listitem> 1495 1496 1497 <listitem> 1498 <formalpara> 1499 <title> with <literal>%lexer</literal> </title> 1500 <para><programlisting> 1501parse :: T <emphasis>t</emphasis> 1502parseError :: Token -> T a 1503lexer :: (Token -> T a) -> T a 1504</programlisting> 1505where the type constructor <literal>T</literal> is whatever you want (usually <literal>T 1506a = String -> a</literal>). 
I'm not sure if this is useful, or even if it works 1507properly.</para> 1508 </formalpara> 1509 </listitem> 1510 1511 <listitem> 1512 <formalpara> 1513 <title> with <literal>%monad</literal> and <literal>%lexer</literal> </title> 1514 <para><programlisting> 1515parse :: P <emphasis>t</emphasis> 1516parseError :: Token -> P a 1517lexer :: (Token -> P a) -> P a 1518</programlisting> 1519</para> 1520 </formalpara> 1521 </listitem> 1522 </itemizedlist> 1523 1524 </sect2> 1525 </sect1> 1526 1527 <sect1 id="sec-error"> 1528 <title>The Error Token</title> 1529 <indexterm> 1530 <primary>error token</primary> 1531 </indexterm> 1532 1533 <para><application>Happy</application> supports a limited form of error 1534 recovery, using the special symbol <literal>error</literal> in a grammar 1535 file. When <application>Happy</application> finds a parse error during 1536 parsing, it automatically inserts the <literal>error</literal> symbol; if 1537 your grammar deals with <literal>error</literal> explicitly, then it can 1538 detect the error and carry on.</para> 1539 1540 <para>For example, the <application>Happy</application> grammar for Haskell 1541 uses error recovery to implement Haskell layout. The grammar 1542 has a rule that looks like this:</para> 1543 1544<programlisting> 1545close : '}' { () } 1546 | error { () } 1547</programlisting> 1548 1549 <para>This says that a close brace in a layout-indented context 1550 may be either a curly brace (inserted by the lexical analyser), 1551 or a parse error. 
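      As a rough illustration (a sketch only: the <literal>Decls</literal>
      non-terminal and the <literal>Let</literal> constructor are hypothetical
      names, not taken from the actual Haskell grammar), a production for a
      layout block could use <literal>close</literal> wherever the closing
      brace is expected:
<programlisting>
Exp : let '{' Decls close in Exp   { Let $3 $6 }
</programlisting>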
</para>

      <para>This rule is used to parse expressions like
      <literal>let x = e in e'</literal>: the layout system inserts an open brace before
      <literal>x</literal>, and the occurrence of the <literal>in</literal> symbol
      generates a parse error, which is interpreted as a close brace
      by the above rule.</para>

      <indexterm>
        <primary><application>yacc</application></primary>
      </indexterm>
      <para>Note for <literal>yacc</literal> users: this form of error recovery
      is strictly more limited than that provided by <literal>yacc</literal>.
      During a parse error condition, <literal>yacc</literal> attempts to
      discard states and tokens in order to get back into a state
      where parsing may continue; <application>Happy</application> doesn't do this.
      The reason is that normal <literal>yacc</literal> error recovery is
      notoriously hard to describe, and the semantics depend heavily
      on the workings of a shift-reduce parser. Furthermore,
      different implementations of <literal>yacc</literal> appear to implement
      error recovery differently. <application>Happy</application>'s limited error
      recovery, on the other hand, is well-defined, and is just
      sufficient to implement the Haskell layout rule (which is why it
      was added in the first place).</para>
    </sect1>

    <sect1 id="sec-multiple-parsers">
      <title>Generating Multiple Parsers From a Single Grammar</title>
      <indexterm>
        <primary>multiple parsers</primary>
      </indexterm>

      <para>It is often useful to use a single grammar to describe
      multiple parsers, where each parser has a different top-level
      non-terminal, but parts of the grammar are shared between
      parsers.
A classic example of this is an interpreter, which 1587 needs to be able to parse both entire files and single 1588 expressions: the expression grammar is likely to be identical 1589 for the two parsers, so we would like to use a single grammar 1590 but have two entry points.</para> 1591 1592 <para><application>Happy</application> lets you do this by 1593 allowing multiple <literal>%name</literal> directives in the 1594 grammar file. The <literal>%name</literal> directive takes an 1595 optional second parameter specifying the top-level 1596 non-terminal for this parser, so we may specify multiple parsers 1597 like so:</para> 1598 <indexterm><primary><literal>%name</literal> directive</primary> 1599 </indexterm> 1600 1601<programlisting> 1602%name parse1 non-terminal1 1603%name parse2 non-terminal2 1604</programlisting> 1605 1606 <para><application>Happy</application> will generate from this a 1607 module which defines two functions <function>parse1</function> 1608 and <function>parse2</function>, which parse the grammars given 1609 by <literal>non-terminal1</literal> and 1610 <literal>non-terminal2</literal> respectively. Each parsing 1611 function will of course have a different type, depending on the 1612 type of the appropriate non-terminal.</para> 1613 </sect1> 1614 1615 </chapter> 1616 1617 <chapter id="sec-glr"> 1618 1619 <chapterinfo> 1620 <copyright> 1621 <year>2004</year> 1622 <holder>University of Durham, Paul Callaghan, Ben Medlock</holder> 1623 </copyright> 1624 </chapterinfo> 1625 1626 <title>Generalized LR Parsing</title> 1627 1628 <para>This chapter explains how to use the GLR parsing extension, 1629 which allows <application>Happy</application> to parse ambiguous 1630 grammars and produce useful results. 1631 This extension is triggered with the <option>--glr</option> flag, 1632 which causes <application>Happy</application> 1633 to use a different driver for the LALR(1) parsing 1634 tables. 
The result of parsing is a structure which compactly encodes
      <emphasis>all</emphasis> of the possible parses.
      There are two options for how semantic information is combined with
      the structural information.
    </para>

    <para>
      This extension was developed by Paul Callaghan and Ben Medlock
      (University of Durham). It is based on the structural parser
      implemented in Medlock's undergraduate project, but significantly
      extended and improved by Callaghan.
      Bug reports, comments, questions etc. should be sent to
      <email>P.C.Callaghan@durham.ac.uk</email>.
      Further information can be found on Callaghan's
      <ulink url="http://www.dur.ac.uk/p.c.callaghan/happy-glr">GLR parser
      page</ulink>.
    </para>

    <sect1 id="sec-glr-intro">
      <title>Introduction</title>

      <para>
        Here's an ambiguous grammar. It has no information about the
        associativity of <literal>+</literal>, so for example,
        <literal>1+2+3</literal> can be parsed as
        <literal>(1+(2+3))</literal> or <literal>((1+2)+3)</literal>.
        In conventional mode, <application>Happy</application>
        would complain about a shift/reduce
        conflict, although it would generate a parser which always shifts
        in such a conflict, and hence would produce <emphasis>only</emphasis>
        the first alternative above.
      </para>

<programlisting>
E -> E + E
E -> i       -- any integer
</programlisting>

      <para>
        GLR parsing will accept this grammar without complaint, and produce
        a result which encodes <emphasis>both</emphasis> alternatives
        simultaneously. Now consider the more interesting example of
        <literal>1+2+3+4</literal>, which has five distinct parses -- try to
        list them! You will see that some of the subtrees are identical.
        A further property of the GLR output is that such sub-results are
        shared, hence efficiently represented: there is no combinatorial
        explosion.
        Below is the simplified output of the GLR parser for this example.
      </para>

<programlisting>
Root (0,7,G_E)
(0,1,G_E) => [[(0,1,Tok '1')]]
(0,3,G_E) => [[(0,1,G_E),(1,2,Tok '+'),(2,3,G_E)]]
(0,5,G_E) => [[(0,1,G_E),(1,2,Tok '+'),(2,5,G_E)]
             ,[(0,3,G_E),(3,4,Tok '+'),(4,5,G_E)]]
(0,7,G_E) => [[(0,3,G_E),(3,4,Tok '+'),(4,7,G_E)]
             ,[(0,1,G_E),(1,2,Tok '+'),(2,7,G_E)]
             ,[(0,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(2,3,G_E) => [[(2,3,Tok '2')]]
(2,5,G_E) => [[(2,3,G_E),(3,4,Tok '+'),(4,5,G_E)]]
(2,7,G_E) => [[(2,3,G_E),(3,4,Tok '+'),(4,7,G_E)]
             ,[(2,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(4,5,G_E) => [[(4,5,Tok '3')]]
(4,7,G_E) => [[(4,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(6,7,G_E) => [[(6,7,Tok '4')]]
</programlisting>

      <para>
        This is a directed, acyclic and-or graph.
        The node "names" are of the form <literal>(a,b,c)</literal>
        where <literal>a</literal> and <literal>b</literal>
        are the start and end points (as positions in the input string)
        and <literal>c</literal> is a category (or name of grammar rule).
        For example <literal>(2,7,G_E)</literal> spans positions 2 to 7
        and contains analyses which match the <literal>E</literal>
        grammar rule.
        Such analyses are given as a list of alternatives (disjunctions),
        each corresponding to some use of a production of that
        category, which in turn are a conjunction of sub-analyses,
        each represented as a node in the graph or an instance of a token.
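        To make the and-or reading concrete, the following is a small
        self-contained sketch in plain Haskell. The types
        <literal>Sym</literal>, <literal>Node</literal> and
        <literal>Forest</literal> and the function
        <literal>countTrees</literal> are hypothetical names chosen for
        illustration; they are not the types of the generated GLR code.
        The sketch transcribes the graph shown above and counts the trees
        it encodes without ever enumerating them:
<programlisting>
import qualified Data.Map as Map

-- Hypothetical, simplified types for illustration only; the code
-- generated by Happy defines its own forest representation.
data Sym = E | Tok Char
  deriving (Eq, Ord, Show)

type Node   = (Int, Int, Sym)        -- (start, end, category)
type Forest = Map.Map Node [[Node]]  -- alternatives, each a list of children

-- The graph shown above for 1+2+3+4. Tokens are leaves, so they have
-- no entry of their own in the map.
forest :: Forest
forest = Map.fromList
  [ ((0,1,E), [[(0,1,Tok '1')]])
  , ((0,3,E), [[(0,1,E),(1,2,Tok '+'),(2,3,E)]])
  , ((0,5,E), [[(0,1,E),(1,2,Tok '+'),(2,5,E)]
              ,[(0,3,E),(3,4,Tok '+'),(4,5,E)]])
  , ((0,7,E), [[(0,3,E),(3,4,Tok '+'),(4,7,E)]
              ,[(0,1,E),(1,2,Tok '+'),(2,7,E)]
              ,[(0,5,E),(5,6,Tok '+'),(6,7,E)]])
  , ((2,3,E), [[(2,3,Tok '2')]])
  , ((2,5,E), [[(2,3,E),(3,4,Tok '+'),(4,5,E)]])
  , ((2,7,E), [[(2,3,E),(3,4,Tok '+'),(4,7,E)]
              ,[(2,5,E),(5,6,Tok '+'),(6,7,E)]])
  , ((4,5,E), [[(4,5,Tok '3')]])
  , ((4,7,E), [[(4,5,E),(5,6,Tok '+'),(6,7,E)]])
  , ((6,7,E), [[(6,7,Tok '4')]])
  ]

-- Number of distinct trees below a node: a sum over the alternatives
-- (or) of the product over each alternative's children (and).
countTrees :: Forest -> Node -> Integer
countTrees f n =
  case Map.lookup n f of
    Nothing   -> 1   -- a token leaf
    Just alts -> sum (map (product . map (countTrees f)) alts)

main :: IO ()
main = print (countTrees forest (0,7,E))   -- prints 5
</programlisting>
        The ten entries in the map stand for all five parses of
        <literal>1+2+3+4</literal>, because shared nodes such as
        <literal>(4,7,G_E)</literal> are stored only once.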
      </para>

      <para>
        Hence <literal>(2,7,G_E)</literal> contains two alternatives,
        one which has <literal>(2,3,G_E)</literal> as its first child
        and the other with <literal>(2,5,G_E)</literal> as its first child,
        respectively corresponding to sub-analyses
        <literal>(2+(3+4))</literal> and <literal>((2+3)+4)</literal>.
        Both alternatives have the token <literal>+</literal> as their
        second child, but note that they are different occurrences of
        <literal>+</literal> in the input!
        We strongly recommend looking at such results in graphical form
        to understand these points. If you build the
        <literal>expr-eval</literal> example in the directory
        <literal>examples/glr</literal> (NB you need to use GHC for this,
        unless you know how to use the <option>-F</option> flag for Hugs),
        running the example will produce a file which can be viewed with
        the <emphasis>daVinci</emphasis> graph visualization tool.
        (See <ulink url="http://www.informatik.uni-bremen.de/~davinci/"/>
        for more information. Educational use licenses are currently
        available without charge.)
      </para>

      <para>
        The GLR extension also allows semantic information to be attached
        to productions, as in conventional <application>Happy</application>,
        although there are further issues to consider.
        Two modes are provided, one for simple applications and one for more
        complex use.
        See <xref linkend="sec-glr-semantics"/>.
        The extension is also integrated with <application>Happy</application>'s
        token handling, e.g. extraction of information from tokens.
      </para>

      <para>
        One key feature of this implementation in Haskell is that its main
        result is a <emphasis>graph</emphasis>.
        Other implementations effectively produce a list of trees, but this
        limits practical use to small examples.
      For large and interesting applications, some of which are discussed
      in <xref linkend="sec-glr-misc-applications"/>, a graph is essential due
      to the large number of possibilities and the need to analyse the
      structure of the ambiguity. Converting the graph to trees could produce
      huge numbers of results and will lose information about sharing etc.
    </para>

    <para>
      One final comment. You may have learnt through using
      <application>yacc</application>-style tools that ambiguous grammars
      are to be avoided, and that ambiguity is something that appears
      only in Natural Language processing.
      This is definitely not true.
      Many interesting grammars are ambiguous, and with GLR tools they
      can be used effectively.
      We hope you enjoy exploring this fascinating area!
    </para>

  </sect1>

  <sect1 id="sec-glr-using">
    <title>Basic use of a Happy-generated GLR parser</title>

    <para>
      This section explains how to generate and to use a GLR parser to
      produce structural results.
      Please check the examples for further information.
      Discussion of semantic issues comes later; see
      <xref linkend="sec-glr-semantics"/>.
    </para>

    <sect2 id="sec-glr-using-intro">
      <title>Overview</title>
      <para>
        The process of generating a GLR parser is broadly the same as
        for standard <application>Happy</application>. You write a grammar
        specification, run <application>Happy</application> on this to
        generate some Haskell code, then compile and link this into your
        program.
      </para>
      <para>
        An alternative to using Happy directly is to use the
        <ulink url="http://www.cs.chalmers.se/~markus/BNFC/">
        <application>BNF Converter</application></ulink> tool by
        Markus Forsberg, Peter Gammie, Michael Pellauer and Aarne Ranta.
        This tool creates an abstract syntax, grammar, pretty-printer
        and other useful items from a single grammar formalism, thus
        it saves a lot of work and improves maintainability.
        The current output of BNFC can be used with GLR mode now
        with just a few small changes, but from January 2005 we expect
        to have a fully-compatible version of BNFC.
      </para>
      <para>
        Most of the features of <application>Happy</application> still
        work, but note the important points below.
      </para>
      <variablelist>
        <varlistentry>
          <term>module header</term>
          <listitem>
            <para>
              The GLR parser is generated in two files, one for data and
              one for the driver. This is because the driver code needs
              to be optimized, but for large parsers with lots of data,
              optimizing the data tables too causes compilation to be
              too slow.
            </para>
            <para>
              Given a file <literal>Foo.y</literal>, the file
              <literal>FooData.hs</literal>, containing the data
              module, is generated with basic type information, the
              parser tables, and the header and tail code that was
              included in the parser specification. Note that
              <application>Happy</application> can automatically
              generate the necessary module declaration statements,
              if you do not choose to provide one in the grammar
              file. But, if you do choose to provide the module
              declaration statement, then the name of the module will
              be parsed and used as the name of the driver
              module. The parsed name will also be used to form the
              name of the data module, but with the string
              <literal>Data</literal> appended to it. The driver
              module, which is to be found in the file
              <literal>Foo.hs</literal>, will not contain any other
              user-supplied text besides the module name.
              Do not
              bother to supply any export declarations in your module
              declaration statement: they will be ignored and
              dropped, in favor of the standard export declaration.
            </para>

          </listitem>
        </varlistentry>
        <varlistentry>
          <term>export of lexer</term>
          <listitem>
            <para>
              You can declare a lexer (and error token) with the
              <literal>%lexer</literal> directive as normal, but the
              generated parser does NOT call this lexer automatically.
              The action of the directive is only to
              <emphasis>export</emphasis> the lexer function to the top
              level. This is because some applications need finer control
              of the lexing process.
            </para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term>precedence information</term>
          <listitem>
            <para>
              This still works, but note the reasons.
              The precedence and associativity declarations are used in
              <application>Happy</application>'s LR table creation to
              resolve certain conflicts. It does this by retaining the
              actions implied by the declarations and removing the ones
              which clash with these.
              The GLR parser back-end then produces code from these
              filtered tables, hence the rejected actions are never
              considered by the GLR parser.
            </para>
            <para>
              Hence, declaring precedence and associativity is still
              a good thing, since it avoids a certain amount of ambiguity
              that the user knows how to remove.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>monad directive</term>
          <listitem>
            <para>
              There is some support for monadic parsers.
              The "tree decoding" mode
              (see <xref linkend="sec-glr-semantics-tree"/>) can use the
              information given in the <literal>%monad</literal>
              declaration to monadify the decoding process.
              This is explained in more detail in
              <xref linkend="sec-glr-semantics-tree-monad"/>.
            </para>
            <para>
              <emphasis>Note</emphasis>: the generated parsers don't include
              Ashley Yakeley's monad context information yet. It is currently
              just ignored.
              If this is a problem, email and I'll make the changes required.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>parser name directive</term>
          <listitem>
            <para>
              This has no effect at present. It will probably remain this
              way: if you want to control names, you could use qualified
              import.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>type information on non-terminals</term>
          <listitem>
            <para>
              The generation of semantic code relies on type information
              given in the grammar specification. If you don't give an
              explicit signature, the type <literal>()</literal> is
              assumed. If you get type clashes mentioning
              <literal>()</literal> you may need to add type annotations.
              Similarly, if you don't supply code for the semantic rule
              portion, then the value <literal>()</literal> is used.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term><literal>error</literal> symbol in grammars, and recovery
          </term>
          <listitem>
            <para>
              No attempt has been made to implement this yet. Any use of
              <literal>error</literal> in grammars is thus ignored, and
              a parse error will eventually cause the parse to fail.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>the token type</term>
          <listitem>
            <para>
              The type used for tokens <emphasis>must</emphasis> be in
              the <literal>Ord</literal> type class (and hence in
              <literal>Eq</literal>), plus it is recommended that they
              are in the <literal>Show</literal> class too.
              The ordering is required for the implementation of
              ambiguity packing.
              It may be possible to relax this requirement, but it
              is probably simpler just to require instances of the type
              classes. Please tell us if this is a problem.
            </para>
          </listitem>
        </varlistentry>
      </variablelist>

    </sect2>

    <sect2 id="sec-glr-using-main">
      <title>The main function</title>
      <para>
        The driver file exports a function
        <literal>doParse :: [[UserDefTok]] -> GLRResult</literal>.
        If you are using several parsers, use qualified naming to
        distinguish them.
        <literal>UserDefTok</literal> is a synonym for the type declared with
        the <literal>%tokentype</literal> directive.
      </para>
    </sect2>

    <sect2 id="sec-glr-using-input">
      <title>The input</title>
      <para>
        The input to <literal>doParse</literal> is a list of
        <emphasis>list of</emphasis> token values.
        The outer level represents the sequence of input symbols, and
        the inner list represents ambiguity in the tokenisation of each
        input symbol.
        For example, the word "run" can be at least a noun or a verb,
        hence the inner list will contain at least two values.
        If your tokens are not ambiguous, you will need to convert each
        token to a singleton list before parsing.
      </para>
    </sect2>

    <sect2 id="sec-glr-using-output">
      <title>The Parse Result</title>
      <para>
        The parse result is expressed with the following types.
        A successful parse yields a forest (explained below) and a single
        root node for the forest.
        A parse may fail for one of two reasons: running out of input or
        a (global) parse error. A global parse error means that it was
        not possible to continue parsing <emphasis>any</emphasis> of the
        live alternatives; this is different from a local error, which simply
        means that the current alternative dies and we try some other
        alternative.
        In both error cases, the forest at the failure point is
        returned, since it may contain useful information.
        Unconsumed tokens are returned when there is a global parse error.
      </para>
<programlisting>
type ForestId = (Int,Int,GSymbol)
data GSymbol  = &lt;... automatically generated ...&gt;
type Forest   = FiniteMap ForestId [Branch]
type RootNode = ForestId
type Tokens   = [[(Int, GSymbol)]]
data Branch   = Branch {b_sem :: GSem, b_nodes :: [ForestId]}
data GSem     = &lt;... automatically generated ...&gt;

data GLRResult
  = ParseOK     RootNode Forest    -- forest with root
  | ParseError  Tokens   Forest    -- partial forest with bad input
  | ParseEOF             Forest    -- partial forest (missing input)
</programlisting>
      <para>
        Conceptually, the parse forest is a directed, acyclic and-or
        graph. It is represented by a mapping of <literal>ForestId</literal>s
        to lists of possible analyses. The <literal>FiniteMap</literal>
        type is used to provide efficient and convenient access.
        The <literal>ForestId</literal> type identifies nodes in the
        graph, named by the range of input they span and the category of
        analysis they license. <literal>GSymbol</literal> is generated
        automatically as a union of the names of grammar rules (prefixed
        by <literal>G_</literal> to avoid name clashes) and of tokens and
        an EOF symbol. Tokens are wrapped in the constructor
        <literal>HappyTok :: UserDefTok -> GSymbol</literal>.
      </para>
      <para>
        The <literal>Branch</literal> type represents a match for some
        right-hand side of a production, containing semantic information
        (see below)
        and a list of sub-analyses. Each of these is a node in the graph.
        Note that tokens are represented as childless nodes that span
        one input position. Empty productions will appear as childless nodes
        that start and end at the same position.
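        As a rough sketch of how a client might consume these types, the
        code below unpacks a forest into plain trees and summarises each
        kind of result. It does not use the generated modules: the type
        definitions are simplified local copies, with
        <literal>Data.Map</literal> assumed in place of
        <literal>FiniteMap</literal>, a three-constructor
        <literal>GSymbol</literal> and a unit-valued <literal>b_sem</literal>
        field, and the names <literal>Tree</literal>,
        <literal>allTrees</literal> and <literal>summary</literal> are
        hypothetical.
<programlisting>
import qualified Data.Map as Map

-- Simplified local copies of the result types (assumptions for this
-- sketch): the real GSymbol and GSem are generated by Happy.
data GSymbol = G_E | HappyTok Char | HappyEOF deriving (Eq, Ord, Show)
type ForestId = (Int, Int, GSymbol)
data Branch   = Branch { b_sem :: (), b_nodes :: [ForestId] }
type Forest   = Map.Map ForestId [Branch]
type Tokens   = [[(Int, GSymbol)]]

data GLRResult
  = ParseOK ForestId Forest
  | ParseError Tokens Forest
  | ParseEOF Forest

-- A plain tree, one per complete analysis.
data Tree = Node ForestId [Tree] deriving (Eq, Show)

-- Unpack every tree below a node. A node with no entry in the forest, or
-- a branch with no children, becomes a leaf. mapM in the list monad forms
-- all combinations of the children's trees.
allTrees :: Forest -> ForestId -> [Tree]
allTrees f n = case Map.lookup n f of
  Nothing -> [Node n []]
  Just bs -> concatMap (map (Node n) . mapM (allTrees f) . b_nodes) bs

-- Summarise a result, covering all three constructors.
summary :: GLRResult -> String
summary (ParseOK root f)  = show (length (allTrees f root)) ++ " analyses"
summary (ParseError ts _) = "parse error with " ++ show (length ts)
                            ++ " unconsumed input positions"
summary (ParseEOF _)      = "input ended too early"
</programlisting>
        Unpacking discards the sharing held in the forest, so an approach
        like this is only reasonable for small, mildly ambiguous inputs.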
      </para>
    </sect2>

    <sect2 id="sec-glr-using-compiling">
      <title>Compiling the parser</title>
      <para>
        <application>Happy</application> will generate two files, and these
        should be compiled as normal Haskell files.
        If speed is an issue, then you should use the <option>-O</option>
        flags etc. with the driver code, and if feasible, with the parser
        tables too.
      </para>
      <para>
        You can also use the <option>--ghc</option> flag to trigger certain
        <application>GHC</application>-specific optimizations. At present,
        this just causes use of unboxed types in the tables and in some key
        code.
        Using this flag causes relevant <application>GHC</application>
        option pragmas to be inserted into the generated code, so you shouldn't
        have to use any strange flags (unless you want to...).
      </para>
    </sect2>
  </sect1>

  <sect1 id="sec-glr-semantics">
    <title>Including semantic results</title>

    <para>
      This section discusses the options for including semantic information
      in grammars.
    </para>

    <sect2 id="sec-glr-semantics-intro">
      <title>Forms of semantics</title>
      <para>
        Semantic information may be attached to productions in the
        conventional way, but when more than one analysis is possible,
        the use of the semantic information must change.
        Two schemes have been implemented, which we call
        <emphasis>tree decoding</emphasis>
        and <emphasis>label decoding</emphasis>.
        The former is for simple applications, where there is not much
        ambiguity and hence where the effective unpacking of the parse
        forest isn't a factor. This mode is quite similar to the
        standard mode in <application>Happy</application>.
        The latter is for serious applications, where sharing is important
        and where processing of the forest (e.g. filtering) is needed.
        Here, the emphasis is on providing rich labels in nodes of
        the parse forest, to support such processing.
      </para>
      <para>
        The default mode is labelling. If you want the tree decode mode,
        use the <option>--decode</option> flag.
      </para>
    </sect2>

    <sect2 id="sec-glr-semantics-tree">
      <title>Tree decoding</title>
      <para>
        Tree decoding corresponds to unpacking the parse forest to individual
        trees and collecting the list of semantic results computed from
        each of these. It is a mode intended for simple applications,
        where there is limited ambiguity.
        You may access semantic results from components of a reduction
        using the dollar variables.
        As a working example, the following is taken from the
        <literal>expr-tree</literal> grammar in the examples.
        Note that the type signature is required, else the types in use
        can't be determined by the parser generator.
      </para>
<programlisting>
E :: {Int} -- type signature needed
  : E '+' E  { $1 + $3 }
  | E '*' E  { $1 * $3 }
  | i        { $1 }
</programlisting>
      <para>
        This mode works by converting each of the semantic rules into
        functions (abstracted over the dollar variables mentioned),
        and labelling each <literal>Branch</literal> created from a
        reduction of that rule with the function value.
        This amounts to <emphasis>delaying</emphasis> the action of the
        rule, since we must wait until we know the results of all of
        the sub-analyses before computing any of the results. (Certain
        cases of packing can add new analyses at a later stage.)
      </para>
      <para>
        At the end of parsing, the functions are applied across relevant
        sub-analyses via a recursive descent. The main interface to this
        is via the class and entry function below.
        Typically,
        <literal>decode</literal> should be called on the root of the
        forest, also supplying a function which maps node names to their
        list of analyses (typically a partial application of lookup in
        the forest value).
        The result is a list of semantic values.
        Note that the context of the call to <literal>decode</literal>
        should (eventually) supply a concrete type to allow selection
        of the appropriate instance. I.e., you have to indicate in some way
        what type the semantic result should have.
        <literal>Decode_Result a</literal> is a synonym generated by
        <application>Happy</application>: for non-monadic semantics,
        it is equivalent to <literal>a</literal>; when monads are
        in use, it becomes the declared monad type.
        See the full <literal>expr-eval</literal> example for more
        information.
      </para>
<programlisting>
class TreeDecode a where
        decode_b :: (ForestId -> [Branch]) -> Branch -> [Decode_Result a]
decode :: TreeDecode a => (ForestId -> [Branch]) -> ForestId -> [Decode_Result a]
</programlisting>

      <para>
        The GLR parser generator identifies the types involved in each
        semantic rule, hence the types of the functions, then creates
        a union containing distinct types. Values of this union are
        stored in the branches. (The union is actually a bit more complex:
        it must also distinguish patterns of dollar-variable usage, e.g.
        a function <literal>\x y -> x + y </literal> could be applied to
        the first and second constituents, or to the first and third.)
        The parser generator also creates instances of the
        <literal>TreeDecode</literal> class, which unpacks the semantic
        function and applies it across the decodings of the possible
        combinations of children. Effectively, it does a cartesian product
        operation across the lists of semantic results from each of the
        children.
        E.g. <literal>[1,2] "+" [3,4]</literal> produces
        <literal>[4,5,5,6]</literal>.
        Information is extracted from token values using the patterns
        supplied by the user when declaring tokens and their Haskell
        representation, so the dollar-dollar convention works also.
      </para>
      <para>
        The decoding process could be made more efficient by using
        memoisation techniques, but this hasn't been implemented since
        we believe the other (label) decoding mode is more useful. (If someone
        sends in a patch, we may include it in a future release -- but this
        might be tricky, e.g. require higher-order polymorphism?
        Plus, are there other ways of using this form of semantic function?)
      </para>
    </sect2>

    <sect2 id="sec-glr-semantics-label">
      <title>Label decoding</title>
      <para>
        The labelling mode aims to label branches in the forest with
        information that supports subsequent processing, for example
        the filtering and prioritisation of analyses prior to extraction
        of favoured solutions. As above, code fragments are given in
        braces and can contain dollar-variables. But these variables
        are expanded to node names in the graph, with the intention of
        easing navigation.
        The following grammar is from the <literal>expr-tree</literal>
        example.
      </para>
<programlisting>
E :: {Tree ForestId Int}
  : E '+' E  { Plus  $1 $3 }
  | E '*' E  { Times $1 $3 }
  | i        { Const $1 }
</programlisting>

      <para>
        Here, the semantic values provide more meaningful labels than
        the plain structural information. In particular, only the
        interesting parts of the branch are represented, and the
        programmer can clearly select or label the useful constituents
        if required.
        There is no need to remember that it is the first
        and third child in the branch which we need to extract, because
        the label only contains those values (the <quote>noise</quote> has been dropped).
        Consider also the difference between concrete and abstract syntax.
        The labels are oriented towards abstract syntax.
        Tokens are handled slightly differently here: when they appear
        as children in a reduction, their informational content can
        be extracted directly, hence the <literal>Const</literal> value
        above will be built with the <literal>Int</literal> value from
        the token, not some <literal>ForestId</literal>.
      </para>

      <para>
        Note the useful technique of making the label types polymorphic
        in the position used for forest indices. This allows replacement
        at a later stage with more appropriate values, e.g. inserting
        lists of actual subtrees from the final decoding.
      </para>
      <para>
        Use of these labels is supported by a type class
        <literal>LabelDecode</literal>, which unpacks values of the
        automatically-generated union type <literal>GSem</literal>
        to the original type(s). The parser generator will create
        appropriate instances of this class, based on the type information
        in the grammar file. (Note that omitting type information leads
        to a default of <literal>()</literal>.)
        Observe that use of the labels is often like traversing an abstract
        syntax, and the structure of the abstract syntax type usually
        constrains the types of constituents; so once the overall type
        is fixed (e.g. with a type cast or signature) then there are no
        problems with resolution of class instances.
      </para>

<programlisting>
class LabelDecode a where
        unpack :: GSem -> a
</programlisting>

      <para>
        Internally, the semantic values are packed in a union type as
        before, but there is no direct abstraction step.
        Instead, the
        <literal>ForestId</literal> values (from the dollar-variables)
        are bound when the corresponding branch is created from the
        list of constituent nodes. At this stage, token information
        is also extracted, using the patterns supplied by the user
        when declaring the tokens.
      </para>
    </sect2>

    <sect2 id="sec-glr-semantics-tree-monad">
      <title>Monadic tree decoding</title>
      <para>
        You can use the <literal>%monad</literal> directive in the
        tree-decode mode.
        Essentially, the decoding process now creates a list of monadic
        values, using the monad type declared in the directive.
        The default handling of the semantic functions is to apply the
        relevant <literal>return</literal> function to the value being
        returned. You can override this using the <literal>{% ... }</literal>
        convention. The declared <literal>(>>=)</literal> function is
        used to assemble the computations.
      </para>
      <para>
        Note that no attempt is made to share the results of monadic
        computations from sub-trees. (You could possibly do this by
        supplying a memoising lookup function for the decoding process.)
        Hence, the usual behaviour is that decoding produces whole
        monadic computations, each part of which is computed afresh
        (in depth-first order) when the whole is computed.
        You should therefore take care to initialise any relevant state
        before computing the results from multiple solutions.
      </para>
      <para>
        This facility is experimental, and we welcome comments or
        observations on the approach taken!
        An example is provided (<literal>examples/glr/expr-monad</literal>).
        It is the standard example of arithmetic expressions, except that
        the <literal>IO</literal> monad is used, and a user exception is
        thrown when the second argument to addition is an odd number.
        Running this example will show a zero (from the exception handler)
        instead of the expected number amongst the results from the other
        parses.
      </para>
    </sect2>
  </sect1>

  <sect1 id="sec-glr-misc">
    <title>Further information</title>

    <para>
      Other useful information...
    </para>

    <sect2 id="sec-glr-misc-examples">
      <title>The GLR examples</title>
      <para>
        The directory <literal>examples/glr</literal> contains several examples
        from the small to the large. Please consult these or use them as a
        base for your experiments.
      </para>
    </sect2>

    <sect2 id="sec-glr-misc-graphs">
      <title>Viewing forests as graphs</title>
      <para>
        If you run the examples with <application>GHC</application>, each
        run will produce a file <literal>out.daVinci</literal>. This is a
        graph in the format expected by the <emphasis>daVinci</emphasis>
        graph visualization tool.
        (See <ulink url="http://www.informatik.uni-bremen.de/~davinci/"/>
        for more information. Educational use licenses are currently
        available without charge.)
      </para>
      <para>
        We highly recommend looking at graphs of parse results - it really
        helps to understand the results.
        The graph files are created with Sven Panne's library for
        communicating with <emphasis>daVinci</emphasis>, supplemented
        with some extensions due to Callaghan. Copies of this code are
        included in the examples directory, for convenience.
        If you are trying to view large and complex graphs, contact Paul
        Callaghan (there are tools and techniques to make the graphs more
        manageable).
      </para>
    </sect2>

    <sect2 id="sec-glr-misc-applications">
      <title>Some Applications of GLR parsing</title>
      <para>
        GLR parsing (and related techniques) aren't just for badly written
        grammars or for things like natural language (NL) where ambiguity is
        inescapable. There are applications where ambiguity can represent
        possible alternatives in pattern-matching tasks, and the flexibility
        of these parsing techniques and the resulting graphs support deep
        analyses. Below, we briefly discuss some examples, a mixture from
        our recent work and from the literature.
      </para>

      <variablelist>
        <varlistentry>
          <term>Gene sequence analysis</term>
          <listitem>
            <para>
              Combinations of structures within gene sequences can be
              expressed as a grammar, for example a "start" combination
              followed by a "promoter" combination then the gene proper.
              A recent undergraduate project has used this GLR implementation
              to detect candidate matches in data, and then to filter these
              matches with a mixture of local and global information.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>Rhythmic structure in poetry</term>
          <listitem>
            <para>
              Rhythmic patterns in (English) poetry obey certain rules,
              and in more modern poetry can break rules in particular ways
              to achieve certain effects. The standard rhythmic patterns
              (e.g. iambic pentameter) can be encoded as a grammar, and
              deviations from the patterns also encoded as rules.
              The neutral reading can be parsed with this grammar, to
              give a forest of alternative matches. The forest can be
              analysed to give a preferred reading, and to highlight
              certain technical features of the poetry.
              An undergraduate project in Durham has used this implementation
              for this purpose, with promising results.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>Compilers -- instruction selection</term>
          <listitem>
            <para>
              Recent work has phrased the translation problem in
              compilers from intermediate representation to an
              instruction set for a given processor as a matching
              problem. Different constructs at the intermediate
              level can map to several combinations of machine
              instructions. This knowledge can be expressed as a
              grammar, and instances of the problem solved by
              parsing. The parse forest represents competing solutions,
              and allows selection of optimum solutions according
              to various measures.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>Robust parsing of ill-formed input</term>
          <listitem>
            <para>
              The extra flexibility of GLR parsing can simplify parsing
              of formal languages where a degree of <quote>informality</quote> is allowed,
              for example HTML parsing. Modern browsers contain complex
              parsers which are designed to try to extract useful information
              from HTML text which doesn't follow the rules precisely,
              e.g. missing start tags or missing end tags.
              HTML with missing tags can be written as an ambiguous grammar,
              and it should be a simple matter to extract a usable
              interpretation from a forest of parses.
              Notice the technique: we widen the scope of the grammar,
              parse with GLR, then extract a reasonable solution.
              This is arguably simpler than pushing an LR(1) or LL(1)
              parser past its limits, and also more maintainable.
            </para>
          </listitem>
        </varlistentry>
        <varlistentry>
          <term>Natural Language Processing</term>
          <listitem>
            <para>
              Ambiguity is inescapable in the syntax of most human languages.
              In realistic systems, parse forests are useful to encode
              competing analyses in an efficient way, and they also provide
              a framework for further analysis and disambiguation. Note
              that ambiguity can have many forms, from simple phrase
              attachment uncertainty to more subtle forms involving mixtures
              of word senses. If some degree of ungrammaticality is to be
              tolerated in a system, which can be done by extending the
              grammar with productions incorporating common forms of
              infelicity, the degree of ambiguity increases further. For
              systems used on arbitrary text, such as on newspapers,
              it is not uncommon that many sentences permit several
              hundred or more analyses. With such grammars, parse forest
              techniques are essential.
              Many recent NLP systems use such techniques, including
              Durham's earlier LOLITA system - which was mostly
              written in Haskell.
            </para>
          </listitem>
        </varlistentry>
      </variablelist>
    </sect2>

    <sect2 id="sec-glr-misc-workings">
      <title>Technical details</title>
      <para>
        The original implementation was developed by Ben Medlock,
        as his undergraduate final year project,
        using ideas from Peter Ljungloef's Licentiate thesis
        (see <ulink url="http://www.cs.chalmers.se/~peb/parsing"/>, and
        we recommend the thesis for its clear analysis of parsing
        algorithms).
        Ljungloef's version produces lists of parse trees, but Medlock
        adapted this to produce an explicit graph containing parse structure
        information. He also incorporated
        the code into <application>Happy</application>.
      </para>

      <para>
        After Medlock's graduation, Callaghan extended the code to
        incorporate semantic information, and made several improvements
        to the original code, such as improved local packing and
        support for hidden left recursion.
        The performance of the
        code was significantly improved, after changes of representation
        (e.g. to a chart-style data structure)
        and technique. Medlock's code was also used in several student
        projects, including analysis of gene sequences (Fischer) and
        analysis of rhythmic patterns in poetry (Henderson).
      </para>

      <para>
        The current code implements the standard GLR algorithm extended
        to handle hidden left recursion. Such recursion, as in the grammar
        below from Rekers [1992], causes the standard algorithm to loop
        because the empty reduction <literal>A -> </literal> is always
        possible and the LR parser will not change state. Alternatively,
        there is a problem because an unknown (at the start of parsing)
        number of <literal>A</literal>
        items are required, to match the number of <literal>i</literal>
        tokens in the input.
      </para>
<programlisting>
S -> A Q i | +
A ->
</programlisting>
      <para>
        The solution to this is not surprising. Problematic recursions
        are detected as zero-span reductions in a state which has a
        <literal>goto</literal> table entry looping to itself. A special
        symbol is pushed to the stack on the first such reduction,
        and such reductions are done at most once for any token
        alternative for any input position.
        When popping from the stack, if the last token being popped
        is such a special symbol, then two stack tails are returned: one
        corresponding to a conventional pop (which removes the
        symbol) and the other to a duplication of the special symbol
        (the stack is not changed, but a copy of the symbol is returned).
        This allows sufficient copies of the empty symbol to appear
        on some stack, hence allowing the parse to complete.
      </para>

      <para>
        The forest is held in a chart-style data structure, and this supports
        local ambiguity packing (chart parsing is discussed in Ljungloef's
        thesis, among other places).
        A limited amount of packing of live stacks is also done, to avoid
        some repetition of work.
      </para>

      <para>
        [Rekers 1992] Parser Generation for Interactive Environments,
        PhD thesis, University of Amsterdam, 1992.
      </para>
    </sect2>

    <sect2 id="sec-glr-misc-filter">
      <title>The <option>--filter</option> option</title>
      <para>
        You might have noticed this GLR-related option. It is an experimental
        feature intended to restrict the amount of structure retained in the
        forest by discarding everything not required for the semantic
        results. It may or may not work, and may be fixed in a future
        release.
      </para>
    </sect2>

    <sect2 id="sec-glr-misc-limitations">
      <title>Limitations and future work</title>
      <para>
        The parser supports hidden left recursion, but makes no attempt
        to handle cyclic grammars that have rules which do not consume any
        input. If you have a grammar like this, for example with rules like
        <literal>S -> S</literal> or
        <literal>S -> A S | x; A -> empty</literal>, the implementation will
        loop until you run out of stack, although when this happens, it
        usually happens quite quickly!
      </para>
      <para>
        The code has been used and tested frequently over the past few years,
        including being used in several undergraduate projects. It should be
        fairly stable, but as usual, can't be guaranteed bug-free. One day
        I will write it in Epigram!
      </para>
      <para>
        If you have suggestions for improvements, or requests for features,
        please contact Paul
        Callaghan. There are some changes I am considering, and some
        views and/or encouragement from users will be much appreciated.
2541 Further information can be found on Callaghan's 2542 <ulink url="http://www.dur.ac.uk/p.c.callaghan/happy-glr">GLR parser 2543 page</ulink>. 2544 </para> 2545 </sect2> 2546 2547 <sect2 id="sec-glr-misc-acknowledgements"> 2548 <title>Thanks and acknowledgements</title> 2549 <para> 2550 Many thanks to the people who have used and tested this software 2551 in its various forms, including Julia Fischer, James Henderson, and 2552 Aarne Ranta. 2553 </para> 2554 </sect2> 2555 </sect1> 2556 </chapter> 2557 2558<!-- Attribute Grammars ================================================= --> 2559 <chapter id="sec-AttributeGrammar"> 2560 <title>Attribute Grammars</title> 2561 2562 <sect1 id="sec-introAttributeGrammars"> 2563 <title>Introduction</title> 2564 2565 <para>Attribute grammars are a formalism for expressing syntax directed 2566 translation of a context-free grammar. An introduction to attribute grammars 2567 may be found <ulink 2568 url="http://www-rocq.inria.fr/oscar/www/fnc2/manual/node32.html">here</ulink>. 2569 There is also an article in the Monad Reader about attribute grammars and a 2570 different approach to attribute grammars using Haskell 2571 <ulink url="http://www.haskell.org/haskellwiki/The_Monad.Reader/Issue4/Why_Attribute_Grammars_Matter">here</ulink>. 2572 </para> 2573 2574 <para> 2575 The main practical difficulty that has prevented attribute grammars from 2576 gaining widespread use involves evaluating the attributes. Attribute grammars 2577 generate non-trivial data dependency graphs that are difficult to evaluate 2578 using mainstream languages and techniques. The solutions generally involve 2579 restricting the form of the grammars or using big hammers like topological sorts. 2580 However, a language which supports lazy evaluation, such as Haskell, has no 2581 problem forming complex data dependency graphs and evaluating them. 
        The primary
        intellectual barrier to attribute grammar adoption seems to stem from the fact that
        most programmers have difficulty with the declarative nature of the
        specification. Haskell programmers, on the other hand, have already
        embraced a purely functional language. In short, the Haskell language and
        community seem like a perfect place to experiment with attribute grammars.
      </para>

      <para>
        Embedding attribute grammars in Happy is easy because Haskell supports
        three important features: higher-order functions, labeled records, and
        lazy evaluation. Attributes are encoded as fields in a labeled record. The parse
        result of each non-terminal in the grammar is a function which takes a record
        of inherited attributes and returns a record of synthesized attributes. In each
        production, the attributes of various non-terminals are bound together using
        <literal>let</literal>.
        Finally, at the end of the parse, a distinguished attribute is evaluated to be
        the final result. Lazy evaluation takes care of evaluating each attribute in the
        correct order, resulting in an attribute grammar system that is capable of evaluating
        a fairly large class of attribute grammars.
      </para>

      <para>
        Attribute grammars in Happy do not use any language extensions, so the
        parsers are Haskell 98 (assuming you don't use the GHC-specific -g option).
        Currently, attribute grammars cannot be generated for GLR parsers (it is not
        yet clear how these features should interact).
      </para>

    </sect1>

    <sect1 id="sec-AtrributeGrammarsInHappy">
      <title>Attribute Grammars in Happy</title>

      <sect2 id="sec-declaringAttributes">
        <title>Declaring Attributes</title>

        <para>
          The presence of one or more <literal>%attribute</literal> directives indicates
          that a grammar is an attribute grammar.
Attributes are calculated properties 2621 that are associated with the non-terminals in a parse tree. Each 2622 <literal>%attribute</literal> directive generates a field in the attributes 2623 record with the given name and type. 2624 </para> 2625 2626 <para> 2627 The first <literal>%attribute</literal> 2628 directive in a grammar defines the default attribute. The 2629 default attribute is distinguished in two ways: 1) if no attribute specifier is 2630 given on an attribute reference, 2631 the default attribute is assumed (see <xref linkend="sec-semanticRules"/>) 2632 and 2) the value for the default attribute of the starting non-terminal becomes the 2633 return value of the parse. 2634 </para> 2635 2636 <para> 2637 Optionally, one may specify a type declaration for the attribute record using 2638 the <literal>%attributetype</literal> declaration. This allows you to define the 2639 type given to the attribute record and, more importantly, allows you to introduce 2640 type variables that can be subsequently used in <literal>%attribute</literal> 2641 declarations. If the <literal>%attributetype</literal> directive is given without 2642 any <literal>%attribute</literal> declarations, then the <literal>%attributetype</literal> 2643 declaration has no effect. 2644 </para> 2645 2646 <para> 2647 For example, the following declarations: 2648 </para> 2649 2650<programlisting> 2651%attributetype { MyAttributes a } 2652%attribute value { a } 2653%attribute num { Int } 2654%attribute label { String } 2655</programlisting> 2656 2657 <para> 2658 would generate this attribute record declaration in the parser: 2659 </para> 2660 2661<programlisting> 2662data MyAttributes a = 2663 HappyAttributes { 2664 value :: a, 2665 num :: Int, 2666 label :: String 2667 } 2668</programlisting> 2669 2670 <para> 2671 and <literal>value</literal> would be the default attribute. 
2672 </para> 2673 2674 </sect2> 2675 2676 <sect2 id="sec-semanticRules"> 2677 <title>Semantic Rules</title> 2678 2679 <para>In an ordinary Happy grammar, a production consists of a list 2680 of terminals and/or non-terminals followed by an uninterpreted 2681 code fragment enclosed in braces. With an attribute grammar, the 2682 format is very similar, but the braces enclose a set of semantic rules 2683 rather than uninterpreted Haskell code. Each semantic rule is either 2684 an attribute calculation or a conditional, and rules are separated by 2685 semicolons<footnote><para>Note that semantic rules must not rely on 2686 layout, because whitespace alignment is not guaranteed to be 2687 preserved</para></footnote>. 2688 </para> 2689 2690 <para> 2691 Both attribute calculations and conditionals may contain attribute references 2692 and/or terminal references. Just like regular Happy grammars, the tokens 2693 <literal>$1</literal> through <literal>$<n></literal>, where 2694 <literal>n</literal> is the number of symbols in the production, refer to 2695 subtrees of the parse. If the referenced symbol is a terminal, then the 2696 value of the reference is just the value of the terminal, the same way as 2697 in a regular Happy grammar. If the referenced symbol is a non-terminal, 2698 then the reference may be followed by an attribute specifier, which is 2699 a dot followed by an attribute name. If the attribute specifier is omitted, 2700 then the default attribute is assumed (the default attribute is the first 2701 attribute appearing in an <literal>%attribute</literal> declaration). 2702 The special reference <literal>$$</literal> references the 2703 attributes of the current node in the parse tree; it behaves exactly 2704 like the numbered references. Additionally, the reference <literal>$></literal> 2705 always references the rightmost symbol in the production. 
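          As a hedged illustration of these reference forms (a sketch only:
          the non-terminal <literal>list</literal>, the terminal
          <literal>x</literal> and the attributes <literal>value</literal>,
          <literal>len</literal> and <literal>depth</literal> are hypothetical,
          with <literal>value</literal> assumed to be declared first and hence
          the default attribute), a production might be written as follows.
          Here <literal>$1</literal> is the value of the terminal,
          <literal>$></literal> is the default attribute of the rightmost
          symbol (which is <literal>$2</literal> in this production), and
          <literal>$2.len</literal> and <literal>$$.depth</literal> select
          named attributes.
<programlisting>
list
   : x list
        { $$       = $1 : $>
        ; $$.len   = $2.len + 1
        ; $2.depth = $$.depth + 1
        }
   |    { $$ = []; $$.len = 0 }
</programlisting>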
2706 </para> 2707 2708 <para> 2709 An attribute calculation rule is of the form: 2710 </para> 2711<programlisting> 2712<attribute reference> = <Haskell expression> 2713</programlisting> 2714 <para> 2715 A rule of this form defines the value of an attribute, possibly as a function 2716 of the attributes of <literal>$$</literal> (inherited attributes), the attributes 2717 of non-terminals in the production (synthesized attributes), or the values of 2718 terminals in the production. The value for an attribute can only 2719 be defined once for a particular production. 2720 </para> 2721 2722 <para> 2723 The following rule calculates the default attribute of the current production in 2724 terms of the first and second items of the production (a synthesized attribute): 2725 </para> 2726<programlisting> 2727$$ = $1 : $2 2728</programlisting> 2729 2730 <para> 2731 This rule calculates the length attribute of a non-terminal in terms of the 2732 length of the current non-terminal (an inherited attribute): 2733 </para> 2734<programlisting> 2735$1.length = $$.length + 1 2736</programlisting> 2737 2738 <para> 2739 Conditional rules allow the rejection of strings due to context-sensitive properties. 2740 All conditional rules have the form: 2741 </para> 2742<programlisting> 2743where <Haskell expression> 2744</programlisting> 2745 <para> 2746 For non-monadic parsers, all conditional expressions 2747 must be of the same (monomorphic) type. At 2748 the end of the parse, the conditionals will be reduced using 2749 <literal>seq</literal>, which gives the grammar an opportunity to call 2750 <literal>error</literal> with an informative message. For monadic parsers, 2751 all conditional statements must have type <literal>Monad m => m ()</literal> where 2752 <literal>m</literal> is the monad in which the parser operates. All conditionals 2753 will be sequenced at the end of the parse, which allows the conditionals to call 2754 <literal>fail</literal> with an informative message. 
2755 </para> 2756 2757 <para> 2758 The following conditional rule will cause the (non-monadic) parser to fail 2759 if the inherited length attribute is not 0. 2760 </para> 2761<programlisting> 2762where if $$.length == 0 then () else error "length not equal to 0" 2763</programlisting> 2764 2765 <para> 2766 This conditional is the monadic equivalent: 2767 </para> 2768<programlisting> 2769where unless ($$.length == 0) (fail "length not equal to 0") 2770</programlisting> 2771 2772 2773 </sect2> 2774 </sect1> 2775 2776 <sect1 id="sec-AttrGrammarLimits"> 2777 <title>Limits of Happy Attribute Grammars</title> 2778 2779 <para> 2780 If you are not careful, you can write an attribute grammar which fails to 2781 terminate. This generally happens when semantic rules 2782 are written which cause a circular dependency on the value of 2783 an attribute. Even if the value of the attribute is well-defined (that is, 2784 if a fixpoint calculation over attribute values will eventually converge to 2785 a unique solution), this attribute grammar system will not evaluate such 2786 grammars. 2787 </para> 2788 <para> 2789 One practical way to overcome this limitation is to ensure that each attribute 2790 is always used in either a top-down (inherited) fashion or in a bottom-up 2791 (synthesized) fashion. If the calculations are sufficiently lazy, one can 2792 "tie the knot" by synthesizing a value in one attribute, and then assigning 2793 that value to another, inherited attribute at some point in the parse tree. 2794 This technique can be useful for common tasks like building symbol tables for 2795 a syntactic scope and making that table available to sub-nodes of the parse. 2796 </para> 2797 </sect1> 2798 2799 2800 <sect1 id="sec-AttributeGrammarExample"> 2801 <title>Example Attribute Grammars</title> 2802 <para> 2803 The following two toy attribute grammars may prove instructive. 
The first is 2804 an attribute grammar for the classic context-sensitive grammar 2805 { a^n b^n c^n | n >= 0 }. It demonstrates the use of conditionals, 2806 inherited and synthesized attributes. 2807 </para> 2808 2809<programlisting> 2810{ 2811module ABCParser (parse) where 2812} 2813 2814%tokentype { Char } 2815 2816%token a { 'a' } 2817%token b { 'b' } 2818%token c { 'c' } 2819%token newline { '\n' } 2820 2821%attributetype { Attrs a } 2822%attribute value { a } 2823%attribute len { Int } 2824 2825%name parse abcstring 2826 2827%% 2828 2829abcstring 2830 : alist blist clist newline 2831 { $$ = $1 ++ $2 ++ $3 2832 ; $2.len = $1.len 2833 ; $3.len = $1.len 2834 } 2835 2836alist 2837 : a alist 2838 { $$ = $1 : $2 2839 ; $$.len = $2.len + 1 2840 } 2841 | { $$ = []; $$.len = 0 } 2842 2843blist 2844 : b blist 2845 { $$ = $1 : $2 2846 ; $2.len = $$.len - 1 2847 } 2848 | { $$ = [] 2849 ; where failUnless ($$.len == 0) "blist wrong length" 2850 } 2851 2852clist 2853 : c clist 2854 { $$ = $1 : $2 2855 ; $2.len = $$.len - 1 2856 } 2857 | { $$ = [] 2858 ; where failUnless ($$.len == 0) "clist wrong length" 2859 } 2860 2861{ 2862happyError = error "parse error" 2863failUnless b msg = if b then () else error msg 2864} 2865</programlisting> 2866 2867<para> 2868This grammar parses binary numbers and 2869calculates their value. It demonstrates 2870the use of inherited and synthesized attributes. 
2871</para> 2872 2873 2874<programlisting> 2875{ 2876module BitsParser (parse) where 2877} 2878 2879%tokentype { Char } 2880 2881%token minus { '-' } 2882%token plus { '+' } 2883%token one { '1' } 2884%token zero { '0' } 2885%token newline { '\n' } 2886 2887%attributetype { Attrs } 2888%attribute value { Integer } 2889%attribute pos { Int } 2890 2891%name parse start 2892 2893%% 2894 2895start 2896 : num newline { $$ = $1 } 2897 2898num 2899 : bits { $$ = $1 ; $1.pos = 0 } 2900 | plus bits { $$ = $2 ; $2.pos = 0 } 2901 | minus bits { $$ = negate $2; $2.pos = 0 } 2902 2903bits 2904 : bit { $$ = $1 2905 ; $1.pos = $$.pos 2906 } 2907 2908 | bits bit { $$ = $1 + $2 2909 ; $1.pos = $$.pos + 1 2910 ; $2.pos = $$.pos 2911 } 2912 2913bit 2914 : zero { $$ = 0 } 2915 | one { $$ = 2^($$.pos) } 2916 2917{ 2918happyError = error "parse error" 2919} 2920</programlisting> 2921 2922 2923 </sect1> 2924 2925 </chapter> 2926 2927<!-- Invoking ============================================================ --> 2928 2929 <chapter id="sec-invoking"> 2930 <title>Invoking <application>Happy</application></title> 2931 2932 <para>An invocation of <application>Happy</application> has the following syntax:</para> 2933 2934<screen>$ happy [ <emphasis>options</emphasis> ] <emphasis>filename</emphasis> [ <emphasis>options</emphasis> ]</screen> 2935 2936 <para>All the command line options are optional (!) and may occur 2937 either before or after the input file name. Options that take 2938 arguments may be given multiple times, and the last occurrence 2939 will be the value used.</para> 2940 2941 <para>There are two types of grammar files, 2942 <filename>file.y</filename> and <filename>file.ly</filename>, with 2943 the latter observing the reverse comment (or literate) convention 2944 (i.e. each code line must begin with the character 2945 <literal>></literal>, lines which don't begin with 2946 <literal>></literal> are treated as comments). 
The examples 2947 distributed with <application>Happy</application> are all of the 2948 .ly form.</para> 2949 <indexterm> 2950 <primary>literate grammar files</primary> 2951 </indexterm> 2952 2953 <para>The flags accepted by <application>Happy</application> are as follows:</para> 2954 2955 <variablelist> 2956 2957 <varlistentry> 2958 <term><option>-o</option> <replaceable>file</replaceable></term> 2959 <term><option>--outfile</option>=<replaceable>file</replaceable></term> 2960 <listitem> 2961 <para>Specifies the destination of the generated parser module. 2962 If omitted, the parser will be placed in 2963 <replaceable>file</replaceable><literal>.hs</literal>, 2964 where <replaceable>file</replaceable> is the name of the input 2965 file with any extension removed.</para> 2966 </listitem> 2967 </varlistentry> 2968 2969 <varlistentry> 2970 <term><option>-i</option><optional><replaceable>file</replaceable></optional></term> 2971 <term><option>--info</option><optional>=<replaceable>file</replaceable></optional></term> 2972 <listitem> 2973 <indexterm> 2974 <primary>info file</primary> 2975 </indexterm> 2976 <para> Directs <application>Happy</application> to produce an info file 2977 containing detailed information about the grammar, parser 2978 states, parser actions, and conflicts. Info files are vital 2979 during the debugging of grammars. 
          The filename argument is
          optional (note that there's no space between
          <literal>-i</literal> and the filename in the short
          version), and if omitted the info file will be written to
          <replaceable>file</replaceable><literal>.info</literal> (where
          <replaceable>file</replaceable> is the input file name with any
          extension removed).</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-p</option><optional><replaceable>file</replaceable></optional></term>
        <term><option>--pretty</option><optional>=<replaceable>file</replaceable></optional></term>
        <listitem>
          <indexterm>
            <primary>pretty print</primary>
          </indexterm>
          <para> Directs <application>Happy</application> to produce a file
          containing a pretty-printed form of the grammar that contains only
          the productions, without any semantic actions or type signatures.
          If no file name is provided, then the file name will be computed
          by replacing the extension of the input file with
          <literal>.grammar</literal>.
          </para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-t</option> <replaceable>dir</replaceable></term>
        <term><option>--template</option>=<replaceable>dir</replaceable></term>
        <listitem>
          <indexterm>
            <primary>template files</primary>
          </indexterm>
          <para>Instructs <application>Happy</application> to use this directory
          when looking for template files: these files contain the
          static code that <application>Happy</application> includes in every
          generated parser.
You shouldn't need to use this option if 3019 <application>Happy</application> is properly configured for your 3020 computer.</para> 3021 </listitem> 3022 </varlistentry> 3023 3024 <varlistentry> 3025 <term><option>-m</option> <replaceable>name</replaceable></term> 3026 <term><option>--magic-name</option>=<replaceable>name</replaceable></term> 3027 <listitem> 3028 <para> <application>Happy</application> prefixes all the symbols it uses internally 3029 with either <literal>happy</literal> or <literal>Happy</literal>. To use a 3030 different string, for example if the use of <literal>happy</literal> 3031 is conflicting with one of your own functions, specify the 3032 prefix using the <option>-m</option> option.</para> 3033 </listitem> 3034 </varlistentry> 3035 3036 <varlistentry> 3037 <term><option>-s</option></term> 3038 <term><option>--strict</option></term> 3039 <listitem> 3040 <para>NOTE: the <option>--strict</option> option is 3041 experimental and may cause unpredictable results.</para> 3042 3043 <para>This option causes the right hand side of each 3044 production (the semantic value) to be evaluated eagerly at 3045 the moment the production is reduced. If the lazy behaviour 3046 is not required, then using this option will improve 3047 performance and may reduce space leaks. 
          Note that the
          parser as a whole is never lazy - the whole input will
          always be consumed before any output is produced, regardless
          of the setting of the <option>--strict</option> flag.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-g</option></term>
        <term><option>--ghc</option></term>
        <listitem>
          <indexterm>
            <primary>GHC</primary>
          </indexterm>
          <indexterm>
            <primary>back-ends</primary>
            <secondary>GHC</secondary>
          </indexterm>
          <para>Instructs <application>Happy</application> to generate a parser
          that uses GHC-specific extensions to obtain faster code.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-c</option></term>
        <term><option>--coerce</option></term>
        <listitem>
          <indexterm>
            <primary>coerce</primary>
          </indexterm>
          <indexterm>
            <primary>back-ends</primary>
            <secondary>coerce</secondary>
          </indexterm>
          <para> Use GHC's <literal>unsafeCoerce#</literal> extension to
          generate smaller, faster parsers. Type-safety isn't
          compromised.</para>

          <para>This option may only be used in conjunction with
          <option>-g</option>.</para>
        </listitem>
      </varlistentry>

      <varlistentry>
        <term><option>-a</option></term>
        <term><option>--arrays</option></term>
        <listitem>
          <indexterm>
            <primary>arrays</primary>
          </indexterm>
          <indexterm>
            <primary>back-ends</primary>
            <secondary>arrays</secondary>
          </indexterm>
          <para> Instructs <application>Happy</application> to generate a parser
          using an array-based shift-reduce parser. When used in
          conjunction with <option>-g</option>, the arrays will be
          encoded as strings, resulting in faster parsers.
Without 3105 <option>-g</option>, standard Haskell arrays will be 3106 used.</para> 3107 </listitem> 3108 </varlistentry> 3109 3110 <varlistentry> 3111 <term><option>-d</option></term> 3112 <term><option>--debug</option></term> 3113 <listitem> 3114 <indexterm> 3115 <primary>debug</primary> 3116 </indexterm> 3117 <indexterm> 3118 <primary>back-ends</primary> 3119 <secondary>debug</secondary> 3120 </indexterm> 3121 <para>Generate a parser that will print debugging 3122 information to <literal>stderr</literal> at run-time, 3123 including all the shifts, reductions, state transitions and 3124 token inputs performed by the parser.</para> 3125 3126 <para>This option can only be used in conjunction with 3127 <option>-a</option>.</para> 3128 </listitem> 3129 </varlistentry> 3130 3131 <varlistentry> 3132 <term><option>-l</option></term> 3133 <term><option>--glr</option></term> 3134 <listitem> 3135 <indexterm> 3136 <primary>glr</primary> 3137 </indexterm> 3138 <indexterm> 3139 <primary>back-ends</primary> 3140 <secondary>glr</secondary> 3141 </indexterm> 3142 <para>Generate a GLR parser for ambiguous grammars.</para> 3143 </listitem> 3144 </varlistentry> 3145 3146 <varlistentry> 3147 <term><option>-k</option></term> 3148 <term><option>--decode</option></term> 3149 <listitem> 3150 <indexterm> 3151 <primary>decode</primary> 3152 </indexterm> 3153 <para>Generate simple decoding code for GLR result.</para> 3154 </listitem> 3155 </varlistentry> 3156 3157 <varlistentry> 3158 <term><option>-f</option></term> 3159 <term><option>--filter</option></term> 3160 <listitem> 3161 <indexterm> 3162 <primary>filter</primary> 3163 </indexterm> 3164 <para>Filter the GLR parse forest with respect to semantic usage.</para> 3165 </listitem> 3166 </varlistentry> 3167 3168 <varlistentry> 3169 <term><option>-?</option></term> 3170 <term><option>--help</option></term> 3171 <listitem> 3172 <para>Print usage information on standard output then exit 3173 successfully.</para> 3174 </listitem> 3175 
</varlistentry> 3176 3177 <varlistentry> 3178 <term><option>-V</option></term> 3179 <term><option>--version</option></term> 3180 <listitem> 3181 <para>Print version information on standard output then exit 3182 successfully. Note that for legacy reasons <option>-v</option> 3183 is supported, too, but the use of it is deprecated. 3184 <option>-v</option> will be used for verbose mode when it is 3185 actually implemented.</para> 3186 </listitem> 3187 </varlistentry> 3188 3189 </variablelist> 3190 3191 </chapter> 3192 3193 <chapter id="sec-grammar-files"> 3194 <title>Syntax of Grammar Files</title> 3195 3196 <para>The input to <application>Happy</application> is a text file containing 3197 the grammar of the language you want to parse, together with some 3198 annotations that help the parser generator make a legal Haskell 3199 module that can be included in your program. This section gives 3200 the exact syntax of grammar files. </para> 3201 3202 <para>The overall format of the grammar file is given below:</para> 3203 3204<programlisting> 3205<optional module header> 3206<directives> 3207%% 3208<grammar> 3209<optional module trailer> 3210</programlisting> 3211 3212 <indexterm> 3213 <primary>module</primary> 3214 <secondary>header</secondary> 3215 </indexterm> 3216 <indexterm> 3217 <primary>module</primary> 3218 <secondary>trailer</secondary> 3219 </indexterm> 3220 <para>If the name of the grammar file ends in <literal>.ly</literal>, then 3221 it is assumed to be a literate script. All lines except those 3222 beginning with a <literal>></literal> will be ignored, and the 3223 <literal>></literal> will be stripped from the beginning of all the code 3224 lines. There must be a blank line between each code section 3225 (lines beginning with <literal>></literal>) and comment section. 
3226 Grammars not using the literate notation must be in a file with 3227 the <literal>.y</literal> suffix.</para> 3228 3229 <sect1 id="sec-lexical-rules"> 3230 <title>Lexical Rules</title> 3231 3232<para>Identifiers in <application>Happy</application> grammar files must take the following form (using 3233the BNF syntax from the Haskell Report):</para> 3234 3235<programlisting> 3236id ::= alpha { idchar } 3237 | ' { any{^'} | \' } ' 3238 | " { any{^"} | \" } " 3239 3240alpha ::= A | B | ... | Z 3241 | a | b | ... | z 3242 3243idchar ::= alpha 3244 | 0 | 1 | ... | 9 3245 | _ 3246</programlisting> 3247 3248 </sect1> 3249 3250 <sect1 id="sec-module-header"> 3251 <title>Module Header</title> 3252 3253 <indexterm> 3254 <primary>module</primary> 3255 <secondary>header</secondary> 3256 </indexterm> 3257 <para>This section is optional, but if included takes the 3258 following form:</para> 3259 3260<programlisting> 3261{ 3262<Haskell module header> 3263} 3264</programlisting> 3265 3266 <para>The Haskell module header contains the module name, 3267 exports, and imports. No other code is allowed in the 3268 header—this is because <application>Happy</application> may need to include 3269 its own <literal>import</literal> statements directly after the user 3270 defined header.</para> 3271 3272 </sect1> 3273 3274 <sect1 id="sec-directives"> 3275 <title>Directives</title> 3276 3277 <para>This section contains a number of lines of the form:</para> 3278 3279<programlisting> 3280%<directive name> <argument> ... 3281</programlisting> 3282 3283 <para>The statements here are all annotations to help 3284 <application>Happy</application> generate the Haskell code for the grammar. 
3285 Some of them are optional, and some of them are required.</para> 3286 3287 <sect2 id="sec-token-type"> 3288 <title>Token Type</title> 3289 3290<programlisting> 3291%tokentype { <valid Haskell type> } 3292</programlisting> 3293 3294 <indexterm> 3295 <primary><literal>%tokentype</literal></primary> 3296 </indexterm> 3297 <para>(mandatory) The <literal>%tokentype</literal> directive gives the 3298 type of the tokens passed from the lexical analyser to the 3299 parser (in order that <application>Happy</application> can supply types for 3300 functions and data in the generated parser).</para> 3301 3302 </sect2> 3303 3304 <sect2 id="sec-tokens"> 3305 <title>Tokens</title> 3306 3307<programlisting> 3308%token <name> { <Haskell pattern> } 3309 <name> { <Haskell pattern> } 3310 ... 3311</programlisting> 3312 3313 <indexterm> 3314 <primary><literal>%token</literal></primary> 3315 </indexterm> 3316 <para>(mandatory) The <literal>%token</literal> directive is used to 3317 tell <application>Happy</application> about all the terminal symbols used 3318 in the grammar. Each terminal has a name, by which it is 3319 referred to in the grammar itself, and a Haskell 3320 representation enclosed in braces. Each of the patterns must 3321 be of the same type, given by the <literal>%tokentype</literal> 3322 directive.</para> 3323 3324 <para>The name of each terminal follows the lexical rules for 3325 <application>Happy</application> identifiers given above. 
There are no 3326 lexical differences between terminals and non-terminals in the 3327 grammar, so it is recommended that you stick to a convention; 3328 for example using upper case letters for terminals and lower 3329 case for non-terminals, or vice-versa.</para> 3330 3331 <para><application>Happy</application> will give you a warning if you try 3332 to use the same identifier both as a non-terminal and a 3333 terminal, or introduce an identifier which is declared as 3334 neither.</para> 3335 3336 <para>To save writing lots of projection functions that map 3337 tokens to their components, you can include 3338 <literal>$$</literal> in your Haskell pattern. For 3339 example:</para> 3340 <indexterm> 3341 <primary><literal>$$</literal></primary> 3342 </indexterm> 3343 3344<programlisting> 3345%token INT { TokenInt $$ } 3346 ... 3347</programlisting> 3348 3349<para>This makes the semantic value of <literal>INT</literal> refer to the first argument 3350of <literal>TokenInt</literal> rather than the whole token, eliminating the need for 3351any projection function.</para> 3352 3353 </sect2> 3354 3355 <sect2 id="sec-parser-name"> 3356 <title>Parser Name</title> 3357 3358<programlisting> 3359%name <Haskell identifier> [ <non-terminal> ] 3360... 3361</programlisting> 3362 <indexterm> 3363 <primary><literal>%name</literal></primary> 3364 </indexterm> 3365 3366 <para>(optional) The <literal>%name</literal> directive is followed by 3367 a valid Haskell identifier, and gives the name of the 3368 top-level parsing function in the generated parser. 
This is 3369 the only function that needs to be exported from a parser 3370 module.</para> 3371 3372 <para>If the <literal>%name</literal> directive is omitted, it 3373 defaults to <literal>happyParse</literal>.</para> 3374 <indexterm> 3375 <primary><function>happyParse</function></primary> 3376 </indexterm> 3377 3378 <para>The <literal>%name</literal> directive takes an optional 3379 second parameter which specifies the top-level non-terminal 3380 which is to be parsed. If this parameter is omitted, it 3381 defaults to the first non-terminal defined in the 3382 grammar.</para> 3383 3384 <para>Multiple <literal>%name</literal> directives may be 3385 given, specifying multiple parser entry points for this 3386 grammar (see <xref linkend="sec-multiple-parsers"/>). When 3387 multiple <literal>%name</literal> directives are given, they 3388 must all specify explicit non-terminals.</para> 3389 </sect2> 3390 3391 <sect2 id="sec-partial-parsers"> 3392 <title>Partial Parsers</title> 3393 3394<programlisting> 3395%partial <Haskell identifier> [ <non-terminal> ] 3396... 3397</programlisting> 3398 <indexterm> 3399 <primary><literal>%partial</literal></primary> 3400 </indexterm> 3401 3402 <para>The <literal>%partial</literal> directive can be used instead of 3403 <literal>%name</literal>. It indicates that the generated parser 3404 should be able to parse an initial portion of the input. In 3405 contrast, a parser specified with <literal>%name</literal> will only 3406 parse the entire input.</para> 3407 3408 <para>A parser specified with <literal>%partial</literal> will stop 3409 parsing and return a result as soon as there exists a complete parse, 3410 and no more of the input can be parsed. 
It does this by accepting
          the parse if it is followed by the <literal>error</literal> token,
          rather than insisting that the parse is followed by the
          end of the token stream (or the <literal>eof</literal> token in the
          case of a <literal>%lexer</literal> parser).</para>
      </sect2>

      <sect2 id="sec-monad-decl">
        <title>Monad Directive</title>

<programlisting>
%monad { &lt;type&gt; } { &lt;then&gt; } { &lt;return&gt; }
</programlisting>
        <indexterm>
          <primary><literal>%monad</literal></primary>
        </indexterm>

        <para>(optional) The <literal>%monad</literal> directive takes three
        arguments: the type constructor of the monad, the
        <literal>then</literal> (or <literal>bind</literal>) operation, and the
        <literal>return</literal> (or <literal>unit</literal>) operation. The type
        constructor can be any type with kind <literal>* -&gt; *</literal>.</para>

        <para>Monad declarations are described in more detail in <xref
        linkend="sec-monads"/>.</para>

      </sect2>

      <sect2 id="sec-lexer-decl">
        <title>Lexical Analyser</title>

<programlisting>
%lexer { &lt;lexer&gt; } { &lt;eof&gt; }
</programlisting>
        <indexterm>
          <primary><literal>%lexer</literal></primary>
        </indexterm>

        <para>(optional) The <literal>%lexer</literal> directive takes two
        arguments: <literal>&lt;lexer&gt;</literal> is the name of the lexical
        analyser function, and <literal>&lt;eof&gt;</literal> is a token that
        is to be treated as the end of file.</para>

        <para>Lexer declarations are described in more detail in <xref
        linkend="sec-lexers"/>.</para>

      </sect2>

      <sect2 id="sec-prec-decls">
        <title>Precedence declarations</title>

<programlisting>
%left &lt;name&gt; ...
%right &lt;name&gt; ...
%nonassoc &lt;name&gt; ...
</programlisting>
        <indexterm>
          <primary><literal>%left</literal> directive</primary>
        </indexterm>
        <indexterm>
          <primary><literal>%right</literal> directive</primary>
        </indexterm>
        <indexterm>
          <primary><literal>%nonassoc</literal> directive</primary>
        </indexterm>

        <para>These declarations are used to specify the precedences
        and associativity of tokens. The precedence assigned by a
        <literal>%left</literal>, <literal>%right</literal> or
        <literal>%nonassoc</literal> declaration is defined to be
        higher than the precedence assigned by all declarations
        earlier in the file, and lower than the precedence assigned by
        all declarations later in the file.</para>

        <para>The associativity of a token relative to tokens in the
        same <literal>%left</literal>, <literal>%right</literal>, or
        <literal>%nonassoc</literal> declaration is to the left, to
        the right, or non-associative respectively.</para>

        <para>Precedence declarations are described in more detail in
        <xref linkend="sec-Precedences"/>.</para>
      </sect2>

      <sect2 id="sec-expect">
        <title>Expect declarations</title>
<programlisting>
%expect &lt;number&gt;
</programlisting>
        <indexterm>
          <primary><literal>%expect</literal> directive</primary>
        </indexterm>

        <para>(optional) More often than not, the grammar you write
        will have conflicts, and these conflicts generate warnings.
        Once you have checked the warnings and made sure that Happy
        handles them correctly, the warnings are just annoying. The
        <literal>%expect</literal> directive gives a way of avoiding
        them. Declaring <literal>%expect
        <replaceable>n</replaceable></literal> is a way of telling
        Happy &ldquo;There are exactly <replaceable>n</replaceable>
        shift/reduce conflicts and zero reduce/reduce conflicts in
        this grammar.
I promise I have checked them and they are
        resolved correctly&rdquo;. When processing the grammar, Happy
        will check the actual number of conflicts against the
        <literal>%expect</literal> declaration if any, and if there is
        a discrepancy then an error will be reported.</para>

        <para>Happy's <literal>%expect</literal> directive works
        exactly like that of yacc.</para>
      </sect2>

      <sect2 id="sec-error-directive">
        <title>Error declaration</title>

<programlisting>
%error { &lt;identifier&gt; }
</programlisting>
        <indexterm>
          <primary><literal>%error</literal></primary>
        </indexterm>

        <para>Specifies the function to be called in the event of a
        parse error. The type of <literal>&lt;identifier&gt;</literal> varies
        depending on the presence of <literal>%lexer</literal> (see
        <xref linkend="sec-monad-summary" />) and <literal>%errorhandlertype</literal>
        (see the next section).</para>
      </sect2>

      <sect2 id="sec-errorhandlertype-directive">
        <title>Additional error information</title>

<programlisting>
%errorhandlertype (explist | default)
</programlisting>

        <indexterm>
          <primary><literal>%errorhandlertype</literal></primary>
        </indexterm>

        <para>(optional) The user-supplied error handler can be
        given additional information. By default, no information is added, for
        compatibility with previous versions. However, if <literal>explist</literal>
        is provided with this directive, then the first application will be of
        type <literal>[String]</literal>, providing a description of possible tokens
        that would not have failed the parser in place of the token that has caused
        the error.
        </para>
      </sect2>

      <sect2 id="sec-attributes">
        <title>Attribute Type Declaration</title>
<programlisting>
%attributetype { &lt;valid Haskell type declaration&gt; }
</programlisting>
        <indexterm>
          <primary><literal>%attributetype</literal> directive</primary>
        </indexterm>

        <para>(optional) This directive allows you to declare the type of the
        attributes record when defining an attribute grammar. If this declaration
        is not given, Happy will choose a default. This declaration may only
        appear once in a grammar.
        </para>
        <para>
          Attribute grammars are explained in <xref linkend="sec-AttributeGrammar"/>.
        </para>
      </sect2>

      <sect2 id="sec-attribute">
        <title>Attribute declaration</title>
<programlisting>
%attribute &lt;Haskell identifier&gt; { &lt;valid Haskell type&gt; }
</programlisting>
        <indexterm>
          <primary><literal>%attribute</literal> directive</primary>
        </indexterm>

        <para>The presence of one or more of these directives declares that the
        grammar is an attribute grammar. The first attribute listed becomes the
        default attribute. Each <literal>%attribute</literal> directive generates a
        field in the attributes record with the given label and type. If there
        is an <literal>%attributetype</literal> declaration in the grammar which
        introduces type variables, then the type of an attribute may mention any
        such type variables.
        </para>

        <para>
          Attribute grammars are explained in <xref linkend="sec-AttributeGrammar"/>.
        </para>
      </sect2>

    </sect1>

    <sect1 id="sec-grammar">
      <title>Grammar</title>

      <para>The grammar section comes after the directives, separated
      from them by a double-percent (<literal>%%</literal>) symbol.
      This section contains a number of
      <emphasis>productions</emphasis>, each of which defines a single
      non-terminal.
Each production has the following syntax:</para>
      <indexterm>
        <primary><literal>%%</literal></primary>
      </indexterm>

<programlisting>
&lt;non-terminal&gt; [ :: { &lt;type&gt; } ]
        :  &lt;id&gt; ... {[%] &lt;expression&gt; }
      [ |  &lt;id&gt; ... {[%] &lt;expression&gt; }
        ... ]
</programlisting>

      <para>The first line gives the non-terminal to be defined by the
      production and optionally its type (type signatures for
      productions are discussed in <xref
      linkend="sec-type-signatures"/>).</para>

      <para>Each production has at least one, and possibly many
      right-hand sides. Each right-hand side consists of zero or more
      symbols (terminals or non-terminals) and a Haskell expression
      enclosed in braces.</para>

      <para>The expression represents the semantic value of the
      non-terminal, and may refer to the semantic values of the
      symbols in the right-hand side using the meta-variables
      <literal>$1 ... $n</literal>. It is an error to
      refer to <literal>$i</literal> when <literal>i</literal>
      is larger than the number of symbols on the right-hand side of
      the current rule. The symbol <literal>$</literal> may be
      inserted literally in the Haskell expression using the sequence
      <literal>\$</literal> (this isn't necessary inside a
      string or character literal).</para>

      <para>Additionally, the sequence <literal>$&gt;</literal>
      can be used to represent the value of the rightmost symbol.</para>

      <para>A semantic value of the form <literal>{% ... }</literal> is a
      <emphasis>monadic action</emphasis>, and is only valid when the grammar
      file contains a <literal>%monad</literal> directive (<xref
      linkend="sec-monad-decl"/>).
Monadic actions are discussed in
      <xref linkend="sec-monads"/>.</para>
      <indexterm>
        <primary>monadic</primary>
        <secondary>action</secondary>
      </indexterm>

      <para>Remember that all the expressions for a production must
      have the same type.</para>

      <sect2 id="sec-param-prods">
        <title>Parameterized Productions</title>
        <para>
          Starting from version 1.17.1, <application>Happy</application> supports
          <emphasis>parameterized productions</emphasis> which provide a
          convenient notation for capturing recurring patterns in context-free
          grammars. This gives the benefits of something similar to parsing
          combinators in the context of <application>Happy</application>
          grammars.
        </para>
        <para>This functionality is best illustrated with an example:
<programlisting>
opt(p)          : p                   { Just $1 }
                |                     { Nothing }

rev_list1(p)    : p                   { [$1] }
                | rev_list1(p) p      { $2 : $1 }
</programlisting>
          The first production, <literal>opt</literal>, is used for optional
          components of a grammar. It is just like <literal>p?</literal> in
          regular expressions or EBNF. The second production,
          <literal>rev_list1</literal>, is for parsing a list of 1 or more
          occurrences of <literal>p</literal>. Parameterized productions are
          just like ordinary productions, except that they have parameters in
          parentheses after the production name. Multiple parameters should
          be separated by commas:
<programlisting>
fst(p,q)        : p q                 { $1 }
snd(p,q)        : p q                 { $2 }
both(p,q)       : p q                 { ($1,$2) }
</programlisting>
        </para>

        <para>To use a parameterized production, we have to pass values for the
          parameters, as if we are calling a function. The parameters can be
          either terminals, non-terminals, or other instantiations of
          parameterized productions.
Here are some examples:
<programlisting>
list1(p)        : rev_list1(p)        { reverse $1 }
list(p)         : list1(p)            { $1 }
                |                     { [] }
</programlisting>
          The first production uses <literal>rev_list1</literal> to define
          a production that behaves like <literal>p+</literal>, returning
          a list of elements in the same order as they occurred in the input.
          The second one, <literal>list</literal>, is like <literal>p*</literal>.
        </para>

        <para>Parameterized productions are implemented as a preprocessing
          pass in Happy: each instantiation of a production turns into a
          separate non-terminal, but Happy is careful to avoid generating the
          same rule multiple times, as this would lead to an ambiguous grammar.
          Consider, for example, the following parameterized rule:
<programlisting>
sep1(p,q)       : p list(snd(q,p))    { $1 : $2 }
</programlisting>
          The rules that would be generated for <literal>sep1(EXPR,SEP)</literal> are:
<programlisting>
sep1(EXPR,SEP)
  : EXPR list(snd(SEP,EXPR))                { $1 : $2 }

list(snd(SEP,EXPR))
  : list1(snd(SEP,EXPR))                    { $1 }
  |                                         { [] }

list1(snd(SEP,EXPR))
  : rev_list1(snd(SEP,EXPR))                { reverse $1 }

rev_list1(snd(SEP,EXPR))
  : snd(SEP,EXPR)                           { [$1] }
  | rev_list1(snd(SEP,EXPR)) snd(SEP,EXPR)  { $2 : $1 }

snd(SEP,EXPR)
  : SEP EXPR                                { $2 }
</programlisting>
          Note that this is just a normal grammar, with slightly strange names
          for the non-terminals.
        </para>

        <para>A drawback of the current implementation is that it does not
          support type signatures for parameterized productions that
          depend on the types of the parameters. We plan to implement that
          in the future; the current workaround is to omit the type signatures
          for such rules.
        </para>
      </sect2>

    </sect1>

    <sect1 id="sec-module-trailer">
      <title>Module Trailer</title>
      <indexterm>
        <primary>module</primary>
        <secondary>trailer</secondary>
      </indexterm>

      <para>The module trailer is optional, comes right at the end of
      the grammar file, and takes the same form as the module
      header:</para>

<programlisting>
{
&lt;Haskell code&gt;
}
</programlisting>

      <para>This section is used for placing auxiliary definitions
      that need to be in the same module as the parser. In small
      parsers, it often contains a hand-written lexical analyser too.
      There is no restriction on what can be placed in the module
      trailer, and any code in there is copied verbatim into the
      generated parser file.</para>

    </sect1>
  </chapter>

  <chapter id="sec-info-files">
    <title>Info Files</title>
    <indexterm>
      <primary>info files</primary>
    </indexterm>

    <para>
      Happy info files, generated using the <literal>-i</literal> flag,
      are your most important tool for debugging errors in your grammar.
      Although they can be quite verbose, the general concept behind
      them is quite simple.
    </para>

    <para>
      An info file contains the following information:
    </para>

    <orderedlist>
      <listitem>
        <para>A summary of all shift/reduce and reduce/reduce
        conflicts in the grammar.</para>
      </listitem>
      <listitem>
        <para>Under section <literal>Grammar</literal>, a summary of all the rules in the grammar. These rules correspond directly to your input file, absent the actual Haskell code that is to be run for each rule.
A rule is written in the form <literal>&lt;non-terminal&gt; -&gt; &lt;id&gt; ...</literal></para>
      </listitem>
      <listitem>
        <para>Under section <literal>Terminals</literal>, a summary of all the terminal tokens you may run against, as well as the Haskell pattern which matches against them. This corresponds directly to the contents of your <literal>%token</literal> directive (<xref linkend="sec-tokens"/>).</para>
      </listitem>
      <listitem>
        <para>Under section <literal>Non-terminals</literal>, a summary of which rules apply to which productions. This is generally redundant with the <literal>Grammar</literal> section.</para>
      </listitem>
      <listitem>
        <para>The primary section <literal>States</literal>, which describes the state-machine Happy built for your grammar, and all of the transitions for each state.</para>
      </listitem>
      <listitem>
        <para>Finally, some statistics under <literal>Grammar Totals</literal> at the end of the file.</para>
      </listitem>
    </orderedlist>
    <para>In general, you will be most interested in the <literal>States</literal> section, as it will give you information, in particular, about any conflicts your grammar may have.</para>

    <sect1 id="sec-info-files-states">
      <title>States</title>
      <para>Although Happy does its best to insulate you from the
      vagaries of parser generation, it's important to know a little
      about how shift-reduce parsers work in order to be able to
      interpret the entries in the <literal>States</literal>
      section.</para>

      <para>In general, a shift-reduce parser operates by maintaining
      a parse stack, which tokens and productions are shifted onto or
      reduced off of. The parser maintains a state machine, which
      accepts a token, performs some shift or reduce, and transitions
      to a new state for the next token.
Importantly, these states
      represent <emphasis>multiple</emphasis> possible productions,
      because in general the parser does not know what the actual
      production for the tokens it's parsing is going to be.
      There's no direct correspondence between the state-machine
      and the input grammar; this is something you have to
      reverse engineer.</para>

      <para>With this knowledge in mind, we can look at two example states
      from the example grammar from <xref linkend="sec-using" />:
      </para>

<programlisting>
State 5

        Exp1 -&gt; Term .                                      (rule 5)
        Term -&gt; Term . '*' Factor                           (rule 6)
        Term -&gt; Term . '/' Factor                           (rule 7)

        in             reduce using rule 5
        '+'            reduce using rule 5
        '-'            reduce using rule 5
        '*'            shift, and enter state 11
        '/'            shift, and enter state 12
        ')'            reduce using rule 5
        %eof           reduce using rule 5

State 9

        Factor -&gt; '(' . Exp ')'                             (rule 11)

        let            shift, and enter state 2
        int            shift, and enter state 7
        var            shift, and enter state 8
        '('            shift, and enter state 9

        Exp            goto state 10
        Exp1           goto state 4
        Term           goto state 5
        Factor         goto state 6
</programlisting>

      <para>For each state, the first set of lines describes the
      <emphasis>rules</emphasis> which correspond to this state. A
      period <literal>.</literal> is inserted in the production to
      indicate where, if this is indeed the correct production, we
      would have parsed up to. In state 5, there are multiple rules,
      so we don't know if we are parsing an <literal>Exp1</literal>, a
      multiplication or a division (however, we do know there is a
      <literal>Term</literal> on the parse stack); in state 9, there
      is only one rule, so we know we are definitely parsing a
      <literal>Factor</literal>.</para>

      <para>The next set of lines specifies the action and state
      transition that should occur given a token.
For example, if in
      state 5 we process the <literal>'*'</literal> token, this token
      is shifted onto the parse stack and we transition to the state
      corresponding to the rule <literal>Term -&gt; Term '*' .
      Factor</literal> (matching the token has disambiguated which
      rule we are parsing).</para>

      <para>Finally, for states which shift on non-terminals,
      there will be a last set of lines saying what should be done
      after the non-terminal has been fully parsed; this information
      is effectively the stack for the parser. When a reduce occurs,
      these goto entries are used to determine what the next
      state should be.</para>

      <!-- Probably could improve this section by walking through
      parsing -->

    </sect1>

    <sect1 id="sec-info-files-conflicts">
      <title>Interpreting conflicts</title>

      <para>When you have a conflict, you will see an entry like this
      in your info file:</para>

<programlisting>
State 432

        atype -&gt; SIMPLEQUOTE '[' . comma_types0 ']'         (rule 318)
        sysdcon -&gt; '[' . ']'                                (rule 613)

        '_'            shift, and enter state 60
        'as'           shift, and enter state 16

...

        ']'            shift, and enter state 381
                        (reduce using rule 328)

...
</programlisting>

      <para>On large, complex grammars, determining what the conflict is
      can be a bit of an art, since the state with the conflict may
      not have enough information to determine why a conflict is
      occurring.</para>

      <para>In some cases, the rules associated with the state with
      the conflict will immediately give you enough guidance to
      determine what the ambiguous syntax is.
      For example, in the miniature shift/reduce conflict
      described in <xref linkend="sec-conflict-tips" />,
      the conflict looks like this:</para>

<programlisting>
State 13

        exp -&gt; exp . '+' exp0                               (rule 1)
        exp0 -&gt; if exp then exp else exp .                  (rule 3)

        then           reduce using rule 3
        else           reduce using rule 3
        '+'            shift, and enter state 7
                        (reduce using rule 3)

        %eof           reduce using rule 3
</programlisting>

<para>Here, rule 3 makes it easy to imagine that we had been parsing a
  statement like <literal>if 1 then 2 else 3 + 4</literal>; the conflict
  arises from whether or not we should shift (thus parsing as
  <literal>if 1 then 2 else (3 + 4)</literal>) or reduce (thus parsing
  as <literal>(if 1 then 2 else 3) + 4</literal>).</para>

<para>Sometimes, there's not as much helpful context in the error message;
take this abridged example from GHC's parser:</para>

<programlisting>
State 49

        type -&gt; btype .                                     (rule 281)
        type -&gt; btype . '-&gt;' ctype                          (rule 284)

        '-&gt;'           shift, and enter state 472
                        (reduce using rule 281)
</programlisting>

<para>A pair of rules like this doesn't always result in a shift/reduce
  conflict: to reduce with rule 281 implies that, in some context when
  parsing the non-terminal <literal>type</literal>, it is possible for
  an <literal>'-&gt;'</literal> to occur immediately afterwards (indeed
  these source rules are factored such that there is no rule of the form
  <literal>... -&gt; type '-&gt;' ...</literal>).</para>

<para>The best way this author knows how to sleuth this out is to
  look for instances of the token and check if any of the preceding
  non-terminals could terminate in a type:</para>

<programlisting>
        texp -&gt; exp '-&gt;' texp                              (500)
        exp -&gt; infixexp '::' sigtype                       (414)
        sigtype -&gt; ctype                                   (260)
        ctype -&gt; type                                      (274)
</programlisting>

<para>As it turns out, this shift/reduce conflict results from
  ambiguity for <emphasis>view patterns</emphasis>, as in
  the code sample <literal>case v of { x :: T -&gt; T ...
}</literal>.</para>

    </sect1>

  </chapter>

  <chapter id="sec-tips">
    <title>Tips</title>

    <para>This chapter contains a lot of accumulated lore about using
    <application>Happy</application>.</para>

    <sect1 id="sec-performance-tips">
      <title>Performance Tips</title>

      <para>How to make your parser go faster:</para>

      <itemizedlist>

        <listitem>
          <para> If you are using GHC<indexterm>
            <primary>GHC</primary>
          </indexterm>, generate parsers using the
          <literal>-a -g -c</literal> options, and compile them using GHC with
          the <literal>-fglasgow-exts</literal> option. This is worth a
          <emphasis>lot</emphasis>, in terms of compile-time,
          execution speed and binary size.<footnote><para>Omitting the
          <literal>-a</literal> option may generate slightly faster parsers,
          but they will be much bigger.</para></footnote></para>
        </listitem>

        <listitem>
          <para> The lexical analyser is usually the most performance
          critical part of a parser, so it's worth spending some time
          optimising this. Profiling tools are essential here. In
          really dire circumstances, resort to some of the hacks that
          are used in the Glasgow Haskell Compiler's interface-file
          lexer.</para>
        </listitem>

        <listitem>
          <para> Simplify the grammar as much as possible, as this
          reduces the number of states and reduction rules that need
          to be applied.</para>
        </listitem>

        <listitem>
          <para> Use left recursion rather than right recursion
          <indexterm>
            <primary>recursion, left vs. right</primary>
          </indexterm>
          wherever possible.
While not strictly a performance issue,
          this affects the size of the parser stack, which is kept on
          the heap and thus needs to be garbage collected.</para>
        </listitem>

      </itemizedlist>


    </sect1>

    <sect1 id="sec-compilation-time">
      <title>Compilation-Time Tips</title>

      <para>We have found that compiling parsers generated by
      <application>Happy</application> can take a large amount of time/memory, so
      here are some tips on making things more sensible:</para>

      <itemizedlist>

        <listitem>
          <para> Include as little code as possible in the module
          trailer. This code is included verbatim in the generated
          parser, so if any of it can go in a separate module, do
          so.</para>
        </listitem>

        <listitem>
          <para> Give type signatures
          <indexterm>
            <primary>type</primary>
            <secondary>signatures in grammar</secondary>
          </indexterm>
          for everything (see <xref
          linkend="sec-type-signatures"/>). This is reported to improve
          things by about 50%. If there is a type signature for every
          single non-terminal in the grammar, then <application>Happy</application>
          automatically generates type signatures for most functions
          in the parser.</para>
        </listitem>

        <listitem>
          <para> Simplify the grammar as much as possible (applies to
          everything, this one).</para>
        </listitem>

        <listitem>
          <para> Use a recent version of GHC.
Versions from 4.04
          onwards have lower memory requirements for compiling
          <application>Happy</application>-generated parsers.</para>
        </listitem>

        <listitem>
          <para> Using <application>Happy</application>'s <literal>-g -a -c</literal>
          options when generating parsers to be compiled with GHC will
          help considerably.</para>
        </listitem>

      </itemizedlist>

    </sect1>

    <sect1 id="sec-finding-errors">
      <title>Finding Type Errors</title>

      <indexterm>
        <primary>type</primary>
        <secondary>errors, finding</secondary>
      </indexterm>

      <para>Finding type errors in grammar files is inherently
      difficult because the code for reductions is moved around before
      being placed in the parser. We currently have no way of passing
      the original filename and line numbers to the Haskell compiler,
      so there is no alternative but to look at the parser and match
      the code to the grammar file. An info file (generated by the
      <literal>-i</literal> option) can be helpful here.</para>

      <indexterm>
        <primary>type</primary>
        <secondary>signatures in grammar</secondary>
      </indexterm>

      <para>Type signatures sometimes help by pinning down the
      particular error to the place where the mistake is made, not
      half way down the file. For each production in the grammar,
      there's a bit of code in the generated file that looks like
      this:</para>

<programlisting>
HappyAbsSyn&lt;n&gt; ( E )
</programlisting>
      <indexterm>
        <primary><literal>HappyAbsSyn</literal></primary>
      </indexterm>

      <para>where <literal>E</literal> is the Haskell expression from the
      grammar file (with <literal>$n</literal> replaced by
      <literal>happy_var_n</literal>).
If there is a type signature for this
      production, then <application>Happy</application> will have taken it into
      account when declaring the <literal>HappyAbsSyn</literal> datatype, and errors in
      <literal>E</literal> will be caught right here. Of course, the error may
      really be caused by incorrect use of one of the
      <literal>happy_var_n</literal> variables.</para>

      <para>(This section will contain more info as we gain experience
      with creating grammar files. Please send us any helpful tips
      you find.)</para>

    </sect1>

    <sect1 id="sec-conflict-tips">
      <title>Conflict Tips</title>
      <indexterm>
        <primary>conflicts</primary>
      </indexterm>

      <para>Conflicts arise from ambiguities in the grammar. That is,
      some input sequences may possess more than one parse.
      Shift/reduce conflicts are benign in the sense that they are
      easily resolved (<application>Happy</application> automatically selects the
      shift action, as this is usually the intended one).
      Reduce/reduce conflicts are more serious. A reduce/reduce
      conflict implies that a certain sequence of tokens on the input
      can represent more than one non-terminal, and the parser is
      uncertain as to which reduction rule to use. It will select the
      reduction rule uppermost in the grammar file, so if you really
      must have a reduce/reduce conflict you can select which rule
      will be used by putting it first in your grammar file.</para>

      <para>It is usually possible to remove conflicts from the
      grammar, but sometimes this is at the expense of clarity and
      simplicity. Here is a cut-down example from the grammar of
      Haskell (1.2):</para>

<programlisting>
exp     : exp op exp0
        | exp0

exp0    : if exp then exp else exp
        ...
        | atom

atom    : var
        | integer
        | '(' exp ')'
        ...
</programlisting>

      <para>This grammar has a shift/reduce conflict, due to the
      following ambiguity. In an input such as</para>

<programlisting>
if 1 then 2 else 3 + 4
</programlisting>

      <para>the grammar doesn't specify whether the parse should be</para>

<programlisting>
if 1 then 2 else (3 + 4)
</programlisting>

      <para>or</para>

<programlisting>
(if 1 then 2 else 3) + 4
</programlisting>

      <para>and the ambiguity shows up as a shift/reduce conflict on
      reading the 'op' symbol. In this case, the first parse is the
      intended one (the 'longest parse' rule), which corresponds to
      the shift action. Removing this conflict relies on noticing
      that the expression on the left-hand side of an infix operator
      can't be an <literal>exp0</literal> (the grammar previously said
      otherwise, but since the conflict was resolved as shift, this
      parse was not allowed). We can reformulate the
      <literal>exp</literal> rule as:</para>

<programlisting>
exp     : atom op exp
        | exp0
</programlisting>

      <para>and this removes the conflict, but at the expense of some
      stack space while parsing (we turned a left-recursion into a
      right-recursion). There are alternatives using left-recursion,
      but they all involve adding extra states to the parser, so most
      programmers will prefer to keep the conflict in favour of a
      clearer and more efficient parser.</para>

      <sect2 id="sec-lalr">
        <title>LALR(1) parsers</title>

        <para>There are three basic ways to build a shift-reduce
        parser. Full LR(1) (the `L' is the direction in which the
        input is scanned, the `R' is the way in which the parse is
        built, and the `1' is the number of tokens of lookahead)
        generates a parser with many states, and is therefore large
        and slow.
SLR(1) (simple LR(1)) is a cut-down version of
        LR(1) which generates parsers with roughly one-tenth as many
        states, but lacks the power to parse many grammars (it finds
        conflicts in grammars which have none under LR(1)). </para>

        <para>LALR(1) (look-ahead LR(1)), the method used by
        <application>Happy</application> and
        <application>yacc</application>, is a tradeoff between the two.
        An LALR(1) parser has the same number of states as an SLR(1)
        parser, but it uses a more complex method to calculate the
        lookahead tokens that are valid at each point, and resolves
        many of the conflicts that SLR(1) finds. However, there may
        still be conflicts in an LALR(1) parser that wouldn't be there
        with full LR(1).</para>

      </sect2>
    </sect1>

    <sect1 id="sec-happy-ghci">
      <title>Using Happy with <application>GHCi</application></title>
      <indexterm><primary><application>GHCi</application></primary>
      </indexterm>

      <para><application>GHCi</application>'s compilation manager
      doesn't understand Happy grammars, but with some creative use of
      macros and makefiles we can give the impression that
      <application>GHCi</application> is invoking Happy
      automatically:</para>

      <itemizedlist>
        <listitem>
          <para>Create a simple makefile, called
          <filename>Makefile_happysrcs</filename>:</para>

<programlisting>HAPPY = happy
HAPPY_OPTS =

all: MyParser.hs

%.hs: %.y
	$(HAPPY) $(HAPPY_OPTS) $&lt; -o $@</programlisting>
        </listitem>

        <listitem>
          <para>Create a macro in GHCi to replace the
          <literal>:reload</literal> command, like so (type this all
          on one line):</para>

<screen>:def myreload (\_ -&gt; System.system "make -f Makefile_happysrcs"
   &gt;&gt;= \rr -&gt; case rr of { System.ExitSuccess -&gt; return ":reload" ;
                           _ -&gt; return "" })</screen>
        </listitem>

        <listitem>
          <para>Use
          <literal>:myreload</literal>
          (<literal>:my</literal> will do) instead of
          <literal>:reload</literal> (<literal>:r</literal>).</para>
        </listitem>
      </itemizedlist>
    </sect1>

    <sect1 id="sec-monad-alex">
      <title>Basic monadic Happy use with Alex</title>
      <indexterm>
        <primary><application>Alex</application></primary>
        <secondary>monad</secondary>
      </indexterm>

      <para>
        <application>Alex</application> lexers are often used by
        <application>Happy</application> parsers, for example in
        GHC.  While many of these applications are quite sophisticated,
        it is still useful to combine the basic
        <application>Happy</application> <literal>%monad</literal>
        directive with the <application>Alex</application>
        <literal>monad</literal> wrapper.  By using monads for both,
        the resulting parser and lexer can report errors as ordinary
        values, which is far more graceful than throwing an exception.
      </para>

      <para>
        The most straightforward way to use a monadic
        <application>Alex</application> lexer is to use the
        <literal>Alex</literal> monad as the
        <application>Happy</application> monad:
      </para>

      <example><title>Lexer.x</title>
<programlisting>{
module Lexer where
}

%wrapper "monad"

tokens :-
  ...

{
data Token = ... | EOF
  deriving (Eq, Show)

alexEOF = return EOF
}</programlisting></example>
      <example><title>Parser.y</title>
<programlisting>{
module Parser where

import Lexer
}

%name pFoo
%tokentype { Token }
%error { parseError }
%monad { Alex } { >>= } { return }
%lexer { lexer } { EOF }

%token
  ...

%%
  ...

{
-- Report the position held in the lexer state when parsing fails.
parseError :: Token -> Alex a
parseError _ = do
  ((AlexPn _ line column), _, _, _) &lt;- alexGetInput
  alexError ("parse error at line " ++ (show line) ++ ", column " ++ (show column))

-- Pass the next token from the Alex lexer to the parser's continuation.
lexer :: (Token -> Alex a) -> Alex a
lexer = (alexMonadScan >>=)
}</programlisting></example>

      <para>
        We can then run the finished parser in the
        <literal>Alex</literal> monad using
        <literal>runAlex</literal>, which returns an
        <literal>Either</literal> value rather than throwing an
        exception in case of a parse or lexical error:
      </para>

<programlisting>
import qualified Lexer as Lexer
import qualified Parser as Parser

parseFoo :: String -> Either String Foo
parseFoo s = Lexer.runAlex s Parser.pFoo
</programlisting>

    </sect1>
  </chapter>
  <index/>
</book>