1<?xml version="1.0" encoding="iso-8859-1"?>
2<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
3   "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
4
5<book id="happy">
6  <bookinfo>
    <date>2001-04-27</date>
8    <title>Happy User Guide</title>
9    <author>
10      <firstname>Simon</firstname>
11      <surname>Marlow</surname>
12    </author>
13    <author>
14      <firstname>Andy</firstname>
15      <surname>Gill</surname>
16    </author>
17    <address><email>simonmar@microsoft.com</email></address>
18    <copyright>
19      <year>1997-2009</year>
20      <holder>Simon Marlow</holder>
21    </copyright>
22    <abstract>
23      <para>This document describes Happy, the Haskell Parser
24	Generator, version 1.18.</para>
25    </abstract>
26  </bookinfo>
27
28  <!-- Table of contents -->
29  <toc></toc>
30
31<!-- Introduction ========================================================= -->
32
33  <chapter id="happy-introduction">
34    <title>Introduction</title>
35
36
37    <para> <application>Happy</application> is a parser generator
38    system for Haskell, similar to the tool
39    <application>yacc</application> for C.  Like
40    <application>yacc</application>, it takes a file containing an
41    annotated BNF specification of a grammar and produces a Haskell
42    module containing a parser for the grammar. </para>
43
44    <indexterm><primary>yacc</primary></indexterm>
45
46    <para> <application>Happy</application> is flexible: you can have several
47    <application>Happy</application> parsers in the same program, and
48    each parser may have multiple entry points.
49    <application>Happy</application> can work in conjunction with a
50    lexical analyser supplied by the user (either hand-written or
51    generated by another program), or it can parse a stream of
52    characters directly (but this isn't practical in most cases).  In
53    a future version we hope to include a lexical analyser generator
54    with <application>Happy</application> as a single package. </para>
55
56    <para> Parsers generated by <application>Happy</application> are
57    fast; generally faster than an equivalent parser written using
58    parsing combinators or similar tools.  Furthermore, any future
59    improvements made to <application>Happy</application> will benefit
60    an existing grammar, without need for a rewrite. </para>
61
62    <para> <application>Happy</application> is sufficiently powerful
    to parse full Haskell:
    <ulink url="http://www.haskell.org/ghc">GHC</ulink> itself uses
    a Happy parser.</para>
66
67    <indexterm><primary><literal>hsparser</literal></primary></indexterm>
68    <indexterm>
69      <primary>Haskell parser</primary>
70      <see><literal>hsparser</literal></see>
71    </indexterm>
72
73    <para> <application>Happy</application> can currently generate
74    four types of parser from a given grammar, the intention being
75    that we can experiment with different kinds of functional code to
76    see which is the best, and compiler writers can use the different
77    types of parser to tune their compilers.  The types of parser
78    supported are: </para>
79
80    <orderedlist>
81
82      <listitem id="item-default-backend">
83        <para><quote>standard</quote> Haskell 98 (should work with any compiler
84	that compiles Haskell 98).</para>
85      </listitem>
86
87      <listitem>
88        <para>standard Haskell using arrays
89	<indexterm scope="all"><primary>arrays</primary></indexterm>
90	<indexterm scope="all"><primary>back-ends</primary><secondary>arrays</secondary></indexterm>
91	(this is not the default
92	because we have found this generates slower parsers than <xref
93	linkend="item-default-backend"/>).</para>
94      </listitem>
95
96      <listitem>
97        <para>Haskell with GHC
98	<indexterm><primary>GHC</primary></indexterm>
99	<indexterm><primary>back-ends</primary><secondary>GHC</secondary></indexterm>
100	(Glasgow Haskell) extensions. This is a
101	slightly faster option than <xref
102	linkend="item-default-backend"/> for Glasgow Haskell
103	users.</para>
104      </listitem>
105
106
107      <listitem>
108	<para>GHC Haskell with string-encoded arrays.  This is the
109	fastest/smallest option for GHC users.  If you're using GHC,
110	the optimum flag settings are <literal>-agc</literal> (see
111	<xref linkend="sec-invoking"/>).</para>
112      </listitem>
113
114    </orderedlist>
115
116    <para>Happy can also generate parsers which will dump debugging
117    information at run time, showing state transitions and the input
118    tokens to the parser.</para>
119
120    <sect1 id="sec-compatibility">
121      <title>Compatibility</title>
122
123      <para> <application>Happy</application> is written in Glasgow Haskell.  This
124      means that (for the time being), you need GHC to compile it.
125      Any version of GHC >= 6.2 should work.</para>
126
127      <para> Remember: parsers produced using
128      <application>Happy</application> should compile without
129      difficulty under any Haskell 98 compiler or interpreter.<footnote><para>With one
130	exception: if you have a production with a polymorphic type signature,
131	then a compiler that supports local universal quantification is
132	required.  See <xref linkend="sec-type-signatures" />.</para>
133	</footnote></para>
134    </sect1>
135
136    <sect1 id="sec-reporting-bugs">
137      <title>Reporting Bugs</title>
138
139      <indexterm>
140	<primary>bugs, reporting</primary>
141      </indexterm>
142
143      <para> Any bugs found in <application>Happy</application> should
144      be reported to me: Simon Marlow
      <email>marlowsd@gmail.com</email>, including all the relevant
146      information: the compiler used to compile
147      <application>Happy</application>, the command-line options used,
148      your grammar file or preferably a cut-down example showing the
149      problem, and a description of what goes wrong.  A patch to fix
150      the problem would also be greatly appreciated. </para>
151
152      <para> Requests for new features should also be sent to the
153      above address, especially if accompanied by patches :-).</para>
154
155    </sect1>
156
157    <sect1 id="sec-license">
158      <title>License</title>
159
160      <indexterm>
161	<primary>License</primary>
162      </indexterm>
163
164      <para> Previous versions of <application>Happy</application>
      were covered by the GNU General Public License.  We're now
166      distributing <application>Happy</application> with a less
167      restrictive BSD-style license.  If this license doesn't work for
168      you, please get in touch.</para>
169
170      <blockquote>
171	<para> Copyright 2009, Simon Marlow and Andy Gill.  All rights
172	reserved. </para>
173
174	<para> Redistribution and use in source and binary forms, with
175	or without modification, are permitted provided that the
176	following conditions are met: </para>
177
178	<itemizedlist>
179	  <listitem>
180	    <para>Redistributions of source code must retain the above
181            copyright notice, this list of conditions and the
182            following disclaimer.</para>
183	  </listitem>
184
185	  <listitem>
186	    <para> Redistributions in binary form must reproduce the
187            above copyright notice, this list of conditions and the
188            following disclaimer in the documentation and/or other
189            materials provided with the distribution.</para>
190	  </listitem>
191	</itemizedlist>
192
193	<para>THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS "AS
194        IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
195        LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
196        FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
197        SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY DIRECT,
198        INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
199        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
200        SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
201        OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
202        LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
203        (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
204        THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
205        OF SUCH DAMAGE.</para>
206      </blockquote>
207    </sect1>
208
209    <sect1 id="sec-obtaining">
210      <title>Obtaining <application>Happy</application></title>
211
212      <para> <application>Happy</application>'s web page can be found at <ulink
213      url="http://www.haskell.org/happy/">http://www.haskell.org/happy/</ulink>.
214      <application>Happy</application> source and binaries can be downloaded from
215      there.</para>
216
217    </sect1>
218
219  </chapter>
220
221<!-- Using Happy =========================================================== -->
222
223  <chapter id="sec-using">
224    <title>Using <application>Happy</application></title>
225
226  <para> Users of <application>Yacc</application> will find
227  <application>Happy</application> quite familiar.  The basic idea is
228  as follows: </para>
229
230  <itemizedlist>
231    <listitem>
232      <para>Define the grammar you want to parse in a
233      <application>Happy</application> grammar file. </para>
234    </listitem>
235
236    <listitem>
237      <para> Run the grammar through <application>Happy</application>, to generate
238      a compilable Haskell module.</para>
239    </listitem>
240
241    <listitem>
242      <para> Use this module as part of your Haskell program, usually
243      in conjunction with a lexical analyser (a function that splits
244      the input into <quote>tokens</quote>, the basic unit of parsing).</para>
245    </listitem>
246  </itemizedlist>
247
248  <para> Let's run through an example.  We'll implement a parser for a
249  simple expression syntax, consisting of integers, variables, the
250  operators <literal>+</literal>, <literal>-</literal>, <literal>*</literal>,
251  <literal>/</literal>, and the form <literal>let var = exp in exp</literal>.
252  The grammar file starts off like this:</para>
253
254<programlisting>
255{
256module Main where
257}
258</programlisting>
259
260    <para>At the top of the file is an optional <firstterm>module
261    header</firstterm>,
262      <indexterm>
263	<primary>module</primary>
264	<secondary>header</secondary>
265      </indexterm>
266    which is just a Haskell module header enclosed in braces.  This
267    code is emitted verbatim into the generated module, so you can put
268    any Haskell code here at all.  In a grammar file, Haskell code is
269    always contained between curly braces to distinguish it from the
270    grammar.</para>
271
272    <para>In this case, the parser will be a standalone program so
273    we'll call the module <literal>Main</literal>.</para>
274
    <para>Next come a couple of declarations:</para>
276
277<programlisting>
278%name calc
279%tokentype { Token }
280%error { parseError }
281</programlisting>
282
283    <indexterm>
284      <primary><literal>%name</literal></primary>
285    </indexterm>
286    <indexterm>
287      <primary><literal>%tokentype</literal></primary>
288    </indexterm>
289    <indexterm>
290      <primary><literal>%error</literal></primary>
291    </indexterm>
292
293    <para>The first line declares the name of the parsing function
294    that <application>Happy</application> will generate, in this case
295    <literal>calc</literal>.  In many cases, this is the only symbol you need
296    to export from the module.</para>
297
298    <para>The second line declares the type of tokens that the parser
299    will accept.  The parser (i.e. the function
300    <function>calc</function>) will be of type <literal>[Token] ->
301    T</literal>, where <literal>T</literal> is the return type of the
302    parser, determined by the production rules below.</para>
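
    <para>For this example, <literal>T</literal> will turn out to be
    the <literal>Exp</literal> datatype defined later in the file, so
    the generated parser will have, in effect, this signature (a
    consequence of the grammar rules, not something you write
    yourself):</para>

<programlisting>
calc :: [Token] -> Exp
</programlisting>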
303
304    <para>The <literal>%error</literal> directive tells Happy the name
305    of a function it should call in the event of a parse error.  More
306    about this later.</para>
307
308    <para>Now we declare all the possible tokens:</para>
309
310<programlisting>
311%token
312      let             { TokenLet }
313      in              { TokenIn }
314      int             { TokenInt $$ }
315      var             { TokenVar $$ }
316      '='             { TokenEq }
317      '+'             { TokenPlus }
318      '-'             { TokenMinus }
319      '*'             { TokenTimes }
320      '/'             { TokenDiv }
321      '('             { TokenOB }
322      ')'             { TokenCB }
323</programlisting>
324
325    <indexterm>
326      <primary><literal>%token</literal></primary>
327    </indexterm>
328
329    <para>The symbols on the left are the tokens as they will be
330    referred to in the rest of the grammar, and to the right of each
331    token enclosed in braces is a Haskell pattern that matches the
332    token.  The parser will expect to receive a stream of tokens, each
333    of which will match one of the given patterns (the definition of
334    the <literal>Token</literal> datatype is given later).</para>
335
336    <para>The <literal>&dollar;&dollar;</literal> symbol is a placeholder that
337    represents the <emphasis>value</emphasis> of this token.  Normally the value
338    of a token is the token itself, but by using the
339    <literal>&dollar;&dollar;</literal> symbol you can specify some component
340    of the token object to be the value. </para>
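
    <para>As a rough illustration (this is just the intuition, not
    <application>Happy</application>'s actual mechanism), matching
    <literal>int</literal> against <literal>TokenInt &dollar;&dollar;</literal>
    makes the value of the token the <literal>Int</literal> stored
    inside it, much as this projection function would:</para>

<programlisting>
data Token = TokenInt Int | TokenVar String
  deriving Show

-- What the placeholder selects for the 'int' token: the Int
-- component, rather than the whole Token.
intValue :: Token -> Int
intValue (TokenInt n) = n
intValue t            = error ("not an int token: " ++ show t)
</programlisting>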
341
342    <indexterm>
343      <primary><literal>&dollar;&dollar;</literal></primary>
344    </indexterm>
345
346    <para>Like yacc, we include <literal>%%</literal> here, for no real
347    reason.</para>
348
349<programlisting>
350%%
351</programlisting>
352
353    <para>Now we have the production rules for the grammar.</para>
354
355<programlisting>
356Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
357      | Exp1                    { Exp1 $1 }
358
359Exp1  : Exp1 '+' Term           { Plus $1 $3 }
360      | Exp1 '-' Term           { Minus $1 $3 }
361      | Term                    { Term $1 }
362
363Term  : Term '*' Factor         { Times $1 $3 }
364      | Term '/' Factor         { Div $1 $3 }
365      | Factor                  { Factor $1 }
366
367Factor
368      : int                     { Int $1 }
369      | var                     { Var $1 }
370      | '(' Exp ')'             { Brack $2 }
371</programlisting>
372
373    <indexterm>
374      <primary>non-terminal</primary>
375    </indexterm>
376    <para>Each production consists of a <firstterm>non-terminal</firstterm>
377    symbol on the left, followed by a colon, followed by one or more
378    expansions on the right, separated by <literal>|</literal>.  Each expansion
379    has some Haskell code associated with it, enclosed in braces as
380    usual.</para>
381
382    <para>The way to think about a parser is with each symbol having a
383    <quote>value</quote>: we defined the values of the tokens above, and the
384    grammar defines the values of non-terminal symbols in terms of
385    sequences of other symbols (either tokens or non-terminals).  In a
386    production like this:</para>
387
388<programlisting>
389n   : t_1 ... t_n   { E }
390</programlisting>
391
392    <para>whenever the parser finds the symbols <literal>t_1...t_n</literal> in
393    the token stream, it constructs the symbol <literal>n</literal> and gives
394    it the value <literal>E</literal>, which may refer to the values of
395    <literal>t_1...t_n</literal> using the symbols
396    <literal>&dollar;1...&dollar;n</literal>.</para>
397
398    <para>The parser reduces the input using the rules in the grammar
399    until just one symbol remains: the first symbol defined in the
400    grammar (namely <literal>Exp</literal> in our example).  The value of this
401    symbol is the return value from the parser.</para>
402
403    <para>To complete the program, we need some extra code.  The
404    grammar file may optionally contain a final code section, enclosed
405    in curly braces.</para>
406
407<programlisting>{</programlisting>
408
409    <para>All parsers must include a function to be called in the
410    event of a parse error.  In the <literal>%error</literal>
411    directive earlier, we specified that the function to be called on
412    a parse error is <literal>parseError</literal>:</para>
413
414<programlisting>
415parseError :: [Token] -> a
416parseError _ = error "Parse error"
417</programlisting>
418
419    <para>Note that <literal>parseError</literal> must be polymorphic
420    in its return type <literal>a</literal>, which usually means it
421    must be a call to <literal>error</literal>.  We'll see in <xref
422    linkend="sec-monads"/> how to wrap the parser in a monad so that we
423    can do something more sensible with errors.  It's also possible to
424    keep track of line numbers in the parser for use in error
    messages; this is described in <xref
    linkend="sec-line-numbers"/>.</para>
427
428    <para>Next we can declare the data type that represents the parsed
429    expression:</para>
430
431<programlisting>
432data Exp
433      = Let String Exp Exp
434      | Exp1 Exp1
435      deriving Show
436
437data Exp1
438      = Plus Exp1 Term
439      | Minus Exp1 Term
440      | Term Term
441      deriving Show
442
443data Term
444      = Times Term Factor
445      | Div Term Factor
446      | Factor Factor
447      deriving Show
448
449data Factor
450      = Int Int
451      | Var String
452      | Brack Exp
453      deriving Show
454</programlisting>
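
    <para>For example, given these declarations, the input
    <literal>1 + 2 * 3</literal> would be parsed as the
    value:</para>

<programlisting>
Exp1 (Plus (Term (Factor (Int 1)))
           (Times (Factor (Int 2)) (Int 3)))
</programlisting>

    <para>Note how the grammar's split into <literal>Exp1</literal>,
    <literal>Term</literal> and <literal>Factor</literal> makes
    <literal>*</literal> bind more tightly than
    <literal>+</literal>.</para>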
455
456    <para>And the data structure for the tokens...</para>
457
458<programlisting>
459data Token
460      = TokenLet
461      | TokenIn
462      | TokenInt Int
463      | TokenVar String
464      | TokenEq
465      | TokenPlus
466      | TokenMinus
467      | TokenTimes
468      | TokenDiv
469      | TokenOB
470      | TokenCB
      deriving Show
472</programlisting>
473
474    <para>... and a simple lexer that returns this data
475    structure.</para>
476
477<programlisting>
478lexer :: String -> [Token]
479lexer [] = []
480lexer (c:cs)
481      | isSpace c = lexer cs
482      | isAlpha c = lexVar (c:cs)
483      | isDigit c = lexNum (c:cs)
484lexer ('=':cs) = TokenEq : lexer cs
485lexer ('+':cs) = TokenPlus : lexer cs
486lexer ('-':cs) = TokenMinus : lexer cs
487lexer ('*':cs) = TokenTimes : lexer cs
488lexer ('/':cs) = TokenDiv : lexer cs
489lexer ('(':cs) = TokenOB : lexer cs
490lexer (')':cs) = TokenCB : lexer cs
491
492lexNum cs = TokenInt (read num) : lexer rest
493      where (num,rest) = span isDigit cs
494
495lexVar cs =
496   case span isAlpha cs of
497      ("let",rest) -> TokenLet : lexer rest
498      ("in",rest)  -> TokenIn : lexer rest
499      (var,rest)   -> TokenVar var : lexer rest
500</programlisting>
501
502    <para>And finally a top-level function to take some input, parse
503    it, and print out the result.</para>
504
505<programlisting>
506main = getContents >>= print . calc . lexer
507}
508</programlisting>
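
    <para>Assuming <command>happy</command> and a Haskell
    implementation are installed, a session with the finished program
    might look something like this (the exact commands here are our
    example, not prescribed):</para>

<programlisting>
$ happy example.y
$ echo "let x = 1 in x" | runghc example.hs
Let "x" (Exp1 (Term (Factor (Int 1)))) (Exp1 (Term (Factor (Var "x"))))
</programlisting>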
509
510    <para>And that's it! A whole lexer, parser and grammar in a few
511    dozen lines.  Another good example is <application>Happy</application>'s own
512    parser. Several features in <application>Happy</application> were developed
513    using this as an example.</para>
514
515    <indexterm>
516      <primary>info file</primary>
517    </indexterm>
518
519    <para>To generate the Haskell module for this parser, type the
520    command <command>happy example.y</command> (where
521    <filename>example.y</filename> is the name of the grammar file).
522    The Haskell module will be placed in a file named
523    <filename>example.hs</filename>.  Additionally, invoking the
524    command <command>happy example.y -i</command> will produce the
525    file <filename>example.info</filename> which contains detailed information
526    about the parser, including states and reduction rules (see <xref
527    linkend="sec-info-files"/>).  This can be invaluable for debugging
528    parsers, but requires some knowledge of the operation of a
529    shift-reduce parser. </para>
530
531    <sect1 id="sec-other-datatypes">
532      <title>Returning other datatypes</title>
533
534      <para>In the above example, we used a data type to represent the
535      syntax being parsed.  However, there's no reason why it has to
536      be this way: you could calculate the value of the expression on
537      the fly, using productions like this:</para>
538
539<programlisting>
540Term  : Term '*' Factor         { $1 * $3 }
541      | Term '/' Factor         { $1 / $3 }
542      | Factor                  { $1 }
543</programlisting>
544
545      <para>The value of a <literal>Term</literal> would be the value of the
546      expression itself, and the parser could return an integer.  </para>
547
548      <para>This works for simple expression types, but our grammar
549      includes variables and the <literal>let</literal> syntax.  How do we know
550      the value of a variable while we're parsing it?  We don't, but
551      since the Haskell code for a production can be anything at all,
552      we could make it a function that takes an environment of
553      variable values, and returns the computed value of the
554      expression:</para>
555
556<programlisting>
557Exp   : let var '=' Exp in Exp  { \p -> $6 (($2,$4 p):p) }
558      | Exp1                    { $1 }
559
560Exp1  : Exp1 '+' Term           { \p -> $1 p + $3 p }
561      | Exp1 '-' Term           { \p -> $1 p - $3 p }
562      | Term                    { $1 }
563
564Term  : Term '*' Factor         { \p -> $1 p * $3 p }
565      | Term '/' Factor         { \p -> $1 p `div` $3 p }
566      | Factor                  { $1 }
567
568Factor
569      : int                     { \p -> $1 }
570      | var                     { \p -> case lookup $1 p of
571	                                    Nothing -> error "no var"
572					    Just i  -> i }
573      | '(' Exp ')'             { $2 }
574</programlisting>
575
576      <para>The value of each production is a function from an
577      environment <emphasis>p</emphasis> to a value.  When parsing a
578      <literal>let</literal> construct, we extend the environment with the new
579      binding to find the value of the body, and the rule for
580      <literal>var</literal> looks up its value in the environment.  There's
581      something you can't do in <literal>yacc</literal> :-)</para>
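
      <para>The idea can be seen in ordinary Haskell, independently
      of <application>Happy</application> (a sketch; the names here
      are ours, not part of the generated parser):</para>

<programlisting>
type Env = [(String, Int)]

-- The value of every production is a function from Env to Int.
-- 'let v = e1 in e2' extends the environment before evaluating
-- the body, just like the grammar action above.
letIn :: String -> (Env -> Int) -> (Env -> Int) -> (Env -> Int)
letIn v e1 e2 = \p -> e2 ((v, e1 p) : p)

var :: String -> (Env -> Int)
var v = \p -> case lookup v p of
                Nothing -> error "no var"
                Just i  -> i

lit :: Int -> (Env -> Int)
lit n = \_ -> n

-- let x = 4 in x * x
example :: Int
example = letIn "x" (lit 4) (\p -> var "x" p * var "x" p) []
</programlisting>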
582
583    </sect1>
584
585    <sect1 id="sec-sequences">
586      <title>Parsing sequences</title>
587
588      <para>A common feature in grammars is a <emphasis>sequence</emphasis> of a
589      particular syntactic element.  In EBNF, we'd write something
590      like <literal>n+</literal> to represent a sequence of one or more
591      <literal>n</literal>s, and <literal>n*</literal> for zero or more.
592      <application>Happy</application> doesn't support this syntax explicitly, but
593      you can define the equivalent sequences using simple
594      productions.</para>
595
596      <para>For example, the grammar for <application>Happy</application> itself
597      contains a rule like this:</para>
598
599<programlisting>
600prods : prod                   { [$1] }
601      | prods prod             { $2 : $1 }
602</programlisting>
603
604      <para>In other words, a sequence of productions is either a
605      single production, or a sequence of productions followed by a
606      single production.  This recursive rule defines a sequence of
607      one or more productions.</para>
608
609      <para>One thing to note about this rule is that we used
      <emphasis>left recursion</emphasis> to define it; we could have written
611      it like this:</para>
612
613      <indexterm>
614	<primary>recursion, left vs. right</primary>
615      </indexterm>
616
617<programlisting>
618prods : prod                  { [$1] }
619      | prod prods            { $1 : $2 }
620</programlisting>
621
622      <para>The only reason we used left recursion is that
623      <application>Happy</application> is more efficient at parsing left-recursive
624      rules; they result in a constant stack-space parser, whereas
625      right-recursive rules require stack space proportional to the
626      length of the list being parsed.  This can be extremely
627      important where long sequences are involved, for instance in
628      automatically generated output.  For example, the parser in GHC
629      used to use right-recursion to parse lists, and as a result it
630      failed to parse some <application>Happy</application>-generated modules due
631      to running out of stack space!</para>
632
633      <para>One implication of using left recursion is that the resulting
634      list comes out reversed, and you have to reverse it again to get
635      it in the original order.  Take a look at the
636      <application>Happy</application> grammar for Haskell for many examples of
637      this.</para>
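
      <para>The reversal is the same behaviour you get from a left
      fold over an ordinary list; this (hypothetical) helper mimics
      what the left-recursive rule builds:</para>

<programlisting>
-- A left-recursive rule like
--   prods : prod       { [$1] }
--         | prods prod { $2 : $1 }
-- conses each new element onto the front of the list built so
-- far, so the final result comes out reversed.
buildLeftRecursive :: [a] -> [a]
buildLeftRecursive = foldl (flip (:)) []
</programlisting>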
638
639      <para>Parsing sequences of zero or more elements requires a
640      trivial change to the above pattern:</para>
641
642<programlisting>
643prods : {- empty -}           { [] }
644      | prods prod            { $2 : $1 }
645</programlisting>
646
      <para>Yes, empty productions are allowed.  The normal
648      convention is to include the comment <literal>{- empty -}</literal> to
649      make it more obvious to a reader of the code what's going
650      on.</para>
651
652      <sect2 id="sec-separators">
653	<title>Sequences with separators</title>
654
655	<para>A common type of sequence is one with a
656        <emphasis>separator</emphasis>: for instance function bodies in C
657        consist of statements separated by semicolons.  To parse this
658        kind of sequence we use a production like this:</para>
659
660<programlisting>
661stmts : stmt                   { [$1] }
662      | stmts ';' stmt         { $3 : $1 }
663</programlisting>
664
665	<para>If the <literal>;</literal> is to be a <emphasis>terminator</emphasis>
666        rather than a separator (i.e. there should be one following
667        each statement), we can remove the semicolon from the above
668        rule and redefine <literal>stmt</literal> as</para>
669
670<programlisting>
671stmt : stmt1 ';'              { $1 }
672</programlisting>
673
674	<para>where <literal>stmt1</literal> is the real definition of statements.</para>
675
676        <para>We might like to allow extra semicolons between
677        statements, to be a bit more liberal in what we allow as legal
678        syntax.  We probably just want the parser to ignore these
        extra semicolons, and not generate a <quote>null statement</quote> value
680        or something.  The following rule parses a sequence of zero or
681        more statements separated by semicolons, in which the
682        statements may be empty:</para>
683
684<programlisting>
685stmts : stmts ';' stmt          { $3 : $1 }
686      | stmts ';'               { $1 }
687      | stmt			{ [$1] }
688      | {- empty -}		{ [] }
689</programlisting>
690
691	<para>Parsing sequences of <emphasis>one</emphasis> or more possibly
692	null statements is left as an exercise for the reader...</para>
693
694    </sect2>
695    </sect1>
696
697<!--
698    <sect1 id="sec-ambiguities">
699      <title>Ambiguities</title>
700
701      <para>(section under construction)</para>
702
703    </sect1>
704-->
705
706    <sect1 id="sec-Precedences">
707      <title>Using Precedences</title>
708      <indexterm><primary>precedences</primary></indexterm>
709      <indexterm><primary>associativity</primary></indexterm>
710
711      <para>Going back to our earlier expression-parsing example,
712      wouldn't it be nicer if we didn't have to explicitly separate
713      the expressions into terms and factors, merely to make it
714      clear that <literal>'*'</literal> and <literal>'/'</literal>
715      operators bind more tightly than <literal>'+'</literal> and
716      <literal>'-'</literal>?</para>
717
718      <para>We could just change the grammar as follows (making the
719      appropriate changes to the expression datatype too):</para>
720
721<programlisting>
722Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
723      | Exp '+' Exp             { Plus $1 $3 }
724      | Exp '-' Exp             { Minus $1 $3 }
725      | Exp '*' Exp             { Times $1 $3 }
726      | Exp '/' Exp             { Div $1 $3 }
727      | '(' Exp ')'             { Brack $2 }
728      | int                     { Int $1 }
729      | var                     { Var $1 }
730</programlisting>
731
732      <para>but now Happy will complain that there are shift/reduce
      conflicts because the grammar is ambiguous: we haven't
734      specified whether e.g. <literal>1 + 2 * 3</literal> is to be
735      parsed as <literal>1 + (2 * 3)</literal> or <literal>(1 + 2) *
736      3</literal>.  Happy allows these ambiguities to be resolved by
737      specifying the <firstterm>precedences</firstterm> of the
738      operators involved using directives in the
      header<footnote><para>Users of <literal>yacc</literal> will find
      this familiar; Happy's precedence scheme works in exactly the
      same way.</para></footnote>:</para>
742
743<programlisting>
744...
745%right in
746%left '+' '-'
747%left '*' '/'
748%%
749...
750</programlisting>
751<indexterm><primary><literal>%left</literal> directive</primary></indexterm>
752<indexterm><primary><literal>%right</literal> directive</primary></indexterm>
753<indexterm><primary><literal>%nonassoc</literal> directive</primary></indexterm>
754
755      <para>The <literal>%left</literal> or <literal>%right</literal>
756      directive is followed by a list of terminals, and declares all
757      these tokens to be left or right-associative respectively.  The
758      precedence of these tokens with respect to other tokens is
759      established by the order of the <literal>%left</literal> and
760      <literal>%right</literal> directives: earlier means lower
761      precedence.  A higher precedence causes an operator to bind more
762      tightly; in our example above, because <literal>'*'</literal>
763      has a higher precedence than <literal>'+'</literal>, the
764      expression <literal>1 + 2 * 3</literal> will parse as <literal>1
765      + (2 * 3)</literal>.</para>
766
767      <para>What happens when two operators have the same precedence?
768      This is when the <firstterm>associativity</firstterm> comes into
769      play.  Operators specified as left associative will cause
770      expressions like <literal>1 + 2 - 3</literal> to parse as
771      <literal>(1 + 2) - 3</literal>, whereas right-associative
772      operators would parse as <literal>1 + (2 - 3)</literal>.  There
773      is also a <literal>%nonassoc</literal> directive which indicates
774      that the specified operators may not be used together.  For
775      example, if we add the comparison operators
776      <literal>'>'</literal> and <literal>'&lt;'</literal> to our
777      grammar, then we would probably give their precedence as:</para>
778
779<programlisting>...
780%right in
781%nonassoc '>' '&lt;'
782%left '+' '-'
783%left '*' '/'
784%%
785...</programlisting>
786
787      <para>which indicates that <literal>'>'</literal> and
788      <literal>'&lt;'</literal> bind less tightly than the other
789      operators, and the non-associativity causes expressions such as
790      <literal>1 > 2 > 3</literal> to be disallowed.</para>
791
792      <sect2 id="how-precedence-works">
793	<title>How precedence works</title>
794
795	<para>The precedence directives, <literal>%left</literal>,
796	<literal>%right</literal> and <literal>%nonassoc</literal>,
797	assign precedence levels to the tokens in the declaration.  A
798	rule in the grammar may also have a precedence: if the last
799	terminal in the right hand side of the rule has a precedence,
800	then this is the precedence of the whole rule.</para>
801
802	<para>The precedences are used to resolve ambiguities in the
803	grammar.  If there is a shift/reduce conflict, then the
804	precedence of the rule and the lookahead token are examined in
805	order to resolve the conflict:</para>
806
807	<itemizedlist>
808	  <listitem>
809	    <para>If the precedence of the rule is higher, then the
810	    conflict is resolved as a reduce.</para>
811	  </listitem>
812	  <listitem>
813	    <para>If the precedence of the lookahead token is higher,
814	    then the conflict is resolved as a shift.</para>
815	  </listitem>
816	  <listitem>
817	    <para>If the precedences are equal, then</para>
818	    <itemizedlist>
819		<listitem>
820		<para>If the token is left-associative, then reduce</para>
821	      </listitem>
822	      <listitem>
823		<para>If the token is right-associative, then shift</para>
824	      </listitem>
825	      <listitem>
826		<para>If the token is non-associative, then fail</para>
827	      </listitem>
828	    </itemizedlist>
829	  </listitem>
830	  <listitem>
831	    <para>If either the rule or the token has no precedence,
832	    then the default is to shift (these conflicts are reported
833	    by Happy, whereas ones that are automatically resolved by
834	    the precedence rules are not).</para>
835	  </listitem>
836	</itemizedlist>
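	<para>As a sketch of how these rules apply, consider the earlier
	expression grammar, where <literal>%left '+'</literal> is declared
	before <literal>%left '*'</literal> (so <literal>'*'</literal> has
	the higher precedence), and the rule
	<literal>Exp : Exp '+' Exp</literal> has the precedence of its last
	terminal, <literal>'+'</literal>:</para>

<programlisting>stack: Exp '+' Exp    lookahead: '*'   -- token higher:        shift  => 1 + (2 * 3)
stack: Exp '+' Exp    lookahead: '+'   -- equal, left-assoc:   reduce => (1 + 2) + 3</programlisting>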
837      </sect2>
838
839      <sect2 id="context-precedence">
840	<title>Context-dependent Precedence</title>
841
	<para>The precedence of an individual rule can be overridden,
843	using <firstterm>context precedence</firstterm>.  This is
844	useful when, for example, a particular token has a different
845	precedence depending on the context.  A common example is the
846	minus sign: it has high precedence when used as prefix
847	negation, but a lower precedence when used as binary
848	subtraction.</para>
849
850	<para>We can implement this in Happy as follows:</para>
851
852<programlisting>%right in
853%nonassoc '>' '&lt;'
854%left '+' '-'
855%left '*' '/'
856%left NEG
857%%
858
859Exp   : let var '=' Exp in Exp  { Let $2 $4 $6 }
860      | Exp '+' Exp             { Plus $1 $3 }
861      | Exp '-' Exp             { Minus $1 $3 }
862      | Exp '*' Exp             { Times $1 $3 }
863      | Exp '/' Exp             { Div $1 $3 }
864      | '(' Exp ')'             { Brack $2 }
865      | '-' Exp %prec NEG       { Negate $2 }
866      | int                     { Int $1 }
867      | var                     { Var $1 }</programlisting>
868<indexterm><primary><literal>%prec</literal> directive</primary></indexterm>
869
870	<para>We invent a new token <literal>NEG</literal> as a
871	placeholder for the precedence of our prefix negation rule.
872	The <literal>NEG</literal> token doesn't need to appear in
873	a <literal>%token</literal> directive.  The prefix negation
874	rule has a <literal>%prec NEG</literal> directive attached,
875	which overrides the default precedence for the rule (which
876	would normally be the precedence of '-') with the precedence
877	of <literal>NEG</literal>.</para>
878      </sect2>
879
880      <sect2 id="shift-directive">
881        <title>The %shift directive for lowest precedence rules</title>
882        <para>
883          Rules annotated with the <literal>%shift</literal> directive
884          have the lowest possible precedence and are non-associative.
885          A shift/reduce conflict that involves such a rule is resolved as a shift.
886
887          One can think of <literal>%shift</literal> as
888          <literal>%prec SHIFT</literal> such that <literal>SHIFT</literal>
889          has lower precedence than any other token.
890        </para>
891        <para>
          This is useful in conjunction with
          <literal>%expect 0</literal>: annotating a rule with
          <literal>%shift</literal> both points out explicitly that the
          rule is involved in a conflict, and resolves that conflict.
        </para>
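        <para>For example, in a hypothetical grammar where a
        <literal>let</literal>-expression should extend as far to the
        right as possible, the conflict between reducing the
        <literal>let</literal> rule and shifting an operator can be
        resolved in favour of shifting by annotating the rule with
        <literal>%shift</literal>:</para>

<programlisting>Exp   : let var '=' Exp in Exp %shift   { Let $2 $4 $6 }
      | Exp '+' Exp                     { Plus $1 $3 }</programlisting>

        <para>With this annotation, <literal>let x = 1 in y + z</literal>
        parses as <literal>let x = 1 in (y + z)</literal>.</para>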
896      </sect2>
897
898    </sect1>
899
900    <sect1 id="sec-type-signatures">
901      <title>Type Signatures</title>
902
903      <indexterm>
904	<primary>type</primary>
905	<secondary>signatures in grammar</secondary>
906      </indexterm>
907
908      <para><application>Happy</application> allows you to include type signatures
909      in the grammar file itself, to indicate the type of each
910      production.  This has several benefits:</para>
911
912      <itemizedlist>
913	<listitem>
914	  <para> Documentation: including types in the grammar helps
915          to document the grammar for someone else (and indeed
916          yourself) reading the code.</para>
917	</listitem>
918
919	<listitem>
920	  <para> Fixing type errors in the generated module can become
921          slightly easier if <application>Happy</application> has inserted type
922          signatures for you.  This is a slightly dubious benefit,
923          since type errors in the generated module are still somewhat
924          difficult to find.  </para>
925	</listitem>
926
927	<listitem>
928	  <para> Type signatures generally help the Haskell compiler
929          to compile the parser faster.  This is important when really
930          large grammar files are being used.</para>
931	</listitem>
932      </itemizedlist>
933
934      <para>The syntax for type signatures in the grammar file is as
935      follows:</para>
936
937<programlisting>
938stmts   :: { [ Stmt ] }
939stmts   : stmts stmt                { $2 : $1 }
940	| stmt                      { [$1] }
941</programlisting>
942
943      <para>In fact, you can leave out the superfluous occurrence of
944      <literal>stmts</literal>:</para>
945
946<programlisting>
947stmts   :: { [ Stmt ] }
948	: stmts stmt                { $2 : $1 }
949	| stmt                      { [$1] }
950</programlisting>
951
952      <para>Note that currently, you have to include type signatures
953      for <emphasis>all</emphasis> the productions in the grammar to benefit
954      from the second and third points above.  This is due to boring
955      technical reasons, but it is hoped that this restriction can be
956      removed in the future.</para>
957
958      <para>It is possible to have productions with polymorphic or overloaded
959	types.  However, because the type of each production becomes the
960	argument type of a constructor in an algebraic datatype in the
961	generated source file, compiling the generated file requires a compiler
962	that supports local universal quantification.  GHC (with the
963	<option>-fglasgow-exts</option> option) and Hugs are known to support
964	this.</para>
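      <para>For example, a production returning an empty list can be
      given a polymorphic type (this fragment is hypothetical, not part
      of the grammar used above):</para>

<programlisting>empty_list :: { [a] }
           : {- empty -}        { [] }</programlisting>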
965    </sect1>
966
967    <sect1 id="sec-monads">
968      <title>Monadic Parsers</title>
969
970      <indexterm>
971	<primary>monadic</primary>
972	<secondary>parsers</secondary>
973      </indexterm>
974
975      <para><application>Happy</application> has support for threading a monad
976      through the generated parser.  This might be useful for several
977      reasons:</para>
978
979      <itemizedlist>
980
981	<listitem>
982          <para> Handling parse errors
983	  <indexterm>
984	    <primary>parse errors</primary>
985	    <secondary>handling</secondary>
986	  </indexterm>
987<!--	  <indexterm>
988	    <primary>error</primary>
989	    <secondary>parse</secondary>
990	    <see>parse errors</see>
991	  </indexterm>
992-->
993	  by using an exception monad
994          (see <xref linkend="sec-exception"/>).</para>
995	</listitem>
996
997	<listitem>
998          <para> Keeping track of line numbers
999	  <indexterm>
1000	    <primary>line numbers</primary>
1001	  </indexterm>
1002	  in the input file, for
1003          example for use in error messages (see <xref
1004          linkend="sec-line-numbers"/>).</para>
1005	</listitem>
1006
1007	<listitem>
1008	  <para> Performing IO operations during parsing.</para>
1009	</listitem>
1010
1011	<listitem>
	  <para> Parsing languages with context-dependencies (such as
          C) requires some state in the parser.</para>
1014	</listitem>
1015
1016</itemizedlist>
1017
1018      <para>Adding monadic support to your parser couldn't be simpler.
1019      Just add the following directive to the declaration section of
1020      the grammar file:</para>
1021
1022<programlisting>
1023%monad { &lt;type&gt; } [ { &lt;then&gt; } { &lt;return&gt; } ]
1024</programlisting>
1025
1026      <indexterm>
1027	<primary><literal>%monad</literal></primary>
1028      </indexterm>
1029
1030      <para>where <literal>&lt;type&gt;</literal> is the type constructor for
1031      the monad, <literal>&lt;then&gt;</literal> is the bind operation of the
1032      monad, and <literal>&lt;return&gt;</literal> is the return operation. If
1033      you leave out the names for the bind and return operations,
1034      <application>Happy</application> assumes that <literal>&lt;type&gt;</literal> is an
1035      instance of the standard Haskell type class <literal>Monad</literal> and
1036      uses the overloaded names for the bind and return
1037      operations.</para>
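      <para>For example, since <literal>IO</literal> is an instance of
      <literal>Monad</literal>, a grammar that performs IO during
      parsing could declare simply:</para>

<programlisting>%monad { IO }</programlisting>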
1038
1039      <para>When this declaration is included in the grammar,
1040      <application>Happy</application> makes a couple of changes to the generated
      parser: the types of the main parser function and
      <literal>parseError</literal> (the function named in
      <literal>%error</literal>) become <literal>[Token] -&gt; P a</literal> where
      <literal>P</literal> is the monad type constructor, and
      <literal>parseError</literal> must be polymorphic in
      <literal>a</literal>.  In other words,
1046      <application>Happy</application> adds an application of the
1047      <literal>&lt;return&gt;</literal> operation defined in the declaration
1048      above, around the result of the parser (<literal>parseError</literal> is
1049      affected because it must have the same return type as the
1050      parser).  And that's all it does.</para>
1051
1052      <para>This still isn't very useful: all you can do is return
1053      something of monadic type from <literal>parseError</literal>.  How do you
1054      specify that the productions can also have type <literal>P a</literal>?
1055      Most of the time, you don't want a production to have this type:
1056      you'd have to write explicit <literal>returnP</literal>s everywhere.
1057      However, there may be a few rules in a grammar that need to get
1058      at the monad, so <application>Happy</application> has a special syntax for
1059      monadic actions:</para>
1060
1061<programlisting>
1062n  :  t_1 ... t_n          {% &lt;expr&gt; }
1063</programlisting>
1064
1065      <indexterm>
1066	<primary>monadic</primary>
1067	<secondary>actions</secondary>
1068      </indexterm>
1069      <para>The <literal>%</literal> in the action indicates that this is a
1070      monadic action, with type <literal>P a</literal>, where <literal>a</literal> is
1071      the real return type of the production.  When
1072      <application>Happy</application> reduces one of these rules, it evaluates the
1073      expression </para>
1074
1075<programlisting>
1076&lt;expr&gt; `then` \result -> &lt;continue parsing&gt;
1077</programlisting>
1078
1079      <para><application>Happy</application> uses <literal>result</literal> as the real
1080      semantic value of the production.  During parsing, several
1081      monadic actions might be reduced, resulting in a sequence
1082      like</para>
1083
1084<programlisting>
1085&lt;expr1&gt; `then` \r1 ->
1086&lt;expr2&gt; `then` \r2 ->
1087...
1088return &lt;expr3&gt;
1089</programlisting>
1090
1091      <para>The monadic actions are performed in the order that they
1092      are <emphasis>reduced</emphasis>.  If we consider the parse as a tree,
1093      then reductions happen in a depth-first left-to-right manner.
1094      The great thing about adding a monad to your parser is that it
1095      doesn't impose any performance overhead for normal reductions -
1096      only the monadic ones are translated like this.</para>
1097
1098      <para>Take a look at the Haskell parser for a good illustration
1099      of how to use a monad in your parser: it contains examples of
1100      all the principles discussed in this section, namely parse
1101      errors, a threaded lexer, line/column numbers, and state
1102      communication between the parser and lexer.</para>
1103
1104      <para>The following sections consider a couple of uses for
1105      monadic parsers, and describe how to also thread the monad
1106      through the lexical analyser.</para>
1107
1108      <sect2 id="sec-exception">
1109	<title>Handling Parse Errors</title>
1110	<indexterm>
1111	  <primary>parse errors</primary>
1112	  <secondary>handling</secondary>
1113	</indexterm>
1114
1115      <para>It's not very convenient to just call <literal>error</literal> when
1116      a parse error is detected: in a robust setting, you'd like the
1117      program to recover gracefully and report a useful error message
1118      to the user.  Exceptions (of which errors are a special case)
1119      are normally implemented in Haskell by using an exception monad,
1120      something like:</para>
1121
1122<programlisting>
1123data E a = Ok a | Failed String
1124
1125thenE :: E a -> (a -> E b) -> E b
1126m `thenE` k =
1127   case m of
1128       Ok a     -> k a
1129       Failed e -> Failed e
1130
1131returnE :: a -> E a
1132returnE a = Ok a
1133
1134failE :: String -> E a
1135failE err = Failed err
1136
1137catchE :: E a -> (String -> E a) -> E a
1138catchE m k =
1139   case m of
1140      Ok a     -> Ok a
1141      Failed e -> k e
1142</programlisting>
1143
1144	<para>This monad just uses a string as the error type.  The
1145        functions <literal>thenE</literal> and <literal>returnE</literal> are the usual
1146        bind and return operations of the monad, <literal>failE</literal>
1147        raises an error, and <literal>catchE</literal> is a combinator for
1148        handling exceptions.</para>
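	<para>As a self-contained sketch (not part of any generated
        parser), here is how a computation in this monad fails and is
        then recovered with <literal>catchE</literal>; the
        <literal>deriving</literal> clause is added here only so the
        result can be printed and compared:</para>

<programlisting>data E a = Ok a | Failed String deriving (Eq, Show)

thenE :: E a -> (a -> E b) -> E b
m `thenE` k = case m of { Ok a -> k a; Failed e -> Failed e }

returnE :: a -> E a
returnE = Ok

failE :: String -> E a
failE = Failed

catchE :: E a -> (String -> E a) -> E a
catchE m k = case m of { Ok a -> Ok a; Failed e -> k e }

-- A failing computation, recovered by substituting a default value:
demo :: E Int
demo = (failE "parse error" `thenE` \x -> returnE (x + 1))
         `catchE` \_ -> returnE 0

main :: IO ()
main = print demo    -- prints: Ok 0</programlisting>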
1149
1150	<para>We can add this monad to the parser with the declaration</para>
1151
1152<programlisting>
1153%monad { E } { thenE } { returnE }
1154</programlisting>
1155
1156	<para>Now, without changing the grammar, we can change the
1157        definition of <literal>parseError</literal> and have something sensible
1158        happen for a parse error:</para>
1159
1160<programlisting>
1161parseError tokens = failE "Parse error"
1162</programlisting>
1163
1164	<para>The parser now raises an exception in the monad instead
1165	of bombing out on a parse error.</para>
1166
1167	<para>We can also generate errors during parsing.  There are
1168        times when it is more convenient to parse a more general
1169        language than that which is actually intended, and check it
1170        later.  An example comes from Haskell, where the precedence
1171        values in infix declarations must be between 0 and 9:</para>
1172
1173<programlisting>prec :: { Int }
1174      : int    {% if $1 &lt; 0 || $1 > 9
1175	                then failE "Precedence out of range"
1176		        else returnE $1
1177		}</programlisting>
1178
1179	<para>The monadic action allows the check to be placed in the
1180	parser itself, where it belongs.</para>
1181
1182    </sect2>
1183
1184    <sect2 id="sec-lexers">
1185      <title>Threaded Lexers</title>
1186	<indexterm>
1187	  <primary>lexer, threaded</primary>
1188	</indexterm>
1189	<indexterm>
1190	  <primary>monadic</primary>
1191	  <secondary>lexer</secondary>
1192	</indexterm>
1193
1194	<para><application>Happy</application> allows the monad concept to be
1195	extended to the lexical analyser, too.  This has several
1196	useful consequences:</para>
1197
1198	<itemizedlist>
1199	  <listitem>
1200	    <para> Lexical errors can be treated in the same way as
1201            parse errors, using an exception monad.</para>
1202	    <indexterm>
1203	      <primary>parse errors</primary>
1204	      <secondary>lexical</secondary>
1205	    </indexterm>
1206	  </listitem>
1207	  <listitem>
1208	    <para> Information such as the current file and line
1209            number can be communicated between the lexer and
1210            parser. </para>
1211	  </listitem>
1212	  <listitem>
1213	    <para> General state communication between the parser and
1214            lexer - for example, implementation of the Haskell layout
1215            rule requires this kind of interaction.
1216            </para>
1217	  </listitem>
1218	  <listitem>
1219	    <para> IO operations can be performed in the lexer - this
1220            could be useful for following import/include declarations
1221            for instance.</para>
1222	  </listitem>
1223	</itemizedlist>
1224
1225	<para>A monadic lexer is requested by adding the following
1226	declaration to the grammar file:</para>
1227
1228<programlisting>
1229%lexer { &lt;lexer&gt; } { &lt;eof&gt; }
1230</programlisting>
1231
1232	<indexterm>
1233	  <primary><literal>%lexer</literal></primary>
1234	</indexterm>
1235
1236	<para>where <literal>&lt;lexer&gt;</literal> is the name of the lexical
1237        analyser function, and <literal>&lt;eof&gt;</literal> is a token that
1238        is to be treated as the end of file.</para>
1239
1240	<para>When using a monadic lexer, the parser no longer reads a
1241        list of tokens.  Instead, it calls the lexical analysis
1242        function for each new token to be read.  This has the side
1243        effect of eliminating the intermediate list of tokens, which
1244        is a slight performance win.</para>
1245
1246	<para>The type of the main parser function is now just
1247        <literal>P a</literal> - the input is being handled completely
1248        within the monad.</para>
1249
1250	<para>The type of <literal>parseError</literal> becomes
	<literal>Token -&gt; P a</literal>; that is, it takes Happy's
1252	current lookahead token as input.  This can be useful, because
1253	the error function probably wants to report the token at which
1254	the parse error occurred, and otherwise the lexer would have
1255	to store this token in the monad.</para>
1256
1257	<para>The lexical analysis function must have the following
1258	type:</para>
1259
1260<programlisting>
1261lexer :: (Token -> P a) -> P a
1262</programlisting>
1263
1264	<para>where <literal>P</literal> is the monad type constructor declared
1265        with <literal>%monad</literal>, and <literal>a</literal> can be replaced by the
1266        parser return type if desired.</para>
1267
1268	<para>You can see from this type that the lexer takes a
1269        <emphasis>continuation</emphasis> as an argument.  The lexer is to find
1270        the next token, and pass it to this continuation to carry on
1271        with the parse.  Obviously, we need to keep track of the input
1272        in the monad somehow, so that the lexer can do something
1273        different each time it's called!</para>
1274
1275	<para>Let's take the exception monad above, and extend it to
1276        add the input string so that we can use it with a threaded
1277        lexer.</para>
1278
1279<programlisting>
1280data ParseResult a = Ok a | Failed String
1281type P a = String -> ParseResult a
1282
1283thenP :: P a -> (a -> P b) -> P b
1284m `thenP` k = \s ->
1285   case m s of
1286       Ok a     -> k a s
1287       Failed e -> Failed e
1288
1289returnP :: a -> P a
1290returnP a = \s -> Ok a
1291
1292failP :: String -> P a
1293failP err = \s -> Failed err
1294
1295catchP :: P a -> (String -> P a) -> P a
1296catchP m k = \s ->
1297   case m s of
1298      Ok a     -> Ok a
1299      Failed e -> k e s
1300</programlisting>
1301
1302	<para>Notice that this isn't a real state monad - the input
1303        string just gets passed around, not returned.  Our lexer will
1304        now look something like this:</para>
1305
1306<programlisting>
1307lexer :: (Token -> P a) -> P a
1308lexer cont s =
1309    ... lexical analysis code ...
1310    cont token s'
1311</programlisting>
1312
	<para>The lexer grabs the continuation and the input string,
1314        finds the next token <literal>token</literal>, and passes it together
1315        with the remaining input string <literal>s'</literal> to the
1316        continuation.</para>
1317
1318	<para>We can now indicate lexical errors by ignoring the
1319        continuation and calling <literal>failP "error message" s</literal>
1320        within the lexer (don't forget to pass the input string to
1321        make the types work out).</para>
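	<para>To make this concrete, here is a minimal sketch of such a
        lexer for a language of integers and <literal>'+'</literal>.
        The <literal>Token</literal> type and the driver
        <literal>tokens</literal> are hypothetical, invented for this
        example; the monad definitions are repeated from the listing
        above so the example stands alone:</para>

<programlisting>import Data.Char (isDigit, isSpace)

-- Repeated from the listing above so this example stands alone:
data ParseResult a = Ok a | Failed String
type P a = String -> ParseResult a

failP :: String -> P a
failP err = \s -> Failed err

-- A hypothetical token type for this example:
data Token = TokInt Int | TokPlus | TokEOF
  deriving (Eq, Show)

-- Find the next token and pass it, together with the remaining
-- input, to the continuation:
lexer :: (Token -> P a) -> P a
lexer cont s = case s of
  []     -> cont TokEOF []
  (c:cs)
    | isSpace c -> lexer cont cs
    | c == '+'  -> cont TokPlus cs
    | isDigit c -> let (num, rest) = span isDigit s
                   in cont (TokInt (read num)) rest
    | otherwise -> failP ("unexpected character: " ++ [c]) s

-- For demonstration only: run the lexer repeatedly to collect
-- every token up to TokEOF.
tokens :: String -> ParseResult [Token]
tokens = go
  where
    go = lexer (\t -> case t of
                  TokEOF -> \_ -> Ok [TokEOF]
                  _      -> \s -> case go s of
                                    Ok ts    -> Ok (t : ts)
                                    Failed e -> Failed e)</programlisting>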
1322
1323	<para>This may all seem a bit weird.  Why, you ask, doesn't
1324        the lexer just have type <literal>P Token</literal>?  It was
1325        done this way for performance reasons - this formulation
1326        sometimes means that you can use a reader monad instead of a
1327        state monad for <literal>P</literal>, and the reader monad
1328        might be faster.  It's not at all clear that this reasoning
1329        still holds (or indeed ever held), and it's entirely possible
1330        that the use of a continuation here is just a
1331        misfeature.</para>
1332
1333        <para>If you want a lexer of type <literal>P Token</literal>,
1334        then just define a wrapper to deal with the
1335        continuation:</para>
1336
1337<programlisting>
1338lexwrap :: (Token -> P a) -> P a
1339lexwrap cont = real_lexer `thenP` \token -> cont token
1340</programlisting>
1341
1342      <sect3>
1343	<title>Monadic productions with %lexer</title>
1344
1345	<para>The <literal>{% ... }</literal> actions work fine with
1346	<literal>%lexer</literal>, but additionally there are two more
1347	forms which are useful in certain cases.  Firstly:</para>
1348
1349<programlisting>
1350n  :  t_1 ... t_n          {%^ &lt;expr&gt; }
1351</programlisting>
1352
1353	<para>In this case, <literal>&lt;expr&gt;</literal> has type
1354	<literal>Token -> P a</literal>.  That is, Happy passes the
1355	current lookahead token to the monadic action
1356	<literal>&lt;expr&gt;</literal>.  This is a useful way to get
1357	hold of Happy's current lookahead token without having to
1358	store it in the monad.</para>
1359
1360<programlisting>
1361n  :  t_1 ... t_n          {%% &lt;expr&gt; }
1362</programlisting>
1363
1364	<para>This is a slight variant on the previous form.  The type
1365	of <literal>&lt;expr&gt;</literal> is the same, but in this
1366	case the lookahead token is actually discarded and a new token
1367	is read from the input.  This can be useful when you want to
1368	change the next token and continue parsing.</para>
1369      </sect3>
1370    </sect2>
1371
1372    <sect2 id="sec-line-numbers">
1373      <title>Line Numbers</title>
1374
1375	<indexterm>
1376	  <primary>line numbers</primary>
1377	</indexterm>
1378
1379	<indexterm>
1380	  <primary><literal>%newline</literal></primary>
1381	</indexterm>
1382	<para>Previous versions of <application>Happy</application> had a
1383        <literal>%newline</literal> directive that enabled simple line numbers
1384        to be counted by the parser and referenced in the actions.  We
        warned you that this facility may go away and be replaced by
        something more general; well, guess what? :-)</para>
1387
1388	<para>Line numbers can now be dealt with quite
1389        straightforwardly using a monadic parser/lexer combination.
1390        Ok, we have to extend the monad a bit more:</para>
1391
1392<programlisting>
1393type LineNumber = Int
1394type P a = String -> LineNumber -> ParseResult a
1395
1396getLineNo :: P LineNumber
1397getLineNo = \s l -> Ok l
1398</programlisting>
1399
1400	<para>(the rest of the functions in the monad follow by just
1401        adding the extra line number argument in the same way as the
1402        input string).  Again, the line number is just passed down,
        not returned: this is OK because the continuation-based
        lexer can change the line number and pass the new one to
1405        the continuation.</para>
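	<para>Spelled out, the extended operations might look like this
        (only the extra <literal>LineNumber</literal> argument is new
        relative to the versions given earlier):</para>

<programlisting>data ParseResult a = Ok a | Failed String
type LineNumber = Int
type P a = String -> LineNumber -> ParseResult a

thenP :: P a -> (a -> P b) -> P b
m `thenP` k = \s l ->
   case m s l of
       Ok a     -> k a s l
       Failed e -> Failed e

returnP :: a -> P a
returnP a = \s l -> Ok a

failP :: String -> P a
failP err = \s l -> Failed err

-- as defined above:
getLineNo :: P LineNumber
getLineNo = \s l -> Ok l</programlisting>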
1406
1407	<para>The lexer can now update the line number as follows:</para>
1408
1409<programlisting>
1410lexer cont s =
1411  case s of
1412     '\n':s  ->  \line -> lexer cont s (line + 1)
1413     ... rest of lexical analysis ...
1414</programlisting>
1415
	<para>It's as simple as that.  Take a look at
        <application>Happy</application>'s own parser if you have the sources lying
        around; it uses a monad just like the one above.</para>
1419
1420        <para>Reporting the line number of a parse error is achieved
1421        by changing <literal>parseError</literal> to look something like
1422        this:</para>
1423
1424<programlisting>
1425parseError :: Token -> P a
parseError _ = getLineNo `thenP` \line ->
               failP (show line ++ ": parse error")
1428</programlisting>
1429
1430	<para>We can also get hold of the line number during parsing,
1431        to put it in the parsed data structure for future reference.
1432        A good way to do this is to have a production in the grammar
1433        that returns the current line number: </para>
1434
1435<programlisting>lineno :: { LineNumber }
1436        : {- empty -}      {% getLineNo }</programlisting>
1437
1438	<para>The semantic value of <literal>lineno</literal> is the line
1439        number of the last token read - this will always be the token
1440        directly following the <literal>lineno</literal> symbol in the grammar,
1441        since <application>Happy</application> always keeps one lookahead token in
1442        reserve.</para>
1443
1444      </sect2>
1445
1446      <sect2 id="sec-monad-summary">
1447	<title>Summary</title>
1448
1449	<para>The types of various functions related to the parser are
1450        dependent on what combination of <literal>%monad</literal> and
1451        <literal>%lexer</literal> directives are present in the grammar.  For
1452        reference, we list those types here.  In the following types,
1453        <emphasis>t</emphasis> is the return type of the
1454        parser.  A type containing a type variable indicates that the
1455        specified function must be polymorphic.</para>
1456
1457	<indexterm>
1458	  <primary>type</primary>
1459	  <secondary>of <function>parseError</function></secondary>
1460	</indexterm>
1461	<indexterm>
1462	  <primary>type</primary>
1463	  <secondary>of parser</secondary>
1464	</indexterm>
1465	<indexterm>
1466	  <primary>type</primary>
1467	  <secondary>of lexer</secondary>
1468	</indexterm>
1469
1470	<itemizedlist>
1471	  <listitem>
1472	    <formalpara>
1473	      <title> No <literal>&percnt;monad</literal> or
1474	      <literal>&percnt;lexer</literal> </title>
1475	      <para>
1476<programlisting>
1477parse      :: [Token] -> <emphasis>t</emphasis>
1478parseError :: [Token] -> a
1479</programlisting>
1480</para>
1481	    </formalpara>
1482	  </listitem>
1483
1484	  <listitem>
1485	    <formalpara>
1486	      <title> with <literal>%monad</literal> </title>
1487	      <para>
1488<programlisting>
1489parse      :: [Token] -> P <emphasis>t</emphasis>
1490parseError :: [Token] -> P a
1491</programlisting>
1492</para>
1493	    </formalpara>
1494	  </listitem>
1495
1496
1497	  <listitem>
1498	    <formalpara>
1499	      <title> with <literal>%lexer</literal> </title>
1500	      <para><programlisting>
1501parse      :: T <emphasis>t</emphasis>
1502parseError :: Token -> T a
1503lexer      :: (Token -> T a) -> T a
1504</programlisting>
1505where the type constructor <literal>T</literal> is whatever you want (usually <literal>T
1506a = String -> a</literal>).  I'm not sure if this is useful, or even if it works
1507properly.</para>
1508	    </formalpara>
1509	  </listitem>
1510
1511	  <listitem>
1512	    <formalpara>
1513	      <title> with <literal>%monad</literal> and <literal>%lexer</literal> </title>
1514	      <para><programlisting>
1515parse      :: P <emphasis>t</emphasis>
1516parseError :: Token -> P a
1517lexer      :: (Token -> P a) -> P a
1518</programlisting>
1519</para>
1520	    </formalpara>
1521	  </listitem>
1522	</itemizedlist>
1523
1524      </sect2>
1525    </sect1>
1526
1527    <sect1 id="sec-error">
1528      <title>The Error Token</title>
1529      <indexterm>
1530	<primary>error token</primary>
1531      </indexterm>
1532
1533      <para><application>Happy</application> supports a limited form of error
1534      recovery, using the special symbol <literal>error</literal> in a grammar
1535      file.  When <application>Happy</application> finds a parse error during
1536      parsing, it automatically inserts the <literal>error</literal> symbol; if
1537      your grammar deals with <literal>error</literal> explicitly, then it can
1538      detect the error and carry on.</para>
1539
1540      <para>For example, the <application>Happy</application> grammar for Haskell
1541      uses error recovery to implement Haskell layout.  The grammar
1542      has a rule that looks like this:</para>
1543
1544<programlisting>
1545close : '}'                  { () }
1546      | error		     { () }
1547</programlisting>
1548
1549      <para>This says that a close brace in a layout-indented context
1550      may be either a curly brace (inserted by the lexical analyser),
1551      or a parse error.  </para>
1552
1553      <para>This rule is used to parse expressions like <literal>let x
1554      = e in e'</literal>: the layout system inserts an open brace before
1555      <literal>x</literal>, and the occurrence of the <literal>in</literal> symbol
1556      generates a parse error, which is interpreted as a close brace
1557      by the above rule.</para>
1558
1559      <indexterm>
1560	<primary><application>yacc</application></primary>
1561      </indexterm>
1562      <para>Note for <literal>yacc</literal> users: this form of error recovery
1563      is strictly more limited than that provided by <literal>yacc</literal>.
1564      During a parse error condition, <literal>yacc</literal> attempts to
1565      discard states and tokens in order to get back into a state
1566      where parsing may continue; <application>Happy</application> doesn't do this.
1567      The reason is that normal <literal>yacc</literal> error recovery is
1568      notoriously hard to describe, and the semantics depend heavily
1569      on the workings of a shift-reduce parser.  Furthermore,
1570      different implementations of <literal>yacc</literal> appear to implement
1571      error recovery differently.  <application>Happy</application>'s limited error
      recovery, on the other hand, is well-defined and just
1573      sufficient to implement the Haskell layout rule (which is why it
1574      was added in the first place).</para>
1575    </sect1>
1576
1577    <sect1 id="sec-multiple-parsers">
1578      <title>Generating Multiple Parsers From a Single Grammar</title>
1579      <indexterm>
1580	<primary>multiple parsers</primary>
1581      </indexterm>
1582
1583      <para>It is often useful to use a single grammar to describe
1584      multiple parsers, where each parser has a different top-level
1585      non-terminal, but parts of the grammar are shared between
1586      parsers.  A classic example of this is an interpreter, which
1587      needs to be able to parse both entire files and single
1588      expressions: the expression grammar is likely to be identical
1589      for the two parsers, so we would like to use a single grammar
1590      but have two entry points.</para>
1591
1592      <para><application>Happy</application> lets you do this by
1593      allowing multiple <literal>%name</literal> directives in the
1594      grammar file.  The <literal>%name</literal> directive takes an
1595      optional second parameter specifying the top-level
1596      non-terminal for this parser, so we may specify multiple parsers
1597      like so:</para>
1598      <indexterm><primary><literal>%name</literal> directive</primary>
1599      </indexterm>
1600
1601<programlisting>
1602%name parse1 non-terminal1
1603%name parse2 non-terminal2
1604</programlisting>
1605
1606      <para><application>Happy</application> will generate from this a
1607      module which defines two functions <function>parse1</function>
1608      and <function>parse2</function>, which parse the grammars given
1609      by <literal>non-terminal1</literal> and
1610      <literal>non-terminal2</literal> respectively.  Each parsing
1611      function will of course have a different type, depending on the
1612      type of the appropriate non-terminal.</para>
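      <para>
      For example, an interpreter might declare two entry points
      (the non-terminal and function names here are illustrative, not
      taken from any particular example):
      </para>
<programlisting>
%name parseProgram Program
%name parseExpr    Expr
</programlisting>
      <para>
      <application>Happy</application> would then generate functions
      <function>parseProgram</function> and <function>parseExpr</function>,
      whose result types correspond to the types declared for the
      <literal>Program</literal> and <literal>Expr</literal>
      non-terminals.
      </para>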
1613    </sect1>
1614
1615  </chapter>
1616
1617  <chapter id="sec-glr">
1618
1619    <chapterinfo>
1620      <copyright>
1621        <year>2004</year>
1622        <holder>University of Durham, Paul Callaghan, Ben Medlock</holder>
1623      </copyright>
1624    </chapterinfo>
1625
1626    <title>Generalized LR Parsing</title>
1627
1628    <para>This chapter explains how to use the GLR parsing extension,
1629    which allows <application>Happy</application> to parse ambiguous
1630    grammars and produce useful results.
1631    This extension is triggered with the <option>--glr</option> flag,
1632    which causes <application>Happy</application>
1633    to use a different driver for the LALR(1) parsing
1634    tables. The result of parsing is a structure which encodes compactly
1635    <emphasis>all</emphasis> of the possible parses.
1636    There are two options for how semantic information is combined with
1637    the structural information.
1638    </para>
1639
1640    <para>
1641    This extension was developed by Paul Callaghan and Ben Medlock
1642    (University of Durham). It is based on the structural parser
1643    implemented in Medlock's undergraduate project, but significantly
1644    extended and improved by Callaghan.
1645    Bug reports, comments, questions etc should be sent to
1646    <email>P.C.Callaghan@durham.ac.uk</email>.
1647    Further information can be found on Callaghan's
1648    <ulink url="http://www.dur.ac.uk/p.c.callaghan/happy-glr">GLR parser
1649    page</ulink>.
1650
1651
1652    </para>
1653
1654    <sect1 id="sec-glr-intro">
1655      <title>Introduction</title>
1656
1657      <para>
1658      Here's an ambiguous grammar. It has no information about the
1659      associativity of <literal>+</literal>, so for example,
1660      <literal>1+2+3</literal> can be parsed as
1661      <literal>(1+(2+3))</literal> or <literal>((1+2)+3)</literal>.
      In conventional mode, <application>Happy</application>
      would complain about a shift/reduce
      conflict, although it would generate a parser that always shifts
      in such a conflict, and hence would produce <emphasis>only</emphasis>
      the first alternative above.
1667      </para>
1668
1669<programlisting>
1670E -> E + E
1671E -> i       -- any integer
1672</programlisting>
1673
1674      <para>
1675      GLR parsing will accept this grammar without complaint, and produce
1676      a result which encodes <emphasis>both</emphasis> alternatives
1677      simultaneously. Now consider the more interesting example of
1678      <literal>1+2+3+4</literal>, which has five distinct parses -- try to
1679      list them! You will see that some of the subtrees are identical.
1680      A further property of the GLR output is that such sub-results are
1681      shared, hence efficiently represented: there is no combinatorial
1682      explosion.
1683      Below is the simplified output of the GLR parser for this example.
1684      </para>
1685
1686<programlisting>
1687Root (0,7,G_E)
(0,1,G_E)     => [[(0,1,Tok '1')]]
(0,3,G_E)     => [[(0,1,G_E),(1,2,Tok '+'),(2,3,G_E)]]
(0,5,G_E)     => [[(0,1,G_E),(1,2,Tok '+'),(2,5,G_E)]
                  ,[(0,3,G_E),(3,4,Tok '+'),(4,5,G_E)]]
(0,7,G_E)     => [[(0,3,G_E),(3,4,Tok '+'),(4,7,G_E)]
                  ,[(0,1,G_E),(1,2,Tok '+'),(2,7,G_E)]
                  ,[(0,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(2,3,G_E)     => [[(2,3,Tok '2')]]
(2,5,G_E)     => [[(2,3,G_E),(3,4,Tok '+'),(4,5,G_E)]]
(2,7,G_E)     => [[(2,3,G_E),(3,4,Tok '+'),(4,7,G_E)]
                  ,[(2,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(4,5,G_E)     => [[(4,5,Tok '3')]]
(4,7,G_E)     => [[(4,5,G_E),(5,6,Tok '+'),(6,7,G_E)]]
(6,7,G_E)     => [[(6,7,Tok '4')]]
1702</programlisting>
1703
1704      <para>
1705      This is a directed, acyclic and-or graph.
1706      The node "names" are of form <literal>(a,b,c)</literal>
1707      where <literal>a</literal> and <literal>b</literal>
1708      are the start and end points (as positions in the input string)
1709      and <literal>c</literal> is a category (or name of grammar rule).
1710      For example <literal>(2,7,G_E)</literal> spans positions 2 to 7
1711      and contains analyses which match the <literal>E</literal>
1712      grammar rule.
1713      Such analyses are given as a list of alternatives (disjunctions),
1714      each corresponding to some use of a production of that
1715      category, which in turn are a conjunction of sub-analyses,
1716      each represented as a node in the graph or an instance of a token.
1717      </para>
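      <para>
      To illustrate, the number of parses encoded by such a graph can
      be counted mechanically. The following self-contained sketch
      (node names transcribed from the listing above, and
      <literal>countTrees</literal> being our own illustrative helper,
      not part of the generated parser) confirms the five parses of
      <literal>1+2+3+4</literal>:
      </para>

```haskell
import qualified Data.Map as Map

type Node = (Int, Int, String)

-- The packed forest for "1+2+3+4", transcribed from the listing above.
-- Token nodes such as (1,2,"+") have no entry and count as single leaves.
forest :: Map.Map Node [[Node]]
forest = Map.fromList
  [ ((0,1,"E"), [[(0,1,"1")]])
  , ((0,3,"E"), [[(0,1,"E"),(1,2,"+"),(2,3,"E")]])
  , ((0,5,"E"), [[(0,1,"E"),(1,2,"+"),(2,5,"E")]
                ,[(0,3,"E"),(3,4,"+"),(4,5,"E")]])
  , ((0,7,"E"), [[(0,3,"E"),(3,4,"+"),(4,7,"E")]
                ,[(0,1,"E"),(1,2,"+"),(2,7,"E")]
                ,[(0,5,"E"),(5,6,"+"),(6,7,"E")]])
  , ((2,3,"E"), [[(2,3,"2")]])
  , ((2,5,"E"), [[(2,3,"E"),(3,4,"+"),(4,5,"E")]])
  , ((2,7,"E"), [[(2,3,"E"),(3,4,"+"),(4,7,"E")]
                ,[(2,5,"E"),(5,6,"+"),(6,7,"E")]])
  , ((4,5,"E"), [[(4,5,"3")]])
  , ((4,7,"E"), [[(4,5,"E"),(5,6,"+"),(6,7,"E")]])
  , ((6,7,"E"), [[(6,7,"4")]])
  ]

-- Distinct trees a node encodes: sum over alternatives of the product
-- over children; nodes absent from the map are token leaves.
countTrees :: Map.Map Node [[Node]] -> Node -> Integer
countTrees f n = case Map.lookup n f of
  Nothing   -> 1
  Just alts -> sum (map (product . map (countTrees f)) alts)

main :: IO ()
main = print (countTrees forest (0,7,"E"))  -- prints 5
```

      <para>
      Counting via sums of products visits each shared node once, which
      is exactly why the packed representation avoids a combinatorial
      explosion.
      </para>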
1718
1719      <para>
1720      Hence <literal>(2,7,G_E)</literal> contains two alternatives,
1721      one which has <literal>(2,3,G_E)</literal> as its first child
1722      and the other with <literal>(2,5,G_E)</literal> as its first child,
1723      respectively corresponding to sub-analyses
1724      <literal>(2+(3+4))</literal> and <literal>((2+3)+4)</literal>.
1725      Both alternatives have the token <literal>+</literal> as their
      second child, but note that they are different occurrences of
1727      <literal>+</literal> in the input!
1728      We strongly recommend looking at such results in graphical form
1729      to understand these points. If you build the
1730      <literal>expr-eval</literal> example in the directory
1731      <literal>examples/glr</literal> (NB you need to use GHC for this,
1732      unless you know how to use the <option>-F</option> flag for Hugs),
1733      running the example will produce a file which can be viewed with
1734      the <emphasis>daVinci</emphasis> graph visualization tool.
1735      (See <ulink url="http://www.informatik.uni-bremen.de/~davinci/"/>
1736       for more information. Educational use licenses are currently
1737	available without charge.)
1738      </para>
1739
1740      <para>
1741      The GLR extension also allows semantic information to be attached
1742      to productions, as in conventional <application>Happy</application>,
1743      although there are further issues to consider.
1744      Two modes are provided, one for simple applications and one for more
1745      complex use.
1746      See <xref linkend="sec-glr-semantics"/>.
1747      The extension is also integrated with <application>Happy</application>'s
1748      token handling, e.g. extraction of information from tokens.
1749      </para>
1750
1751      <para>
1752      One key feature of this implementation in Haskell is that its main
1753      result is a <emphasis>graph</emphasis>.
1754      Other implementations effectively produce a list of trees, but this
1755      limits practical use to small examples.
1756      For large and interesting applications, some of which are discussed
1757      in <xref linkend="sec-glr-misc-applications"/>, a graph is essential due
1758      to the large number of possibilities and the need to analyse the
      structure of the ambiguity. Converting the graph to trees could
      produce a huge number of results and would lose the sharing
      information.
1761      </para>
1762
1763      <para>
1764      One final comment. You may have learnt through using
1765      <application>yacc</application>-style tools that ambiguous grammars
1766      are to be avoided, and that ambiguity is something that appears
1767      only in Natural Language processing.
1768      This is definitely not true.
1769      Many interesting grammars are ambiguous, and with GLR tools they
1770      can be used effectively.
1771      We hope you enjoy exploring this fascinating area!
1772      </para>
1773
1774    </sect1>
1775
1776    <sect1 id="sec-glr-using">
1777      <title>Basic use of a Happy-generated GLR parser</title>
1778
1779      <para>
1780      This section explains how to generate and to use a GLR parser to
1781      produce structural results.
1782      Please check the examples for further information.
1783      Discussion of semantic issues comes later; see
1784      <xref linkend="sec-glr-semantics"/>.
1785      </para>
1786
1787      <sect2 id="sec-glr-using-intro">
1788        <title>Overview</title>
1789        <para>
1790	The process of generating a GLR parser is broadly the same as
1791	for standard <application>Happy</application>. You write a grammar
1792	specification, run <application>Happy</application> on this to
1793	generate some Haskell code, then compile and link this into your
1794	program.
1795        </para>
1796        <para>
1797	An alternative to using Happy directly is to use the
1798	<ulink url="http://www.cs.chalmers.se/~markus/BNFC/">
1799	<application>BNF Converter</application></ulink> tool by
1800	Markus Forsberg, Peter Gammie, Michael Pellauer and Aarne Ranta.
	This tool creates an abstract syntax, grammar, pretty-printer
	and other useful items from a single grammar formalism, thus
	saving a lot of work and improving maintainability.
1804	The current output of BNFC can be used with GLR mode now
1805	with just a few small changes, but from January 2005 we expect
1806	to have a fully-compatible version of BNFC.
1807        </para>
1808        <para>
1809	Most of the features of <application>Happy</application> still
1810	work, but note the important points below.
1811        </para>
1812	<variablelist>
1813	   <varlistentry>
1814	     <term>module header</term>
1815	     <listitem>
1816	       <para>
1817	       The GLR parser is generated in TWO files, one for data and
1818	       one for the driver. This is because the driver code needs
1819	       to be optimized, but for large parsers with lots of data,
1820	       optimizing the data tables too causes compilation to be
1821	       too slow.
1822	       </para>
1823	       <para>
1824	       Given a file <literal>Foo.y</literal>, the file
1825	       <literal>FooData.hs</literal>, containing the data
1826	       module, is generated with basic type information, the
1827	       parser tables, and the header and tail code that was
	       included in the parser specification.  Note that
	       <application>Happy</application> can automatically
	       generate the necessary module declaration statements
	       if you do not provide one in the grammar file.  If you
	       do provide a module declaration, its name is parsed and
	       used as the name of the driver module; the data module
	       takes the same name with the string
	       <literal>Data</literal> appended to it.  The driver
	       module, found in the file <literal>Foo.hs</literal>,
	       will not contain any other user-supplied text besides
	       the module name.  Do not supply any export declarations
	       in your module declaration: they will be ignored and
	       dropped in favor of the standard export declaration.
1844	       </para>
1845
1846	     </listitem>
1847	   </varlistentry>
1848	   <varlistentry>
1849	     <term>export of lexer</term>
1850	     <listitem>
1851	       <para>
1852	       You can declare a lexer (and error token) with the
1853	       <literal>%lexer</literal> directive as normal, but the
1854	       generated parser does NOT call this lexer automatically.
1855	       The action of the directive is only to
1856	       <emphasis>export</emphasis> the lexer function to the top
1857	       level. This is because some applications need finer control
1858	       of the lexing process.
1859	       </para>
1860	     </listitem>
1861	   </varlistentry>
1862
1863	   <varlistentry>
1864	     <term>precedence information</term>
1865	     <listitem>
1866	       <para>
1867	       This still works, but note the reasons.
1868	       The precedence and associativity declarations are used in
1869	       <application>Happy</application>'s LR table creation to
1870	       resolve certain conflicts. It does this by retaining the
1871	       actions implied by the declarations and removing the ones
1872	       which clash with these.
1873	       The GLR parser back-end then produces code from these
1874	       filtered tables, hence the rejected actions are never
1875	       considered by the GLR parser.
1876	       </para>
1877	       <para>
1878	       Hence, declaring precedence and associativity is still
1879	       a good thing, since it avoids a certain amount of ambiguity
1880	       that the user knows how to remove.
1881	       </para>
1882	     </listitem>
1883	   </varlistentry>
1884	   <varlistentry>
1885	     <term>monad directive</term>
1886	     <listitem>
1887	       <para>
1888	       There is some support for monadic parsers.
1889	       The "tree decoding" mode
1890	       (see <xref linkend="sec-glr-semantics-tree"/>) can use the
1891	       information given in the <literal>%monad</literal>
1892	       declaration to monadify the decoding process.
1893	       This is explained in more detail in
1894	       <xref linkend="sec-glr-semantics-tree-monad"/>.
1895	       </para>
1896	       <para>
1897	       <emphasis>Note</emphasis>: the generated parsers don't include
1898	       Ashley Yakeley's monad context information yet. It is currently
1899	       just ignored.
1900	       If this is a problem, email and I'll make the changes required.
1901	       </para>
1902	     </listitem>
1903	   </varlistentry>
1904	   <varlistentry>
1905	     <term>parser name directive</term>
1906	     <listitem>
1907	       <para>
1908	       This has no effect at present. It will probably remain this
1909	       way: if you want to control names, you could use qualified
1910	       import.
1911	       </para>
1912	     </listitem>
1913	   </varlistentry>
1914	   <varlistentry>
1915	     <term>type information on non-terminals</term>
1916	     <listitem>
1917	       <para>
1918	       The generation of semantic code relies on type information
1919	       given in the grammar specification. If you don't give an
1920	       explicit signature, the type <literal>()</literal> is
1921	       assumed. If you get type clashes mentioning
1922	       <literal>()</literal> you may need to add type annotations.
1923	       Similarly, if you don't supply code for the semantic rule
1924	       portion, then the value <literal>()</literal> is used.
1925	       </para>
1926	     </listitem>
1927	   </varlistentry>
1928	   <varlistentry>
1929	     <term><literal>error</literal> symbol in grammars, and recovery
1930		</term>
1931	     <listitem>
1932	       <para>
	       This has not been implemented yet. Any use of
	       <literal>error</literal> in grammars is thus ignored, and
	       a parse error will eventually cause the parse to fail.
1936	       </para>
1937	     </listitem>
1938	   </varlistentry>
1939	   <varlistentry>
1940	     <term>the token type</term>
1941	     <listitem>
1942	       <para>
1943	       The type used for tokens <emphasis>must</emphasis> be in
1944	       the <literal>Ord</literal> type class (and hence in
1945	       <literal>Eq</literal>), plus it is recommended that they
1946	       are in the <literal>Show</literal> class too.
1947	       The ordering is required for the implementation of
1948	       ambiguity packing.
1949	       It may be possible to relax this requirement, but it
1950	       is probably simpler just to require instances of the type
1951	       classes. Please tell us if this is a problem.
1952	       </para>
1953	     </listitem>
1954	   </varlistentry>
1955	</variablelist>
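	<para>
	A token type meeting the requirements of the last point can
	simply derive the instances (the constructor names here are
	illustrative):
	</para>

```haskell
-- Ord (and hence Eq) is required for ambiguity packing;
-- Show is recommended for debugging and display.
data Token = TokPlus | TokTimes | TokInt Int
  deriving (Eq, Ord, Show)

main :: IO ()
main = print (minimum [TokInt 3, TokTimes, TokPlus])  -- prints TokPlus
```

	<para>
	The derived ordering follows constructor declaration order,
	which is all the ambiguity-packing implementation needs.
	</para>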
1956
1957      </sect2>
1958
1959      <sect2 id="sec-glr-using-main">
1960        <title>The main function</title>
1961        <para>
1962	The driver file exports a function
1963	<literal>doParse :: [[UserDefTok]] -> GLRResult</literal>.
1964	If you are using several parsers, use qualified naming to
1965	distinguish them.
1966	<literal>UserDefTok</literal> is a synonym for the type declared with
1967	the <literal>%tokentype</literal> directive.
1968        </para>
1969      </sect2>
1970
1971      <sect2 id="sec-glr-using-input">
1972        <title>The input</title>
1973        <para>
1974	The input to <literal>doParse</literal> is a list of
1975	<emphasis>list of</emphasis> token values.
1976	The outer level represents the sequence of input symbols, and
1977	the inner list represents ambiguity in the tokenisation of each
1978	input symbol.
1979	For example, the word "run" can be at least a noun or a verb,
1980	hence the inner list will contain at least two values.
1981	If your tokens are not ambiguous, you will need to convert each
1982	token to a singleton list before parsing.
1983        </para>
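        <para>
	For an unambiguous tokeniser this conversion is a one-liner;
	a minimal sketch (<literal>wrapTokens</literal> is a
	hypothetical helper name):
        </para>

```haskell
-- Wrap each token in a singleton list, marking it as unambiguous,
-- before passing the result to doParse.
wrapTokens :: [tok] -> [[tok]]
wrapTokens = map (\t -> [t])

main :: IO ()
main = print (wrapTokens "run")  -- prints ["r","u","n"]
```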
1984      </sect2>
1985
1986      <sect2 id="sec-glr-using-output">
1987        <title>The Parse Result</title>
1988        <para>
1989	The parse result is expressed with the following types.
1990	A successful parse yields a forest (explained below) and a single
1991	root node for the forest.
1992	A parse may fail for one of two reasons: running out of input or
1993	a (global) parse error. A global parse error means that it was
1994	not possible to continue parsing <emphasis>any</emphasis> of the
1995	live alternatives; this is different from a local error, which simply
1996	means that the current alternative dies and we try some other
1997	alternative. In both error cases, the forest at failure point is
1998	returned, since it may contain useful information.
1999	Unconsumed tokens are returned when there is a global parse error.
2000        </para>
2001<programlisting>
2002type ForestId = (Int,Int,GSymbol)
2003data GSymbol  = &lt;... automatically generated ...&gt;
2004type Forest   = FiniteMap ForestId [Branch]
2005type RootNode = ForestId
2006type Tokens   = [[(Int, GSymbol)]]
2007data Branch   = Branch {b_sem :: GSem, b_nodes :: [ForestId]}
2008data GSem     = &lt;... automatically generated ...&gt;
2009
2010data GLRResult
2011  = ParseOK     RootNode Forest    -- forest with root
2012  | ParseError  Tokens   Forest    -- partial forest with bad input
2013  | ParseEOF             Forest    -- partial forest (missing input)
2014</programlisting>
2015	<para>
2016	Conceptually, the parse forest is a directed, acyclic and-or
2017	graph. It is represented by a mapping of <literal>ForestId</literal>s
2018	to lists of possible analyses. The <literal>FiniteMap</literal>
2019	type is used to provide efficient and convenient access.
2020	The <literal>ForestId</literal> type identifies nodes in the
2021	graph, named by the range of input they span and the category of
2022	analysis they license. <literal>GSymbol</literal> is generated
2023	automatically as a union of the names of grammar rules (prefixed
2024	by <literal>G_</literal> to avoid name clashes) and of tokens and
2025	an EOF symbol. Tokens are wrapped in the constructor
2026	<literal>HappyTok :: UserDefTok -> GSymbol</literal>.
2027	</para>
2028	<para>
2029	The <literal>Branch</literal> type represents a match for some
2030	right-hand side of a production, containing semantic information
2031	(see below)
2032	and a list of sub-analyses. Each of these is a node in the graph.
2033	Note that tokens are represented as childless nodes that span
2034	one input position. Empty productions will appear as childless nodes
2035	that start and end at the same position.
2036	</para>
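	<para>
	A consumer typically dispatches on the three constructors. The
	sketch below is self-contained: it mocks up the types above by
	hand, with <literal>Data.Map</literal> standing in for
	<literal>FiniteMap</literal> and a tiny stand-in for the
	automatically generated <literal>GSymbol</literal>; it is not
	the generated code itself.
	</para>

```haskell
import qualified Data.Map as Map

-- Hand-written stand-ins for the automatically generated types.
data GSymbol  = G_E | HappyTok Char deriving (Eq, Ord, Show)
type ForestId = (Int, Int, GSymbol)
data Branch   = Branch { b_nodes :: [ForestId] }
type Forest   = Map.Map ForestId [Branch]
type Tokens   = [[(Int, GSymbol)]]

data GLRResult
  = ParseOK    ForestId Forest   -- forest with root
  | ParseError Tokens   Forest   -- partial forest with bad input
  | ParseEOF            Forest   -- partial forest (missing input)

-- Summarise a result; real code would walk the forest in each case.
summary :: GLRResult -> String
summary (ParseOK root _)  = "success, root at " ++ show root
summary (ParseError ts _) = "global error, " ++ show (length ts)
                            ++ " tokens unconsumed"
summary (ParseEOF _)      = "ran out of input"

main :: IO ()
main = putStrLn (summary (ParseEOF Map.empty))  -- prints "ran out of input"
```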
2037      </sect2>
2038
2039      <sect2 id="sec-glr-using-compiling">
2040        <title>Compiling the parser</title>
2041        <para>
2042	<application>Happy</application> will generate two files, and these
2043	should be compiled as normal Haskell files.
2044	If speed is an issue, then you should use the <option>-O</option>
2045	flags etc with the driver code, and if feasible, with the parser
2046	tables too.
2047        </para>
2048        <para>
2049        You can also use the <option>--ghc</option> flag to trigger certain
2050	<application>GHC</application>-specific optimizations. At present,
2051	this just causes use of unboxed types in the tables and in some key
2052	code.
2053	Using this flag causes relevant <application>GHC</application>
2054	option pragmas to be inserted into the generated code, so you shouldn't
2055	have to use any strange flags (unless you want to...).
2056        </para>
2057      </sect2>
2058    </sect1>
2059
2060    <sect1 id="sec-glr-semantics">
2061      <title>Including semantic results</title>
2062
2063      <para>
2064      This section discusses the options for including semantic information
2065      in grammars.
2066      </para>
2067
2068      <sect2 id="sec-glr-semantics-intro">
2069        <title>Forms of semantics</title>
2070        <para>
2071	Semantic information may be attached to productions in the
2072	conventional way, but when more than one analysis is possible,
2073	the use of the semantic information must change.
2074	Two schemes have been implemented, which we call
2075	<emphasis>tree decoding</emphasis>
2076	and <emphasis>label decoding</emphasis>.
2077	The former is for simple applications, where there is not much
2078	ambiguity and hence where the effective unpacking of the parse
2079	forest isn't a factor. This mode is quite similar to the
2080	standard mode in <application>Happy</application>.
2081	The latter is for serious applications, where sharing is important
2082	and where processing of the forest (eg filtering) is needed.
	Here, the emphasis is on providing rich labels in the nodes of
	the parse forest, to support such processing.
2085        </para>
2086	<para>
2087	The default mode is labelling. If you want the tree decode mode,
2088	use the <option>--decode</option> flag.
2089	</para>
2090      </sect2>
2091
2092      <sect2 id="sec-glr-semantics-tree">
2093        <title>Tree decoding</title>
2094        <para>
2095	Tree decoding corresponds to unpacking the parse forest to individual
2096	trees and collecting the list of semantic results computed from
2097	each of these. It is a mode intended for simple applications,
2098	where there is limited ambiguity.
2099	You may access semantic results from components of a reduction
2100	using the dollar variables.
	As a working example, the following is taken from the
	<literal>expr-eval</literal> grammar in the examples.
	Note that the type signature is required, since otherwise the
	parser generator cannot determine the types in use.
2105        </para>
2106<programlisting>
2107E :: {Int} -- type signature needed
2108  : E '+' E  { $1 + $3 }
2109  | E '*' E  { $1 * $3 }
2110  | i        { $1 }
2111</programlisting>
2112	<para>
2113	This mode works by converting each of the semantic rules into
2114	functions (abstracted over the dollar variables mentioned),
2115	and labelling each <literal>Branch</literal> created from a
2116	reduction of that rule with the function value.
2117	This amounts to <emphasis>delaying</emphasis> the action of the
2118	rule, since we must wait until we know the results of all of
2119	the sub-analyses before computing any of the results. (Certain
2120	cases of packing can add new analyses at a later stage.)
2121	</para>
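	<para>
	Viewed this way, each semantic rule becomes an ordinary Haskell
	function abstracted over its dollar variables, applied only
	once the sub-results are known. A sketch (the function names
	are ours, for illustration):
	</para>

```haskell
-- The actions { $1 + $3 } and { $1 * $3 } as delayed functions;
-- each Branch built by reducing a rule is labelled with one of these.
plusRule, timesRule :: Int -> Int -> Int
plusRule  v1 v3 = v1 + v3
timesRule v1 v3 = v1 * v3

main :: IO ()
main = print (plusRule 1 (timesRule 2 3))  -- prints 7
```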
2122	<para>
2123	At the end of parsing, the functions are applied across relevant
2124	sub-analyses via a recursive descent. The main interface to this
2125	is via the class and entry function below. Typically,
2126	<literal>decode</literal> should be called on the root of the
2127	forest, also supplying a function which maps node names to their
2128	list of analyses (typically a partial application of lookup in
2129	the forest value).
2130	The result is a list of semantic values.
	Note that the context of the call to <literal>decode</literal>
	should (eventually) supply a concrete type to allow selection
	of the appropriate instance. That is, you have to indicate in
	some way what type the semantic result should have.
2135	<literal>Decode_Result a</literal> is a synonym generated by
2136	<application>Happy</application>: for non-monadic semantics,
2137	it is equivalent to <literal>a</literal>; when monads are
2138	in use, it becomes the declared monad type.
2139	See the full <literal>expr-eval</literal> example for more
2140	information.
2141	</para>
2142<programlisting>
2143class TreeDecode a where
2144        decode_b :: (ForestId -> [Branch]) -> Branch -> [Decode_Result a]
2145decode :: TreeDecode a => (ForestId -> [Branch]) -> ForestId -> [Decode_Result a]
2146</programlisting>
2147
2148	<para>
2149	The GLR parser generator identifies the types involved in each
2150	semantic rule, hence the types of the functions, then creates
2151	a union containing distinct types. Values of this union are
2152	stored in the branches. (The union is actually a bit more complex:
2153	it must also distinguish patterns of dollar-variable usage, eg
2154	a function <literal>\x y -> x + y </literal> could be applied to
2155	the first and second constituents, or to the first and third.)
2156	The parser generator also creates instances of the
2157	<literal>TreeDecode</literal> class, which unpacks the semantic
2158	function and applies it across the decodings of the possible
2159	combinations of children. Effectively, it does a cartesian product
2160	operation across the lists of semantic results from each of the
2161	children. Eg <literal>[1,2] "+" [3,4]</literal> produces
2162	<literal>[4,5,5,6]</literal>.
2163	Information is extracted from token values using the patterns
2164	supplied by the user when declaring tokens and their Haskell
2165	representation, so the dollar-dollar convention works also.
2166	</para>
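	<para>
	The cartesian product step can be sketched in isolation
	(<literal>combine</literal> is a hypothetical name; the
	generated <literal>TreeDecode</literal> instances perform the
	equivalent combination for each child position):
	</para>

```haskell
-- Apply a two-argument semantic function across all combinations of
-- the children's semantic results.
combine :: (a -> b -> c) -> [a] -> [b] -> [c]
combine f xs ys = concatMap (\x -> map (f x) ys) xs

main :: IO ()
main = print (combine (+) [1,2] [3,4 :: Int])  -- prints [4,5,5,6]
```

	<para>
	This reproduces the <literal>[4,5,5,6]</literal> result quoted
	above.
	</para>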
2167	<para>
2168	The decoding process could be made more efficient by using
2169	memoisation techniques, but this hasn't been implemented since
2170	we believe the other (label) decoding mode is more useful. (If someone
2171	sends in a patch, we may include it in a future release -- but this
2172	might be tricky, eg require higher-order polymorphism?
2173	Plus, are there other ways of using this form of semantic function?)
2174	</para>
2175      </sect2>
2176
2177      <sect2 id="sec-glr-semantics-label">
2178        <title>Label decoding</title>
2179        <para>
2180	The labelling mode aims to label branches in the forest with
2181	information that supports subsequent processing, for example
2182	the filtering and prioritisation of analyses prior to extraction
2183	of favoured solutions. As above, code fragments are given in
2184	braces and can contain dollar-variables. But these variables
2185	are expanded to node names in the graph, with the intention of
2186	easing navigation.
2187	The following grammar is from the <literal>expr-tree</literal>
2188	example.
2189        </para>
2190<programlisting>
2191E :: {Tree ForestId Int}
2192  : E '+' E      { Plus  $1 $3 }
2193  | E '*' E      { Times $1 $3 }
2194  | i            { Const $1 }
2195</programlisting>
2196
2197        <para>
2198	Here, the semantic values provide more meaningful labels than
2199	the plain structural information. In particular, only the
2200	interesting parts of the branch are represented, and the
2201	programmer can clearly select or label the useful constituents
2202	if required. There is no need to remember that it is the first
2203	and third child in the branch which we need to extract, because
2204	the label only contains those values (the `noise' has been dropped).
2205	Consider also the difference between concrete and abstract syntax.
2206	The labels are oriented towards abstract syntax.
2207	Tokens are handled slightly differently here: when they appear
2208	as children in a reduction, their informational content can
2209	be extracted directly, hence the <literal>Const</literal> value
2210	above will be built with the <literal>Int</literal> value from
2211	the token, not some <literal>ForestId</literal>.
2212        </para>
2213
2214        <para>
2215	Note the useful technique of making the label types polymorphic
2216	in the position used for forest indices. This allows replacement
2217	at a later stage with more appropriate values, eg. inserting
2218	lists of actual subtrees from the final decoding.
2219        </para>
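	<para>
	A sketch of this technique (<literal>mapIdx</literal> is our
	own illustrative helper):
	</para>

```haskell
-- A label type polymorphic in the position used for forest indices.
data Tree i = Plus i i | Times i i | Const Int
  deriving Show

-- Replace indices at a later stage, e.g. with decoded subtrees.
mapIdx :: (a -> b) -> Tree a -> Tree b
mapIdx f (Plus  l r) = Plus  (f l) (f r)
mapIdx f (Times l r) = Times (f l) (f r)
mapIdx _ (Const n)   = Const n

main :: IO ()
main = print (mapIdx (\i -> "node" ++ show i) (Plus (0 :: Int) 1))
```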
2220	<para>
2221	Use of these labels is supported by a type class
2222	<literal>LabelDecode</literal>, which unpacks values of the
2223	automatically-generated union type <literal>GSem</literal>
2224	to the original type(s). The parser generator will create
2225	appropriate instances of this class, based on the type information
2226	in the grammar file. (Note that omitting type information leads
2227	to a default of <literal>()</literal>.)
2228	Observe that use of the labels is often like traversing an abstract
2229	syntax, and the structure of the abstract syntax type usually
2230	constrains the types of constituents; so once the overall type
2231	is fixed (eg. with a type cast or signature) then there are no
2232	problems with resolution of class instances.
2233	</para>
2234
2235<programlisting>
2236class LabelDecode a where
2237        unpack :: GSem -> a
2238</programlisting>
2239
2240        <para>
2241	Internally, the semantic values are packed in a union type as
2242	before, but there is no direct abstraction step. Instead, the
2243	<literal>ForestId</literal> values (from the dollar-variables)
2244	are bound when the corresponding branch is created from the
2245	list of constituent nodes. At this stage, token information
2246	is also extracted, using the patterns supplied by the user
2247	when declaring the tokens.
2248        </para>
2249      </sect2>
2250
2251      <sect2 id="sec-glr-semantics-tree-monad">
2252        <title>Monadic tree decoding</title>
2253        <para>
2254	You can use the <literal>%monad</literal> directive in the
2255	tree-decode mode.
2256	Essentially, the decoding process now creates a list of monadic
2257	values, using the monad type declared in the directive.
2258	The default handling of the semantic functions is to apply the
2259	relevant <literal>return</literal> function to the value being
2260	returned. You can over-ride this using the <literal>{% ... }</literal>
2261	convention. The declared <literal>(>>=)</literal> function is
2262	used to assemble the computations.
2263	</para>
2264	<para>
2265	Note that no attempt is made to share the results of monadic
2266	computations from sub-trees. (You could possibly do this by
2267	supplying a memoising lookup function for the decoding process.)
2268	Hence, the usual behaviour is that decoding produces whole
2269	monadic computations, each part of which is computed afresh
2270	(in depth-first order) when the whole is computed.
2271	Hence you should take care to initialise any relevant state
2272	before computing the results from multiple solutions.
2273	</para>
2274	<para>
2275	This facility is experimental, and we welcome comments or
2276	observations on the approach taken!
2277	An example is provided (<literal>examples/glr/expr-monad</literal>).
2278	It is the standard example of arithmetic expressions, except that
2279	the <literal>IO</literal> monad is used, and a user exception is
2280	thrown when the second argument to addition is an odd number.
2281	Running this example will show a zero (from the exception handler)
2282	instead of the expected number amongst the results from the other
2283	parses.
2284	</para>
2285      </sect2>
2286    </sect1>
2287
2288    <sect1 id="sec-glr-misc">
2289      <title>Further information</title>
2290
2291      <para>
2292      Other useful information...
2293      </para>
2294
2295      <sect2 id="sec-glr-misc-examples">
2296        <title>The GLR examples</title>
2297        <para>
2298	The directory <literal>examples/glr</literal> contains several examples
2299	from the small to the large. Please consult these or use them as a
2300	base for your experiments.
2301        </para>
2302      </sect2>
2303
2304      <sect2 id="sec-glr-misc-graphs">
2305        <title>Viewing forests as graphs</title>
2306        <para>
2307	If you run the examples with <application>GHC</application>, each
2308	run will produce a file <literal>out.daVinci</literal>. This is a
2309	graph in the format expected by the <emphasis>daVinci</emphasis>
2310	graph visualization tool.
2311	(See <ulink url="http://www.informatik.uni-bremen.de/~davinci/"/>
2312	for more information. Educational use licenses are currently
2313	available without charge.)
2314        </para>
2315	<para>
	We highly recommend looking at graphs of parse results - it really
	helps in understanding the output.
	The graph files are created with Sven Panne's library for
2319	communicating with <emphasis>daVinci</emphasis>, supplemented
2320	with some extensions due to Callaghan. Copies of this code are
2321	included in the examples directory, for convenience.
2322	If you are trying to view large and complex graphs, contact Paul
2323	Callaghan (there are tools and techniques to make the graphs more
2324	manageable).
2325	</para>
2326      </sect2>
2327
2328      <sect2 id="sec-glr-misc-applications">
2329        <title>Some Applications of GLR parsing</title>
2330        <para>
2331	GLR parsing (and related techniques) aren't just for badly written
2332	grammars or for things like natural language (NL) where ambiguity is
2333	inescapable. There are applications where ambiguity can represent
2334	possible alternatives in pattern-matching tasks, and the flexibility
2335	of these parsing techniques and the resulting graphs support deep
2336	analyses. Below, we briefly discuss some examples, a mixture from
2337	our recent work and from the literature.
2338        </para>
2339
2340	<variablelist>
2341	   <varlistentry>
2342	     <term>Gene sequence analysis</term>
2343	     <listitem>
2344	       <para>
2345	       Combinations of structures within gene sequences can be
2346	       expressed as a grammar, for example a "start" combination
2347	       followed by a "promoter" combination then the gene proper.
2348	       A recent undergraduate project has used this GLR implementation
	       to detect candidate matches in data, and then to filter these
2350	       matches with a mixture of local and global information.
2351	       </para>
2352	     </listitem>
2353	   </varlistentry>
2354	   <varlistentry>
2355	     <term>Rhythmic structure in poetry</term>
2356	     <listitem>
2357	       <para>
2358	       Rhythmic patterns in (English) poetry obey certain rules,
2359	       and in more modern poetry can break rules in particular ways
2360	       to achieve certain effects. The standard rhythmic patterns
2361	       (eg. iambic pentameter) can be encoded as a grammar, and
2362	       deviations from the patterns also encoded as rules.
2363	       The neutral reading can be parsed with this grammar, to
2364	       give a forest of alternative matches. The forest can be
2365	       analysed to give a preferred reading, and to highlight
2366	       certain technical features of the poetry.
2367	       An undergraduate project in Durham has used this implementation
2368	       for this purpose, with promising results.
2369	       </para>
2370	     </listitem>
2371	   </varlistentry>
2372	   <varlistentry>
2373	     <term>Compilers -- instruction selection</term>
2374	     <listitem>
2375	       <para>
2376	       Recent work has phrased the translation problem in
2377	       compilers from intermediate representation to an
2378	       instruction set for a given processor as a matching
2379	       problem. Different constructs at the intermediate
2380	       level can map to several combinations of machine
2381	       instructions. This knowledge can be expressed as a
2382	       grammar, and instances of the problem solved by
2383	       parsing. The parse forest represents competing solutions,
2384	       and allows selection of optimum solutions according
2385	       to various measures.
2386	       </para>
2387	     </listitem>
2388	   </varlistentry>
2389	   <varlistentry>
2390	     <term>Robust parsing of ill-formed input</term>
2391	     <listitem>
2392	       <para>
2393	       The extra flexibility of GLR parsing can simplify parsing
	of formal languages where a degree of `informality' is allowed,
	such as HTML. Modern browsers contain complex
	parsers which are designed to extract useful information
	from HTML text which doesn't follow the rules precisely,
	e.g. missing start tags or missing end tags.
	HTML with missing tags can be described by an ambiguous grammar,
	and it should be a simple matter to extract a usable
	interpretation from the resulting forest of parses.
2402	       Notice the technique: we widen the scope of the grammar,
2403	       parse with GLR, then extract a reasonable solution.
2404	       This is arguably simpler than pushing an LR(1) or LL(1)
2405	       parser past its limits, and also more maintainable.
2406	       </para>
2407	     </listitem>
2408	   </varlistentry>
2409	   <varlistentry>
2410	     <term>Natural Language Processing</term>
2411	     <listitem>
2412	       <para>
2413	       Ambiguity is inescapable in the syntax of most human languages.
2414	       In realistic systems, parse forests are useful to encode
2415	       competing analyses in an efficient way, and they also provide
2416	       a framework for further analysis and disambiguation. Note
2417	       that ambiguity can have many forms, from simple phrase
2418	       attachment uncertainty to more subtle forms involving mixtures
2419	       of word senses. If some degree of ungrammaticality is to be
2420	       tolerated in a system, which can be done by extending the
2421	       grammar with productions incorporating common forms of
2422	       infelicity, the degree of ambiguity increases further. For
	       systems used on arbitrary text, such as newspapers,
	       it is not uncommon for many sentences to permit several
	       hundred or more analyses. With such grammars, parse forest
	       techniques are essential.
	       Many recent NLP systems use such techniques, including
	       Durham's earlier LOLITA system - which was mostly
	       written in Haskell.
2430	       </para>
2431	     </listitem>
2432	   </varlistentry>
2433	</variablelist>
2434      </sect2>
2435
2436      <sect2 id="sec-glr-misc-workings">
2437        <title>Technical details</title>
2438        <para>
2439	The original implementation was developed by Ben Medlock,
2440	as his undergraduate final year project,
2441	using ideas from Peter Ljungloef's Licentiate thesis
2442	(see <ulink url="http://www.cs.chalmers.se/~peb/parsing"/>, and
2443	we recommend the thesis for its clear analysis of parsing
2444	algorithms).
2445	Ljungloef's version produces lists of parse trees, but Medlock
2446	adapted this to produce an explicit graph containing parse structure
2447	information. He also incorporated
2448	the code into <application>Happy</application>.
2449        </para>
2450
2451        <para>
2452	After Medlock's graduation, Callaghan extended the code to
2453	incorporate semantic information, and made several improvements
2454	to the original code, such as improved local packing and
	support for hidden left recursion. The performance of the
	code was significantly improved after changes of representation
	(e.g. to a chart-style data structure)
	and technique. Medlock's code was also used in several student
2459	projects, including analysis of gene sequences (Fischer) and
2460	analysis of rhythmic patterns in poetry (Henderson).
2461        </para>
2462
2463        <para>
2464	The current code implements the standard GLR algorithm extended
2465	to handle hidden left recursion. Such recursion, as in the grammar
2466	below from Rekers [1992], causes the standard algorithm to loop
2467	because the empty reduction <literal>A -> </literal> is always
2468	possible and the LR parser will not change state. Alternatively,
2469	there is a problem because an unknown (at the start of parsing)
2470	number of <literal>A</literal>
2471	items are required, to match the number of <literal>i</literal>
2472	tokens in the input.
2473        </para>
2474<programlisting>
2475S -> A Q i | +
2476A ->
2477</programlisting>
2478	<para>
2479	The solution to this is not surprising. Problematic recursions
2480	are detected as zero-span reductions in a state which has a
2481	<literal>goto</literal> table entry looping to itself. A special
2482	symbol is pushed to the stack on the first such reduction,
2483	and such reductions are done at most once for any token
2484	alternative for any input position.
2485	When popping from the stack, if the last token being popped
2486	is such a special symbol, then two stack tails are returned: one
2487	corresponding to a conventional pop (which removes the
2488	symbol) and the other to a duplication of the special symbol
2489	(the stack is not changed, but a copy of the symbol is returned).
2490	This allows sufficient copies of the empty symbol to appear
2491	on some stack, hence allowing the parse to complete.
2492	</para>
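	<para>
	The two-tailed pop can be sketched in isolation. This is an
	illustration of the idea only, not Happy's actual internal code;
	the names <literal>Sym</literal> and <literal>popOne</literal>
	are invented for the example:
	</para>

```haskell
-- Sketch of the pop rule described above: removing one symbol from a
-- stack whose top is the special empty-reduction symbol yields two
-- candidate tails, one consuming the symbol and one duplicating it.
data Sym = Ordinary Char | Special deriving (Eq, Show)

popOne :: [Sym] -> [[Sym]]
popOne []               = []
popOne (Special : rest) = [rest, Special : rest]  -- conventional pop, or duplicate
popOne (_ : rest)       = [rest]                  -- ordinary symbols simply pop
```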
2493
2494	<para>
2495	The forest is held in a chart-style data structure, and this supports
2496	local ambiguity packing (chart parsing is discussed in Ljungloef's
2497	thesis, among other places).
2498	A limited amount of packing of live stacks is also done, to avoid
2499	some repetition of work.
2500	</para>
2501
2502        <para>
2503	[Rekers 1992] Parser Generation for Interactive Environments,
2504	PhD thesis, University of Amsterdam, 1992.
2505        </para>
2506      </sect2>
2507
2508      <sect2 id="sec-glr-misc-filter">
2509        <title>The <option>--filter</option> option</title>
2510        <para>
2511	You might have noticed this GLR-related option. It is an experimental
2512	feature intended to restrict the amount of structure retained in the
2513	forest by discarding everything not required for the semantic
	results. It may or may not work, and may be fixed in a future
2515	release.
2516        </para>
2517      </sect2>
2518
2519      <sect2 id="sec-glr-misc-limitations">
2520        <title>Limitations and future work</title>
2521	<para>
2522	The parser supports hidden left recursion, but makes no attempt
2523	to handle cyclic grammars that have rules which do not consume any
2524	input. If you have a grammar like this, for example with rules like
2525	<literal>S -> S</literal> or
2526	<literal>S -> A S | x; A -> empty</literal>, the implementation will
	loop until you run out of stack - but if this does happen, it
	often happens quite quickly!
2529	</para>
2530        <para>
2531	The code has been used and tested frequently over the past few years,
2532	including being used in several undergraduate projects. It should be
2533	fairly stable, but as usual, can't be guaranteed bug-free. One day
2534	I will write it in Epigram!
2535	</para>
2536        <para>
2537	If you have suggestions for improvements, or requests for features,
2538	please contact Paul
2539	Callaghan. There are some changes I am considering, and some
2540	views and/or encouragement from users will be much appreciated.
2541	Further information can be found on Callaghan's
2542	<ulink url="http://www.dur.ac.uk/p.c.callaghan/happy-glr">GLR parser
2543	page</ulink>.
2544        </para>
2545      </sect2>
2546
2547      <sect2 id="sec-glr-misc-acknowledgements">
2548        <title>Thanks and acknowledgements</title>
2549        <para>
2550	Many thanks to the people who have used and tested this software
2551	in its various forms, including Julia Fischer, James Henderson, and
2552	Aarne Ranta.
2553        </para>
2554      </sect2>
2555    </sect1>
2556  </chapter>
2557
2558<!-- Attribute Grammars ================================================= -->
2559  <chapter id="sec-AttributeGrammar">
2560    <title>Attribute Grammars</title>
2561
2562    <sect1 id="sec-introAttributeGrammars">
2563    <title>Introduction</title>
2564
2565    <para>Attribute grammars are a formalism for expressing syntax directed
2566    translation of a context-free grammar.  An introduction to attribute grammars
2567    may be found <ulink
2568    url="http://www-rocq.inria.fr/oscar/www/fnc2/manual/node32.html">here</ulink>.
2569    There is also an article in the Monad Reader about attribute grammars and a
2570    different approach to attribute grammars using Haskell
2571    <ulink url="http://www.haskell.org/haskellwiki/The_Monad.Reader/Issue4/Why_Attribute_Grammars_Matter">here</ulink>.
2572    </para>
2573
2574    <para>
2575    The main practical difficulty that has prevented attribute grammars from
2576    gaining widespread use involves evaluating the attributes.  Attribute grammars
2577    generate non-trivial data dependency graphs that are difficult to evaluate
2578    using mainstream languages and techniques.  The solutions generally involve
2579    restricting the form of the grammars or using big hammers like topological sorts.
2580    However, a language which supports lazy evaluation, such as Haskell, has no
2581    problem forming complex data dependency graphs and evaluating them.  The primary
2582    intellectual barrier to attribute grammar adoption seems to stem from the fact that
2583    most programmers have difficulty with the declarative nature of the
2584    specification.  Haskell programmers, on the other hand, have already
2585    embraced a purely functional language.  In short, the Haskell language and
2586    community seem like a perfect place to experiment with attribute grammars.
2587    </para>
2588
2589    <para>
    Embedding attribute grammars in Happy is easy because Haskell supports
2591    three important features: higher order functions, labeled records, and
2592    lazy evaluation.  Attributes are encoded as fields in a labeled record. The parse
2593    result of each non-terminal in the grammar is a function which takes a record
2594    of inherited attributes and returns a record of synthesized attributes.  In each
2595    production, the attributes of various non-terminals are bound together using
2596    <literal>let</literal>.
2597    Finally, at the end of the parse, a distinguished attribute is evaluated to be
2598    the final result.  Lazy evaluation takes care of evaluating each attribute in the
2599    correct order, resulting in an attribute grammar system that is capable of evaluating
2600    a fairly large class of attribute grammars.
2601    </para>
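    <para>
    The encoding can be illustrated by hand. The following sketch uses
    invented names (<literal>Attrs</literal>, <literal>leaf</literal>,
    <literal>node</literal>); the code Happy actually generates is
    different, but the shape is the same: a non-terminal's semantic value
    is a function from inherited attributes to synthesized attributes,
    and <literal>let</literal> wires the children together.
    </para>

```haskell
-- One record holds all attributes; each non-terminal is a function
-- from the inherited attributes it receives to the attributes it
-- synthesizes.
data Attrs = Attrs { value :: Int, depth :: Int }

-- A leaf synthesizes its inherited depth as its value.
leaf :: Attrs -> Attrs
leaf inh = Attrs { value = depth inh, depth = depth inh }

-- A binary production: the children inherit depth+1, and the parent
-- synthesizes the sum of the children's values.
node :: (Attrs -> Attrs) -> (Attrs -> Attrs) -> Attrs -> Attrs
node l r inh =
  let lAttrs = l (inh { depth = depth inh + 1 })
      rAttrs = r (inh { depth = depth inh + 1 })
  in Attrs { value = value lAttrs + value rAttrs, depth = depth inh }
```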
2602
2603    <para>
2604    Attribute grammars in Happy do not use any language extensions, so the
    parsers are Haskell 98 (assuming you don't use the GHC-specific
    <option>-g</option> option).
    Currently, attribute grammars cannot be generated for GLR parsers
    (it's not yet clear how these features should interact).
2608    </para>
2609
2610    </sect1>
2611
2612    <sect1 id="sec-AtrributeGrammarsInHappy">
2613    <title>Attribute Grammars in Happy</title>
2614
2615    <sect2 id="sec-declaringAttributes">
2616      <title>Declaring Attributes</title>
2617
2618      <para>
2619      The presence of one or more <literal>%attribute</literal> directives indicates
2620      that a grammar is an attribute grammar.  Attributes are calculated properties
2621      that are associated with the non-terminals in a parse tree.  Each
2622      <literal>%attribute</literal> directive generates a field in the attributes
2623      record with the given name and type.
2624      </para>
2625
2626      <para>
2627      The first <literal>%attribute</literal>
2628      directive in a grammar defines the default attribute.  The
2629      default attribute is distinguished in two ways: 1) if no attribute specifier is
2630      given on an attribute reference,
2631      the default attribute is assumed (see <xref linkend="sec-semanticRules"/>)
2632      and 2) the value for the default attribute of the starting non-terminal becomes the
2633      return value of the parse.
2634      </para>
2635
2636      <para>
2637      Optionally, one may specify a type declaration for the attribute record using
2638      the <literal>%attributetype</literal> declaration.  This allows you to define the
2639      type given to the attribute record and, more importantly, allows you to introduce
2640      type variables that can be subsequently used in <literal>%attribute</literal>
2641      declarations.  If the <literal>%attributetype</literal> directive is given without
2642      any <literal>%attribute</literal> declarations, then the <literal>%attributetype</literal>
2643      declaration has no effect.
2644      </para>
2645
2646      <para>
2647      For example, the following declarations:
2648      </para>
2649
2650<programlisting>
2651%attributetype { MyAttributes a }
2652%attribute value { a }
2653%attribute num   { Int }
2654%attribute label { String }
2655</programlisting>
2656
2657      <para>
2658      would generate this attribute record declaration in the parser:
2659      </para>
2660
2661<programlisting>
2662data MyAttributes a =
2663   HappyAttributes {
2664     value :: a,
2665     num :: Int,
2666     label :: String
2667   }
2668</programlisting>
2669
2670       <para>
2671       and <literal>value</literal> would be the default attribute.
2672       </para>
2673
2674    </sect2>
2675
2676    <sect2 id="sec-semanticRules">
2677      <title>Semantic Rules</title>
2678
2679      <para>In an ordinary Happy grammar, a production consists of a list
2680      of terminals and/or non-terminals followed by an uninterpreted
2681      code fragment enclosed in braces.  With an attribute grammar, the
2682      format is very similar, but the braces enclose a set of semantic rules
2683      rather than uninterpreted Haskell code.  Each semantic rule is either
2684      an attribute calculation or a conditional, and rules are separated by
2685      semicolons<footnote><para>Note that semantic rules must not rely on
2686      layout, because whitespace alignment is not guaranteed to be
2687      preserved</para></footnote>.
2688      </para>
2689
2690      <para>
2691      Both attribute calculations and conditionals may contain attribute references
2692      and/or terminal references.  Just like regular Happy grammars, the tokens
2693      <literal>$1</literal> through <literal>$&lt;n&gt;</literal>, where
2694      <literal>n</literal> is the number of symbols in the production, refer to
2695      subtrees of the parse.  If the referenced symbol is a terminal, then the
2696      value of the reference is just the value of the terminal, the same way as
2697      in a regular Happy grammar.  If the referenced symbol is a non-terminal,
2698      then the reference may be followed by an attribute specifier, which is
2699      a dot followed by an attribute name.  If the attribute specifier is omitted,
2700      then the default attribute is assumed (the default attribute is the first
2701      attribute appearing in an <literal>%attribute</literal> declaration).
2702      The special reference <literal>$$</literal> references the
2703      attributes of the current node in the parse tree; it behaves exactly
2704      like the numbered references.  Additionally, the reference <literal>$></literal>
2705      always references the rightmost symbol in the production.
2706      </para>
2707
2708      <para>
2709      An attribute calculation rule is of the form:
2710      </para>
2711<programlisting>
2712&lt;attribute reference&gt; = &lt;Haskell expression&gt;
2713</programlisting>
2714      <para>
2715      A rule of this form defines the value of an attribute, possibly as a function
2716      of the attributes of <literal>$$</literal> (inherited attributes), the attributes
2717      of non-terminals in the production (synthesized attributes), or the values of
2718      terminals in the production.  The value for an attribute can only
2719      be defined once for a particular production.
2720      </para>
2721
2722      <para>
2723      The following rule calculates the default attribute of the current production in
2724      terms of the first and second items of the production (a synthesized attribute):
2725      </para>
2726<programlisting>
2727$$ = $1 : $2
2728</programlisting>
2729
2730      <para>
2731      This rule calculates the length attribute of a non-terminal in terms of the
2732      length of the current non-terminal (an inherited attribute):
2733      </para>
2734<programlisting>
2735$1.length = $$.length + 1
2736</programlisting>
2737
2738      <para>
2739      Conditional rules allow the rejection of strings due to context-sensitive properties.
2740      All conditional rules have the form:
2741      </para>
2742<programlisting>
2743where &lt;Haskell expression&gt;
2744</programlisting>
2745      <para>
2746      For non-monadic parsers, all conditional expressions
2747      must be of the same (monomorphic) type.  At
2748      the end of the parse, the conditionals will be reduced using
2749      <literal>seq</literal>, which gives the grammar an opportunity to call
2750      <literal>error</literal> with an informative message.  For monadic parsers,
2751      all conditional statements must have type <literal>Monad m => m ()</literal> where
2752      <literal>m</literal> is the monad in which the parser operates.  All conditionals
2753      will be sequenced at the end of the parse, which allows the conditionals to call
2754      <literal>fail</literal> with an informative message.
2755      </para>
2756
2757      <para>
2758      The following conditional rule will cause the (non-monadic) parser to fail
2759      if the inherited length attribute is not 0.
2760      </para>
2761<programlisting>
2762where if $$.length == 0 then () else error "length not equal to 0"
2763</programlisting>
2764
2765      <para>
2766      This conditional is the monadic equivalent:
2767      </para>
2768<programlisting>
2769where unless ($$.length == 0) (fail "length not equal to 0")
2770</programlisting>
2771
2772
2773    </sect2>
2774    </sect1>
2775
2776    <sect1 id="sec-AttrGrammarLimits">
2777      <title>Limits of Happy Attribute Grammars</title>
2778
2779      <para>
2780	If you are not careful, you can write an attribute grammar which fails to
2781	terminate.  This generally happens when semantic rules
2782	are written which cause a circular dependency on the value of
2783	an attribute.  Even if the value of the attribute is well-defined (that is,
2784	if a fixpoint calculation over attribute values will eventually converge to
2785	a unique solution), this attribute grammar system will not evaluate such
2786	grammars.
2787      </para>
2788      <para>
2789	One practical way to overcome this limitation is to ensure that each attribute
2790	is always used in either a top-down (inherited) fashion or in a bottom-up
2791	(synthesized) fashion.  If the calculations are sufficiently lazy, one can
2792	"tie the knot" by synthesizing a value in one attribute, and then assigning
2793	that value to another, inherited attribute at some point in the parse tree.
2794	This technique can be useful for common tasks like building symbol tables for
2795	a syntactic scope and making that table available to sub-nodes of the parse.
2796      </para>
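      <para>
	The classic "repmin" function shows the same knot-tying idea in
	miniature, outside of any parser: a single lazy traversal
	synthesizes the minimum of a tree while simultaneously inheriting
	that minimum back down to relabel every leaf.
      </para>

```haskell
-- repmin: replace every leaf with the tree's minimum in one traversal.
-- The minimum 'm' is both an output of 'walk' (synthesized) and an
-- input to it (inherited); lazy evaluation makes the circularity safe.
data Tree = Leaf Int | Fork Tree Tree deriving (Eq, Show)

repmin :: Tree -> Tree
repmin t = t'
  where
    (m, t') = walk m t   -- the knot
    walk rep (Leaf n)   = (n, Leaf rep)
    walk rep (Fork l r) =
      let (ml, l') = walk rep l
          (mr, r') = walk rep r
      in (min ml mr, Fork l' r')
```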
2797    </sect1>
2798
2799
2800    <sect1 id="sec-AttributeGrammarExample">
2801      <title>Example Attribute Grammars</title>
2802      <para>
2803      The following two toy attribute grammars may prove instructive.  The first is
2804      an attribute grammar for the classic context-sensitive grammar
2805      { a^n b^n c^n | n >= 0 }.  It demonstrates the use of conditionals,
2806      inherited and synthesized attributes.
2807      </para>
2808
2809<programlisting>
2810{
2811module ABCParser (parse) where
2812}
2813
2814%tokentype { Char }
2815
2816%token a { 'a' }
2817%token b { 'b' }
2818%token c { 'c' }
2819%token newline { '\n' }
2820
2821%attributetype { Attrs a }
2822%attribute value { a }
2823%attribute len   { Int }
2824
2825%name parse abcstring
2826
2827%%
2828
2829abcstring
2830   : alist blist clist newline
2831        { $$ = $1 ++ $2 ++ $3
2832        ; $2.len = $1.len
2833        ; $3.len = $1.len
2834        }
2835
2836alist
2837   : a alist
2838        { $$ = $1 : $2
2839        ; $$.len = $2.len + 1
2840        }
2841   |    { $$ = []; $$.len = 0 }
2842
2843blist
2844   : b blist
2845        { $$ = $1 : $2
2846        ; $2.len = $$.len - 1
2847        }
2848   |    { $$ = []
2849        ; where failUnless ($$.len == 0) "blist wrong length"
2850        }
2851
2852clist
2853   : c clist
2854        { $$ = $1 : $2
2855        ; $2.len = $$.len - 1
2856        }
2857   |    { $$ = []
2858        ; where failUnless ($$.len == 0) "clist wrong length"
2859        }
2860
2861{
2862happyError = error "parse error"
2863failUnless b msg = if b then () else error msg
2864}
2865</programlisting>
2866
2867<para>
2868This grammar parses binary numbers and
2869calculates their value.  It demonstrates
2870the use of inherited and synthesized attributes.
2871</para>
2872
2873
2874<programlisting>
2875{
2876module BitsParser (parse) where
2877}
2878
2879%tokentype { Char }
2880
2881%token minus { '-' }
2882%token plus  { '+' }
2883%token one   { '1' }
2884%token zero  { '0' }
2885%token newline { '\n' }
2886
2887%attributetype { Attrs }
2888%attribute value { Integer }
2889%attribute pos   { Int }
2890
2891%name parse start
2892
2893%%
2894
2895start
2896   : num newline { $$ = $1 }
2897
2898num
2899   : bits        { $$ = $1       ; $1.pos = 0 }
2900   | plus bits   { $$ = $2       ; $2.pos = 0 }
2901   | minus bits  { $$ = negate $2; $2.pos = 0 }
2902
2903bits
2904   : bit         { $$ = $1
2905                 ; $1.pos = $$.pos
2906                 }
2907
2908   | bits bit    { $$ = $1 + $2
2909                 ; $1.pos = $$.pos + 1
2910                 ; $2.pos = $$.pos
2911                 }
2912
2913bit
2914   : zero        { $$ = 0 }
2915   | one         { $$ = 2^($$.pos) }
2916
2917{
2918happyError = error "parse error"
2919}
2920</programlisting>
2921
2922
2923    </sect1>
2924
2925  </chapter>
2926
2927<!-- Invoking ============================================================ -->
2928
2929  <chapter id="sec-invoking">
2930    <title>Invoking <application>Happy</application></title>
2931
2932    <para>An invocation of <application>Happy</application> has the following syntax:</para>
2933
2934<screen>$ happy [ <emphasis>options</emphasis> ] <emphasis>filename</emphasis> [ <emphasis>options</emphasis> ]</screen>
2935
2936    <para>All the command line options are optional (!) and may occur
2937    either before or after the input file name. Options that take
2938    arguments may be given multiple times, and the last occurrence
2939    will be the value used.</para>
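    <para>For example, the following invocation (the file names are
    illustrative) generates a GHC-optimised parser together with an
    info file:</para>

<screen>$ happy -agc --info MyGrammar.y -o MyParser.hs</screen>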
2940
2941    <para>There are two types of grammar files,
2942    <filename>file.y</filename> and <filename>file.ly</filename>, with
2943    the latter observing the reverse comment (or literate) convention
    (i.e. each code line must begin with the character
    <literal>&gt;</literal>; lines which don't begin with
    <literal>&gt;</literal> are treated as comments).  The examples
2947    distributed with <application>Happy</application> are all of the
2948    .ly form.</para>
2949    <indexterm>
2950      <primary>literate grammar files</primary>
2951    </indexterm>
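    <para>A minimal (purely illustrative) literate grammar file might
    look like this, with commentary lines left unmarked:</para>

<programlisting>
This line is a comment; only lines beginning with > are code.

> %tokentype { Char }
> %token x { 'x' }
> %%
> count :         { 0 }
>       | count x { $1 + 1 }
</programlisting>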
2952
2953    <para>The flags accepted by <application>Happy</application> are as follows:</para>
2954
2955    <variablelist>
2956
2957      <varlistentry>
2958	<term><option>-o</option> <replaceable>file</replaceable></term>
2959	<term><option>--outfile</option>=<replaceable>file</replaceable></term>
2960	<listitem>
2961	  <para>Specifies the destination of the generated parser module.
2962	  If omitted, the parser will be placed in
2963          <replaceable>file</replaceable><literal>.hs</literal>,
2964	  where <replaceable>file</replaceable> is the name of the input
2965          file with any extension removed.</para>
2966	</listitem>
2967      </varlistentry>
2968
2969      <varlistentry>
2970	<term><option>-i</option><optional><replaceable>file</replaceable></optional></term>
2971	<term><option>--info</option><optional>=<replaceable>file</replaceable></optional></term>
2972	<listitem>
2973	  <indexterm>
2974	    <primary>info file</primary>
2975	  </indexterm>
2976	  <para> Directs <application>Happy</application> to produce an info file
2977          containing detailed information about the grammar, parser
2978          states, parser actions, and conflicts.  Info files are vital
2979          during the debugging of grammars.  The filename argument is
2980          optional (note that there's no space between
2981          <literal>-i</literal> and the filename in the short
2982          version), and if omitted the info file will be written to
2983          <replaceable>file</replaceable><literal>.info</literal> (where
2984          <replaceable>file</replaceable> is the input file name with any
2985          extension removed).</para>
2986	</listitem>
2987      </varlistentry>
2988
2989      <varlistentry>
2990	<term><option>-p</option><optional><replaceable>file</replaceable></optional></term>
2991	<term><option>--pretty</option><optional>=<replaceable>file</replaceable></optional></term>
2992	<listitem>
2993	  <indexterm>
2994	    <primary>pretty print</primary>
2995	  </indexterm>
	  <para> Directs <application>Happy</application> to produce a file
          containing a pretty-printed form of the grammar, with only
          the productions and without any semantic actions or type signatures.
2999          If no file name is provided, then the file name will be computed
3000          by replacing the extension of the input file with
3001          <literal>.grammar</literal>.
3002    </para>
3003	</listitem>
3004      </varlistentry>
3005
3006
3007
3008      <varlistentry>
3009	<term><option>-t</option> <replaceable>dir</replaceable></term>
3010	<term><option>--template</option>=<replaceable>dir</replaceable></term>
3011	<listitem>
3012	  <indexterm>
3013	    <primary>template files</primary>
3014	  </indexterm>
3015	  <para>Instructs <application>Happy</application> to use this directory
3016          when looking for template files: these files contain the
3017          static code that <application>Happy</application> includes in every
3018          generated parser.  You shouldn't need to use this option if
3019          <application>Happy</application> is properly configured for your
3020          computer.</para>
3021	</listitem>
3022      </varlistentry>
3023
3024      <varlistentry>
3025	<term><option>-m</option> <replaceable>name</replaceable></term>
3026	<term><option>--magic-name</option>=<replaceable>name</replaceable></term>
3027	<listitem>
3028	  <para> <application>Happy</application> prefixes all the symbols it uses internally
3029          with either <literal>happy</literal> or <literal>Happy</literal>.  To use a
3030          different string, for example if the use of <literal>happy</literal>
3031          is conflicting with one of your own functions, specify the
3032          prefix using the <option>-m</option> option.</para>
3033	</listitem>
3034      </varlistentry>
3035
3036      <varlistentry>
3037	<term><option>-s</option></term>
3038	<term><option>--strict</option></term>
3039	<listitem>
3040	  <para>NOTE: the <option>--strict</option> option is
3041	  experimental and may cause unpredictable results.</para>
3042
	  <para>This option causes the right hand side of each
	  production (the semantic value) to be evaluated eagerly at
	  the moment the production is reduced.  If the lazy behaviour
	  is not required, then using this option will improve
	  performance and may reduce space leaks.  Note that the
	  parser as a whole is never lazy: the whole input will
	  always be consumed before any output is produced, regardless
	  of the setting of the <option>--strict</option> flag.</para>
3051	</listitem>
3052      </varlistentry>
3053
3054      <varlistentry>
3055	<term><option>-g</option></term>
3056	<term><option>--ghc</option></term>
3057	<listitem>
3058	  <indexterm>
3059	    <primary>GHC</primary>
3060	  </indexterm>
3061	  <indexterm>
3062	    <primary>back-ends</primary>
3063	    <secondary>GHC</secondary>
3064	  </indexterm>
3065	  <para>Instructs <application>Happy</application> to generate a parser
3066	  that uses GHC-specific extensions to obtain faster code.</para>
3067	</listitem>
3068      </varlistentry>
3069
3070      <varlistentry>
3071	<term><option>-c</option></term>
3072	<term><option>--coerce</option></term>
3073	<listitem>
3074	  <indexterm>
3075	    <primary>coerce</primary>
3076	  </indexterm>
3077	  <indexterm>
3078	    <primary>back-ends</primary>
3079	    <secondary>coerce</secondary>
3080	  </indexterm>
3081	  <para> Use GHC's <literal>unsafeCoerce#</literal> extension to
          generate smaller, faster parsers.  Type-safety isn't
3083          compromised.</para>
3084
3085	  <para>This option may only be used in conjunction with
3086          <option>-g</option>.</para>
3087	</listitem>
3088      </varlistentry>
3089
3090      <varlistentry>
3091	<term><option>-a</option></term>
3092	<term><option>--arrays</option></term>
3093	<listitem>
3094	  <indexterm>
3095	    <primary>arrays</primary>
3096	  </indexterm>
3097	  <indexterm>
3098	    <primary>back-ends</primary>
3099	    <secondary>arrays</secondary>
3100	  </indexterm>
3101	  <para> Instructs <application>Happy</application> to generate a parser
          using an array-based shift-reduce parser.  When used in
3103          conjunction with <option>-g</option>, the arrays will be
3104          encoded as strings, resulting in faster parsers.  Without
3105          <option>-g</option>, standard Haskell arrays will be
3106          used.</para>
3107	</listitem>
3108      </varlistentry>
3109
3110      <varlistentry>
3111	<term><option>-d</option></term>
3112	<term><option>--debug</option></term>
3113	<listitem>
3114	  <indexterm>
3115	    <primary>debug</primary>
3116	  </indexterm>
3117	  <indexterm>
3118	    <primary>back-ends</primary>
3119	    <secondary>debug</secondary>
3120	  </indexterm>
3121	  <para>Generate a parser that will print debugging
3122	  information to <literal>stderr</literal> at run-time,
3123	  including all the shifts, reductions, state transitions and
3124	  token inputs performed by the parser.</para>
3125
3126	  <para>This option can only be used in conjunction with
3127	  <option>-a</option>.</para>
3128	</listitem>
3129      </varlistentry>
3130
3131      <varlistentry>
3132        <term><option>-l</option></term>
3133        <term><option>--glr</option></term>
3134        <listitem>
3135          <indexterm>
3136            <primary>glr</primary>
3137          </indexterm>
3138          <indexterm>
3139            <primary>back-ends</primary>
3140            <secondary>glr</secondary>
3141          </indexterm>
3142          <para>Generate a GLR parser for ambiguous grammars.</para>
3143        </listitem>
3144      </varlistentry>
3145
3146      <varlistentry>
3147        <term><option>-k</option></term>
3148        <term><option>--decode</option></term>
3149        <listitem>
3150          <indexterm>
3151            <primary>decode</primary>
3152          </indexterm>
3153          <para>Generate simple decoding code for GLR result.</para>
3154        </listitem>
3155      </varlistentry>
3156
3157      <varlistentry>
3158        <term><option>-f</option></term>
3159        <term><option>--filter</option></term>
3160        <listitem>
3161          <indexterm>
3162            <primary>filter</primary>
3163          </indexterm>
3164          <para>Filter the GLR parse forest with respect to semantic usage.</para>
3165        </listitem>
3166      </varlistentry>
3167
3168      <varlistentry>
3169	<term><option>-?</option></term>
3170	<term><option>--help</option></term>
3171	<listitem>
3172	  <para>Print usage information on standard output then exit
3173	  successfully.</para>
3174	</listitem>
3175      </varlistentry>
3176
3177      <varlistentry>
3178	<term><option>-V</option></term>
3179	<term><option>--version</option></term>
3180	<listitem>
3181	  <para>Print version information on standard output then exit
	  successfully. Note that for legacy reasons <option>-v</option>
	  is also supported, but its use is deprecated;
	  <option>-v</option> will be used for verbose mode when it is
	  actually implemented.</para>
3186	</listitem>
3187      </varlistentry>
3188
3189    </variablelist>
3190
3191  </chapter>
3192
3193  <chapter id="sec-grammar-files">
3194    <title>Syntax of Grammar Files</title>
3195
3196    <para>The input to <application>Happy</application> is a text file containing
3197    the grammar of the language you want to parse, together with some
3198    annotations that help the parser generator make a legal Haskell
3199    module that can be included in your program.  This section gives
3200    the exact syntax of grammar files. </para>
3201
3202    <para>The overall format of the grammar file is given below:</para>
3203
3204<programlisting>
3205&lt;optional module header&gt;
3206&lt;directives&gt;
3207%%
3208&lt;grammar&gt;
3209&lt;optional module trailer&gt;
3210</programlisting>
3211
3212    <indexterm>
3213      <primary>module</primary>
3214      <secondary>header</secondary>
3215    </indexterm>
3216    <indexterm>
3217      <primary>module</primary>
3218      <secondary>trailer</secondary>
3219    </indexterm>
3220    <para>If the name of the grammar file ends in <literal>.ly</literal>, then
3221    it is assumed to be a literate script.  All lines except those
3222    beginning with a <literal>&gt;</literal> will be ignored, and the
3223    <literal>&gt;</literal> will be stripped from the beginning of all the code
3224    lines.  There must be a blank line between each code section
3225    (lines beginning with <literal>&gt;</literal>) and comment section.
3226    Grammars not using the literate notation must be in a file with
3227    the <literal>.y</literal> suffix.</para>
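    <para>For example, a fragment of a literate grammar file might look
    like this, with the code lines marked by <literal>&gt;</literal> and
    separated from the commentary by blank lines (the production shown
    is purely illustrative):</para>

<programlisting>
The following production recognises an optional semicolon:

&gt; opt_semi : ';'          { () }
&gt;          |              { () }
</programlisting>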
3228
3229    <sect1 id="sec-lexical-rules">
3230      <title>Lexical Rules</title>
3231
3232<para>Identifiers in <application>Happy</application> grammar files must take the following form (using
3233the BNF syntax from the Haskell Report):</para>
3234
3235<programlisting>
3236id      ::= alpha { idchar }
3237          | ' { any{^'} | \' } '
3238          | " { any{^"} | \" } "
3239
3240alpha   ::= A | B | ... | Z
3241          | a | b | ... | z
3242
3243idchar  ::= alpha
3244          | 0 | 1 | ... | 9
3245          | _
3246</programlisting>
3247
3248    </sect1>
3249
3250    <sect1 id="sec-module-header">
3251      <title>Module Header</title>
3252
3253      <indexterm>
3254	<primary>module</primary>
3255	<secondary>header</secondary>
3256      </indexterm>
3257      <para>This section is optional, but if included takes the
3258      following form:</para>
3259
3260<programlisting>
3261{
3262&lt;Haskell module header&gt;
3263}
3264</programlisting>
3265
3266      <para>The Haskell module header contains the module name,
3267      exports, and imports.  No other code is allowed in the
3268      header&mdash;this is because <application>Happy</application> may need to include
3269      its own <literal>import</literal> statements directly after the user
3270      defined header.</para>
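      <para>For example, a typical module header might name the parser
      module, export the entry point, and import the token type (all
      names here are illustrative):</para>

<programlisting>
{
module Parser (parseExp) where

import Lexer (Token(..))
}
</programlisting>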
3271
3272    </sect1>
3273
3274    <sect1 id="sec-directives">
3275      <title>Directives</title>
3276
3277      <para>This section contains a number of lines of the form:</para>
3278
3279<programlisting>
3280%&lt;directive name&gt; &lt;argument&gt; ...
3281</programlisting>
3282
3283      <para>The statements here are all annotations to help
3284      <application>Happy</application> generate the Haskell code for the grammar.
3285      Some of them are optional, and some of them are required.</para>
3286
3287      <sect2 id="sec-token-type">
3288	<title>Token Type</title>
3289
3290<programlisting>
3291%tokentype   { &lt;valid Haskell type&gt; }
3292</programlisting>
3293
3294	<indexterm>
3295	  <primary><literal>%tokentype</literal></primary>
3296	</indexterm>
3297	<para>(mandatory) The <literal>%tokentype</literal> directive gives the
3298        type of the tokens passed from the lexical analyser to the
3299        parser (in order that <application>Happy</application> can supply types for
3300        functions and data in the generated parser).</para>
3301
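	<para>For example, if the lexer hands the parser values of a
        user-defined <literal>Token</literal> type (the name is
        illustrative), the directive would read:</para>

<programlisting>
%tokentype { Token }
</programlisting>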
3302      </sect2>
3303
3304      <sect2 id="sec-tokens">
3305	<title>Tokens</title>
3306
3307<programlisting>
3308%token &lt;name&gt; { &lt;Haskell pattern&gt; }
3309       &lt;name&gt; { &lt;Haskell pattern&gt; }
3310       ...
3311</programlisting>
3312
3313	<indexterm>
3314	  <primary><literal>%token</literal></primary>
3315	</indexterm>
3316	<para>(mandatory) The <literal>%token</literal> directive is used to
3317        tell <application>Happy</application> about all the terminal symbols used
3318        in the grammar.  Each terminal has a name, by which it is
3319        referred to in the grammar itself, and a Haskell
3320        representation enclosed in braces.  Each of the patterns must
3321        be of the same type, given by the <literal>%tokentype</literal>
3322        directive.</para>
3323
3324	<para>The name of each terminal follows the lexical rules for
3325        <application>Happy</application> identifiers given above.  There are no
3326        lexical differences between terminals and non-terminals in the
3327        grammar, so it is recommended that you stick to a convention;
3328        for example using upper case letters for terminals and lower
3329        case for non-terminals, or vice-versa.</para>
3330
3331	<para><application>Happy</application> will give you a warning if you try
3332        to use the same identifier both as a non-terminal and a
3333        terminal, or introduce an identifier which is declared as
3334        neither.</para>
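	<para>As an illustration, given a hypothetical token type with
        constructors such as <literal>TokenLet</literal> and
        <literal>TokenInt</literal>, a <literal>%token</literal>
        directive might look like this:</para>

<programlisting>
%token
      let     { TokenLet }
      in      { TokenIn }
      '+'     { TokenPlus }
      int     { TokenInt _ }
</programlisting>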
3335
3336	<para>To save writing lots of projection functions that map
3337        tokens to their components, you can include
3338        <literal>&dollar;&dollar;</literal> in your Haskell pattern. For
3339        example:</para>
3340	<indexterm>
3341	  <primary><literal>&dollar;&dollar;</literal></primary>
3342	</indexterm>
3343
3344<programlisting>
3345%token INT { TokenInt $$ }
3346       ...
3347</programlisting>
3348
3349<para>This makes the semantic value of <literal>INT</literal> refer to the first argument
3350of <literal>TokenInt</literal> rather than the whole token, eliminating the need for
3351any projection function.</para>
3352
3353      </sect2>
3354
3355      <sect2 id="sec-parser-name">
3356	<title>Parser Name</title>
3357
3358<programlisting>
3359%name &lt;Haskell identifier&gt; [ &lt;non-terminal&gt; ]
3360...
3361</programlisting>
3362	<indexterm>
3363	  <primary><literal>%name</literal></primary>
3364	</indexterm>
3365
3366	<para>(optional) The <literal>%name</literal> directive is followed by
3367        a valid Haskell identifier, and gives the name of the
3368        top-level parsing function in the generated parser.  This is
3369        the only function that needs to be exported from a parser
3370        module.</para>
3371
3372	<para>If the <literal>%name</literal> directive is omitted, it
3373        defaults to <literal>happyParse</literal>.</para>
3374	<indexterm>
3375	  <primary><function>happyParse</function></primary>
3376	</indexterm>
3377
3378	<para>The <literal>%name</literal> directive takes an optional
3379	second parameter which specifies the top-level non-terminal
3380	which is to be parsed.  If this parameter is omitted, it
3381	defaults to the first non-terminal defined in the
3382	grammar.</para>
3383
3384	<para>Multiple <literal>%name</literal> directives may be
3385	given, specifying multiple parser entry points for this
3386	grammar (see <xref linkend="sec-multiple-parsers"/>).  When
3387	multiple <literal>%name</literal> directives are given, they
3388	must all specify explicit non-terminals.</para>
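	<para>For example, a grammar with two entry points might declare
	(the names are illustrative):</para>

<programlisting>
%name parseExp  Exp
%name parseStmt Stmt
</programlisting>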
3389      </sect2>
3390
3391      <sect2 id="sec-partial-parsers">
3392	<title>Partial Parsers</title>
3393
3394<programlisting>
3395%partial &lt;Haskell identifier&gt; [ &lt;non-terminal&gt; ]
3396...
3397</programlisting>
3398	<indexterm>
3399	  <primary><literal>%partial</literal></primary>
3400	</indexterm>
3401
3402	<para>The <literal>%partial</literal> directive can be used instead of
3403	  <literal>%name</literal>.  It indicates that the generated parser
3404	  should be able to parse an initial portion of the input.  In
3405	  contrast, a parser specified with <literal>%name</literal> will only
3406	  parse the entire input.</para>
3407
3408	<para>A parser specified with <literal>%partial</literal> will stop
3409	  parsing and return a result as soon as there exists a complete parse,
3410	  and no more of the input can be parsed.  It does this by accepting
3411	  the parse if it is followed by the <literal>error</literal> token,
3412	  rather than insisting that the parse is followed by the
3413	  end of the token stream (or the <literal>eof</literal> token in the
3414	  case of a <literal>%lexer</literal> parser).</para>
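	<para>For example, a parser intended to consume just a leading
	  header from a longer token stream might be declared as (the
	  names are illustrative):</para>

<programlisting>
%partial parseHeader Header
</programlisting>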
3415      </sect2>
3416
3417      <sect2 id="sec-monad-decl">
3418	<title>Monad Directive</title>
3419
3420<programlisting>
3421%monad { &lt;type&gt; } { &lt;then&gt; } { &lt;return&gt; }
3422</programlisting>
3423	<indexterm>
3424	  <primary><literal>%monad</literal></primary>
3425	</indexterm>
3426
3427	<para>(optional) The <literal>%monad</literal> directive takes three
3428        arguments: the type constructor of the monad, the
3429        <literal>then</literal> (or <literal>bind</literal>) operation, and the
3430        <literal>return</literal> (or <literal>unit</literal>) operation.  The type
3431        constructor can be any type with kind <literal>* -&gt; *</literal>.</para>
3432
3433	<para>Monad declarations are described in more detail in <xref
3434        linkend="sec-monads"/>.</para>
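	<para>A common instance is a parser that reports errors in an
        error monad; assuming the standard <literal>Either</literal>
        type is used for that purpose, the directive could read:</para>

<programlisting>
%monad { Either String } { (&gt;&gt;=) } { return }
</programlisting>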
3435
3436      </sect2>
3437
3438      <sect2 id="sec-lexer-decl">
3439	<title>Lexical Analyser</title>
3440
3441<programlisting>
3442%lexer { &lt;lexer&gt; } { &lt;eof&gt; }
3443</programlisting>
3444	<indexterm>
3445	  <primary><literal>%lexer</literal></primary>
3446	</indexterm>
3447
3448	<para>(optional) The <literal>%lexer</literal> directive takes two
3449        arguments: <literal>&lt;lexer&gt;</literal> is the name of the lexical
3450        analyser function, and <literal>&lt;eof&gt;</literal> is a token that
3451        is to be treated as the end of file.</para>
3452
3453	<para>Lexer declarations are described in more detail in <xref
3454        linkend="sec-lexers"/>.</para>
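	<para>For example, assuming a threaded lexer function named
        <literal>lexer</literal> and an end-of-file token
        <literal>TokenEOF</literal> (both names illustrative):</para>

<programlisting>
%lexer { lexer } { TokenEOF }
</programlisting>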
3455
3456      </sect2>
3457
3458      <sect2 id="sec-prec-decls">
3459	<title>Precedence declarations</title>
3460
3461<programlisting>
3462%left     &lt;name&gt; ...
3463%right    &lt;name&gt; ...
3464%nonassoc &lt;name&gt; ...
3465</programlisting>
3466	<indexterm>
3467	  <primary><literal>%left</literal> directive</primary>
3468	</indexterm>
3469	<indexterm>
3470	  <primary><literal>%right</literal> directive</primary>
3471	</indexterm>
3472	<indexterm>
3473	  <primary><literal>%nonassoc</literal> directive</primary>
3474	</indexterm>
3475
3476	<para>These declarations are used to specify the precedences
3477	and associativity of tokens.  The precedence assigned by a
3478	<literal>%left</literal>, <literal>%right</literal> or
3479	<literal>%nonassoc</literal> declaration is defined to be
3480	higher than the precedence assigned by all declarations
3481	earlier in the file, and lower than the precedence assigned by
3482	all declarations later in the file.</para>
3483
3484	<para>The associativity of a token relative to tokens in the
3485	same <literal>%left</literal>, <literal>%right</literal>, or
3486	<literal>%nonassoc</literal> declaration is to the left, to
3487	the right, or non-associative respectively.</para>
3488
3489	<para>Precedence declarations are described in more detail in
3490	<xref linkend="sec-Precedences"/>.</para>
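	<para>For example, the usual arithmetic precedences, with
	<literal>'*'</literal> and <literal>'/'</literal> binding more
	tightly than <literal>'+'</literal> and <literal>'-'</literal>,
	and all four operators associating to the left, would be
	declared as:</para>

<programlisting>
%left '+' '-'
%left '*' '/'
</programlisting>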
3491      </sect2>
3492
3493      <sect2 id="sec-expect">
3494      	<title>Expect declarations</title>
3495<programlisting>
3496%expect &lt;number&gt;
3497</programlisting>
3498	<indexterm>
3499	  <primary><literal>%expect</literal> directive</primary>
3500	</indexterm>
3501
3502	<para>(optional) More often than not the grammar you write
3503	will have conflicts. These conflicts generate warnings. But
3504	when you have checked the warnings and made sure that Happy
	handles them correctly, these warnings are just annoying. The
3506	<literal>%expect</literal> directive gives a way of avoiding
3507	them. Declaring <literal>%expect
3508	<replaceable>n</replaceable></literal> is a way of telling
3509	Happy &ldquo;There are exactly <replaceable>n</replaceable>
3510	shift/reduce conflicts and zero reduce/reduce conflicts in
3511	this grammar. I promise I have checked them and they are
3512	resolved correctly&rdquo;.  When processing the grammar, Happy
3513	will check the actual number of conflicts against the
3514	<literal>%expect</literal> declaration if any, and if there is
3515	a discrepancy then an error will be reported.</para>
3516
3517	<para>Happy's <literal>%expect</literal> directive works
3518	exactly like that of yacc.</para>
3519      </sect2>
3520
3521      <sect2 id="sec-error-directive">
3522	<title>Error declaration</title>
3523
3524<programlisting>
3525%error { &lt;identifier&gt; }
3526</programlisting>
3527	<indexterm>
3528	  <primary><literal>%error</literal></primary>
3529	</indexterm>
3530
3531	<para>Specifies the function to be called in the event of a
3532	parse error.  The type of <literal>&lt;identifier&gt;</literal> varies
3533	depending on the presence of <literal>%lexer</literal> (see
3534	<xref linkend="sec-monad-summary" />) and <literal>%errorhandlertype</literal>
3535	(see the following).</para>
3536      </sect2>
3537
3538      <sect2 id="sec-errorhandlertype-directive">
3539	<title>Additional error information</title>
3540
3541<programlisting>
3542%errorhandlertype (explist | default)
3543</programlisting>
3544
3545	<indexterm>
3546	  <primary><literal>%errorhandlertype</literal></primary>
3547	</indexterm>
3548
	<para>(optional) The type of the user-supplied error handling function
	can be extended with additional information. By default, no information
	is added, for compatibility with previous versions. However, if
	<literal>explist</literal> is provided with this directive, then the
	first argument of the error handler will have type
	<literal>[String]</literal>, a description of the tokens that could
	have appeared in place of the offending token without causing a
	parse error.
	</para>
3557      </sect2>
3558
3559      <sect2 id="sec-attributes">
3560	<title>Attribute Type Declaration</title>
3561<programlisting>
3562%attributetype { &lt;valid Haskell type declaration&gt; }
3563</programlisting>
3564        <indexterm>
3565	  <primary><literal>%attributetype</literal> directive</primary>
3566	</indexterm>
3567
3568	<para>(optional) This directive allows you to declare the type of the
3569	attributes record when defining an attribute grammar.  If this declaration
3570	is not given, Happy will choose a default.  This declaration may only
3571	appear once in a grammar.
3572	</para>
3573	<para>
3574	  Attribute grammars are explained in <xref linkend="sec-AttributeGrammar"/>.
3575	</para>
3576      </sect2>
3577
3578      <sect2 id="sec-attribute">
3579	<title>Attribute declaration</title>
3580<programlisting>
3581%attribute &lt;Haskell identifier&gt; { &lt;valid Haskell type&gt; }
3582</programlisting>
3583        <indexterm>
3584         <primary><literal>%attribute</literal> directive</primary>
3585       </indexterm>
3586
3587       <para>The presence of one or more of these directives declares that the
3588       grammar is an attribute grammar.  The first attribute listed becomes the
3589       default attribute.  Each <literal>%attribute</literal> directive generates a
3590       field in the attributes record with the given label and type.  If there
3591       is an <literal>%attributetype</literal> declaration in the grammar which
3592       introduces type variables, then the type of an attribute may mention any
3593       such type variables.
3594       </para>
3595
3596       <para>
3597       Attribute grammars are explained in <xref linkend="sec-AttributeGrammar"/>.
3598       </para>
3599      </sect2>
3600
3601    </sect1>
3602
3603    <sect1 id="sec-grammar">
3604      <title>Grammar</title>
3605
3606      <para>The grammar section comes after the directives, separated
3607      from them by a double-percent (<literal>%%</literal>) symbol.
3608      This section contains a number of
3609      <emphasis>productions</emphasis>, each of which defines a single
3610      non-terminal.  Each production has the following syntax:</para>
3611      <indexterm>
3612	<primary><literal>%%</literal></primary>
3613      </indexterm>
3614
3615<programlisting>
3616&lt;non-terminal&gt; [ :: { &lt;type&gt; } ]
3617        :  &lt;id&gt; ... {[%] &lt;expression&gt; }
3618      [ |  &lt;id&gt; ... {[%] &lt;expression&gt; }
3619        ... ]
3620</programlisting>
3621
3622      <para>The first line gives the non-terminal to be defined by the
3623      production and optionally its type (type signatures for
3624      productions are discussed in <xref
3625      linkend="sec-type-signatures"/>).</para>
3626
3627      <para>Each production has at least one, and possibly many
3628      right-hand sides.  Each right-hand side consists of zero or more
3629      symbols (terminals or non-terminals) and a Haskell expression
3630      enclosed in braces.</para>
3631
3632      <para>The expression represents the semantic value of the
3633      non-terminal, and may refer to the semantic values of the
3634    symbols in the right-hand side using the meta-variables
3635      <literal>&dollar;1 ... &dollar;n</literal>.  It is an error to
3636      refer to <literal>&dollar;i</literal> when <literal>i</literal>
3637      is larger than the number of symbols on the right hand side of
3638      the current rule. The symbol <literal>&dollar;</literal> may be
3639      inserted literally in the Haskell expression using the sequence
3640      <literal>\&dollar;</literal> (this isn't necessary inside a
3641      string or character literal).</para>
3642
3643      <para>Additionally, the sequence <literal>&dollar;&gt;</literal>
3644      can be used to represent the value of the rightmost symbol.</para>
3645
3646      <para>A semantic value of the form <literal>{% ... }</literal> is a
3647      <emphasis>monadic action</emphasis>, and is only valid when the grammar
3648      file contains a <literal>%monad</literal> directive (<xref
3649      linkend="sec-monad-decl"/>).  Monadic actions are discussed in
3650      <xref linkend="sec-monads"/>.</para>
3651      <indexterm>
3652	<primary>monadic</primary>
3653	<secondary>action</secondary>
3654      </indexterm>
3655
3656      <para>Remember that all the expressions for a production must
3657      have the same type.</para>
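      <para>Putting this together, a production for addition might look
      like this, where <literal>&dollar;1</literal> and
      <literal>&dollar;3</literal> refer to the semantic values of
      <literal>Exp</literal> and <literal>Term</literal> respectively
      (the constructors <literal>Plus</literal> and
      <literal>Single</literal> are purely illustrative):</para>

<programlisting>
Exp   : Exp '+' Term      { Plus $1 $3 }
      | Term              { Single $1 }
</programlisting>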
3658
3659      <sect2 id="sec-param-prods">
3660        <title>Parameterized Productions</title>
3661        <para>
3662        Starting from version 1.17.1, <application>Happy</application> supports
3663        <emphasis>parameterized productions</emphasis> which provide a
3664        convenient notation for capturing recurring patterns in context free
3665        grammars. This gives the benefits of something similar to parsing
3666        combinators in the context of <application>Happy</application>
3667        grammars.
3668        </para>
3669        <para>This functionality is best illustrated with an example:
3670<programlisting>
3671opt(p)          : p                   { Just $1 }
3672                |                     { Nothing }
3673
3674rev_list1(p)    : p                   { [$1] }
3675                | rev_list1(p) p      { $2 : $1 }
3676</programlisting>
3677        The first production, <literal>opt</literal>, is used for optional
3678        components of a grammar.  It is just like <literal>p?</literal> in
3679        regular expressions or EBNF. The second production,
3680        <literal>rev_list1</literal>, is for parsing a list of 1 or more
        occurrences of <literal>p</literal>.  Parameterized productions are
        just like ordinary productions, except that they have parameters in
        parentheses after the production name. Multiple parameters should
        be separated by commas:
3685<programlisting>
3686fst(p,q)        : p q                 { $1 }
3687snd(p,q)        : p q                 { $2 }
3688both(p,q)       : p q                 { ($1,$2) }
3689</programlisting>
3690        </para>
3691
3692        <para>To use a parameterized production, we have to pass values for the
3693        parameters, as if we are calling a function.  The parameters can be
3694        either terminals, non-terminals, or other instantiations of
3695        parameterized productions.  Here are some examples:
3696<programlisting>
3697list1(p)        : rev_list1(p)        { reverse $1 }
3698list(p)         : list1(p)            { $1 }
3699                |                     { [] }
3700</programlisting>
        The first production uses <literal>rev_list1</literal> to define
3702        a production that behaves like <literal>p+</literal>, returning
3703        a list of elements in the same order as they occurred in the input.
3704        The second one, <literal>list</literal> is like <literal>p*</literal>.
3705        </para>
3706
        <para>Parameterized productions are implemented as a preprocessing
        pass in Happy:  each instantiation of a production turns into a
        separate non-terminal, but Happy is careful to avoid generating the
        same rule multiple times, as this would lead to an ambiguous grammar.
        Consider, for example, the following parameterized rule:
3712<programlisting>
3713sep1(p,q)       : p list(snd(q,p))    { $1 : $2 }
3714</programlisting>
        The rules that would be generated for <literal>sep1(EXPR,SEP)</literal> are:
3716<programlisting>
3717sep1(EXPR,SEP)
3718  : EXPR list(snd(SEP,EXPR))                { $1 : $2 }
3719
3720list(snd(SEP,EXPR))
3721  : list1(snd(SEP,EXPR))                    { $1 }
3722  |                                         { [] }
3723
3724list1(snd(SEP,EXPR))
3725  : rev_list1(snd(SEP,EXPR))                { reverse $1 }
3726
3727rev_list1(snd(SEP,EXPR))
  : snd(SEP,EXPR)                           { [$1] }
3729  | rev_list1(snd(SEP,EXPR)) snd(SEP,EXPR)  { $2 : $1 }
3730
3731snd(SEP,EXPR)
3732  : SEP EXPR                                { $2 }
3733</programlisting>
3734        Note that this is just a normal grammar, with slightly strange names
3735        for the non-terminals.
3736        </para>
3737
        <para>A drawback of the current implementation is that it does not
        support type signatures for parameterized productions that
        depend on the types of the parameters.  We plan to implement this
        in the future; the current workaround is to omit the type signatures
        for such rules.
        </para>
3744      </sect2>
3745
3746      </sect1>
3747
3748    <sect1 id="sec-module-trailer">
3749      <title>Module Trailer</title>
3750      <indexterm>
3751	<primary>module</primary>
3752	<secondary>trailer</secondary>
3753      </indexterm>
3754
3755      <para>The module trailer is optional, comes right at the end of
3756      the grammar file, and takes the same form as the module
3757      header:</para>
3758
3759<programlisting>
3760{
3761&lt;Haskell code&gt;
3762}
3763</programlisting>
3764
3765      <para>This section is used for placing auxiliary definitions
3766      that need to be in the same module as the parser.  In small
3767      parsers, it often contains a hand-written lexical analyser too.
3768      There is no restriction on what can be placed in the module
3769      trailer, and any code in there is copied verbatim into the
3770      generated parser file.</para>
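      <para>For example, a small trailer might supply a token type and
      an error function for the parser to call (all names here are
      illustrative):</para>

<programlisting>
{
data Token = TokenInt Int | TokenPlus

parseError :: [Token] -&gt; a
parseError _ = error "Parse error"
}
</programlisting>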
3771
3772      </sect1>
3773    </chapter>
3774
3775  <chapter id="sec-info-files">
3776    <title>Info Files</title>
3777    <indexterm>
3778      <primary>info files</primary>
3779    </indexterm>
3780
3781    <para>
3782      Happy info files, generated using the <literal>-i</literal> flag,
3783      are your most important tool for debugging errors in your grammar.
3784      Although they can be quite verbose, the general concept behind
3785      them is quite simple.
3786    </para>
3787
3788    <para>
3789      An info file contains the following information:
3790    </para>
3791
3792    <orderedlist>
3793      <listitem>
3794        <para>A summary of all shift/reduce and reduce/reduce
3795          conflicts in the grammar.</para>
3796      </listitem>
3797      <listitem>
        <para>Under section <literal>Grammar</literal>, a summary of all the rules in the grammar.  These rules correspond directly to your input file, absent the actual Haskell code that is to be run for each rule.  A rule is written in the form <literal>&lt;non-terminal&gt; -> &lt;id&gt; ...</literal></para>
3799      </listitem>
3800      <listitem>
        <para>Under section <literal>Terminals</literal>, a summary of all the terminal tokens the parser may encounter, as well as the Haskell patterns which match against them.  This corresponds directly to the contents of your <literal>%token</literal> directive (<xref linkend="sec-tokens"/>).</para>
3802      </listitem>
3803      <listitem>
3804        <para>Under section <literal>Non-terminals</literal>, a summary of which rules apply to which productions.  This is generally redundant with the <literal>Grammar</literal> section.</para>
3805      </listitem>
3806      <listitem>
3807        <para>The primary section <literal>States</literal>, which describes the state-machine Happy built for your grammar, and all of the transitions for each state.</para>
3808      </listitem>
3809      <listitem>
3810        <para>Finally, some statistics <literal>Grammar Totals</literal> at the end of the file.</para>
3811      </listitem>
3812    </orderedlist>
3813    <para>In general, you will be most interested in the <literal>States</literal> section, as it will give you information, in particular, about any conflicts your grammar may have.</para>

    <sect1 id="sec-info-files-states">
      <title>States</title>
      <para>Although Happy does its best to insulate you from the
        vagaries of parser generation, it's important to know a little
        about how shift-reduce parsers work in order to be able to
        interpret the entries in the <literal>States</literal>
        section.</para>

      <para>In general, a shift-reduce parser operates by maintaining
        a parse stack, onto which tokens and productions are shifted,
        and off of which they are reduced.  The parser maintains a
        state machine, which accepts a token, performs some shift or
        reduce action, and transitions to a new state ready for the
        next token.  Importantly, these states represent
        <emphasis>multiple</emphasis> possible productions, because in
        general the parser does not know what the actual production
        for the tokens it's parsing is going to be.  There's no direct
        correspondence between the state machine and the input
        grammar; this is something you have to reverse
        engineer.</para>
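
      <para>The shift-reduce discipline just described can be sketched
        in a few lines of Haskell.  This is a hand-written toy for the
        single grammar <literal>E -> E '+' int | int</literal>, not
        the table-driven engine <application>Happy</application>
        actually generates, but it shows how the parse stack and the
        shift and reduce steps interact:</para>

<programlisting>
-- A toy shift-reduce loop for the grammar  E -> E '+' int | int.
-- The stack holds both terminals and non-terminals; we reduce
-- whenever the top of the stack matches a rule, and shift otherwise.
data Sym = TInt Int | TPlus | NE Int    -- two terminals, one non-terminal
  deriving (Eq, Show)

parse :: [Sym] -> Maybe Int
parse = go []
  where
    go [NE v] []        = Just v                            -- accept
    go (TInt n : st) ts = go (NE n : st) ts                 -- reduce: E -> int
    go (NE y : TPlus : NE x : st) ts
                        = go (NE (x + y) : st) ts           -- reduce: E -> E '+' int
    go st (t : ts)      = go (t : st) ts                    -- shift
    go _ _              = Nothing                           -- parse error
</programlisting>

      <para>For example, <literal>parse [TInt 1, TPlus, TInt
        2]</literal> evaluates to <literal>Just 3</literal>, while
        <literal>parse [TPlus]</literal> is
        <literal>Nothing</literal>.</para>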

      <para>With this knowledge in mind, we can look at two example states
        from the example grammar from <xref linkend="sec-using" />:
      </para>

<programlisting>
State 5

        Exp1 -> Term .                                      (rule 5)
        Term -> Term . '*' Factor                           (rule 6)
        Term -> Term . '/' Factor                           (rule 7)

        in             reduce using rule 5
        '+'            reduce using rule 5
        '-'            reduce using rule 5
        '*'            shift, and enter state 11
        '/'            shift, and enter state 12
        ')'            reduce using rule 5
        %eof           reduce using rule 5

State 9

        Factor -> '(' . Exp ')'                             (rule 11)

        let            shift, and enter state 2
        int            shift, and enter state 7
        var            shift, and enter state 8
        '('            shift, and enter state 9

        Exp            goto state 10
        Exp1           goto state 4
        Term           goto state 5
        Factor         goto state 6
</programlisting>

      <para>For each state, the first set of lines describes the
        <emphasis>rules</emphasis> which correspond to this state.  A
        period <literal>.</literal> is inserted in each production to
        indicate how far, if this is indeed the correct production, we
        have parsed so far.  In state 5, there are multiple rules,
        so we don't know if we are parsing an <literal>Exp1</literal>, a
        multiplication or a division (however, we do know there is a
        <literal>Term</literal> on the parse stack); in state 9, there
        is only one rule, so we know we are definitely parsing a
        <literal>Factor</literal>.</para>

      <para>The next set of lines specifies the action and state
        transition that should occur given a token.  For example, if in
        state 5 we process the <literal>'*'</literal> token, this token
        is shifted onto the parse stack and we transition to the state
        corresponding to the rule <literal>Term -> Term '*' .
          Factor</literal> (matching the token has disambiguated which
        state we are now in).</para>

      <para>Finally, for states from which the parser can continue with a
        non-terminal, there is a last set of lines saying what should be
        done after that non-terminal has been fully parsed.  When a
        reduce occurs, these goto entries are used to determine what the
        next state should be.</para>
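
      <para>As a worked example, suppose the parser has just shifted a
        <literal>'('</literal> and arrived in state 9 above, with
        <literal>1 * 2</literal> remaining in the input.  Using only
        the transitions shown (the reductions performed in the
        intermediate states are inferred from the grammar, since those
        states are not listed here), the parse proceeds:</para>

<programlisting>
state 9,  lookahead int:  shift the int, and enter state 7
state 7:                  reduce using Factor -> int;
                          state 9 says: Factor goto state 6
state 6:                  reduce using Term -> Factor;
                          state 9 says: Term goto state 5
state 5,  lookahead '*':  shift, and enter state 11
</programlisting>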

      <!-- Probably could improve this section by walking through
      parsing -->

    </sect1>

    <sect1 id="sec-info-files-conflicts">
      <title>Interpreting conflicts</title>

      <para>When you have a conflict, you will see an entry like this
      in your info file:</para>

<programlisting>
State 432

        atype -> SIMPLEQUOTE '[' . comma_types0 ']'         (rule 318)
        sysdcon -> '[' . ']'                                (rule 613)

        '_'            shift, and enter state 60
        'as'           shift, and enter state 16

...

        ']'            shift, and enter state 381
                        (reduce using rule 328)

...
</programlisting>

      <para>On large, complex grammars, determining what the conflict is
        can be a bit of an art, since the state with the conflict may
        not have enough information to determine why the conflict is
        occurring.</para>

        <para>In some cases, the rules associated with the state with
          the conflict will immediately give you enough guidance to
          determine what the ambiguous syntax is.
          For example, in the miniature shift/reduce conflict
          described in <xref linkend="sec-conflict-tips" />,
          the conflict looks like this:</para>

<programlisting>
State 13

        exp -> exp . '+' exp0                               (rule 1)
        exp0 -> if exp then exp else exp .                  (rule 3)

        then           reduce using rule 3
        else           reduce using rule 3
        '+'            shift, and enter state 7
                        (reduce using rule 3)

        %eof           reduce using rule 3
</programlisting>

<para>Here, rule 3 makes it easy to imagine that we had been parsing a
  statement like <literal>if 1 then 2 else 3 + 4</literal>; the conflict
  arises from whether we should shift (thus parsing as
  <literal>if 1 then 2 else (3 + 4)</literal>) or reduce (thus parsing
  as <literal>(if 1 then 2 else 3) + 4</literal>).</para>

<para>Sometimes, there's not as much helpful context in the error message;
take this abridged example from GHC's parser:</para>

<programlisting>
State 49

        type -> btype .                                     (rule 281)
        type -> btype . '->' ctype                          (rule 284)

        '->'           shift, and enter state 472
                        (reduce using rule 281)
</programlisting>

<para>A pair of rules like this doesn't always result in a shift/reduce
  conflict: to reduce with rule 281 implies that, in some context when
  parsing the non-terminal <literal>type</literal>, it is possible for
  an <literal>'->'</literal> to occur immediately afterwards (indeed
  these source rules are factored such that there is no rule of the form
  <literal>... -> type '->' ...</literal>).</para>

<para>The best way this author knows how to sleuth this out is to
  look for instances of the token and check if any of the preceding
  non-terminals could terminate in a type:</para>

3978
3979<programlisting>
3980        texp -> exp '->' texp                              (500)
3981        exp -> infixexp '::' sigtype                       (414)
3982        sigtype -> ctype                                   (260)
3983        ctype -> type                                      (274)
3984</programlisting>

<para>As it turns out, this shift/reduce conflict results from
  ambiguity for <emphasis>view patterns</emphasis>, as in
  the code sample <literal>case v of { x :: T -&gt; T ... }</literal>.</para>

    </sect1>

  </chapter>

  <chapter id="sec-tips">
    <title>Tips</title>

    <para>This section contains a lot of accumulated lore about using
    <application>Happy</application>.</para>

    <sect1 id="sec-performance-tips">
      <title>Performance Tips</title>

      <para>How to make your parser go faster:</para>

      <itemizedlist>

	<listitem>
	  <para> If you are using GHC
          <indexterm>
	    <primary>GHC</primary>
	  </indexterm>
	  , generate parsers using the
          <literal>-a -g -c</literal> options, and compile them using GHC with
          the <literal>-fglasgow-exts</literal> option.  This is worth a
          <emphasis>lot</emphasis>, in terms of compile time,
          execution speed and binary size.<footnote><para>Omitting the
          <literal>-a</literal> option may generate slightly faster parsers,
          but they will be much bigger.</para></footnote></para>
	</listitem>

	<listitem>
	  <para> The lexical analyser is usually the most performance
          critical part of a parser, so it's worth spending some time
          optimising this.  Profiling tools are essential here.  In
          really dire circumstances, resort to some of the hacks that
          are used in the Glasgow Haskell Compiler's interface-file
          lexer.</para>
	</listitem>

	<listitem>
	  <para> Simplify the grammar as much as possible, as this
          reduces the number of states and reduction rules that need
          to be applied.</para>
	</listitem>

	<listitem>
	  <para> Use left recursion rather than right recursion
          <indexterm>
	    <primary>recursion, left vs. right</primary>
	  </indexterm>
          wherever possible.  While not strictly a performance issue,
          this affects the size of the parser stack, which is kept on
          the heap and thus needs to be garbage collected.</para>
	</listitem>

      </itemizedlist>
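
      <para>As an illustration of the last tip, a sequence of
      <literal>item</literal>s can be written either way (a
      hypothetical fragment, with the usual list-building actions).
      The left-recursive form lets the parser reduce after each
      element, so the stack stays small; the right-recursive form must
      shift the whole sequence before the first reduction:</para>

<programlisting>
-- left recursive: constant stack space, but builds the list in
-- reverse, so apply 'reverse' where the sequence is used
items : items item         { $2 : $1 }
      | {- empty -}        { [] }

-- right recursive: stack space proportional to the sequence length
items : item items         { $1 : $2 }
      | {- empty -}        { [] }
</programlisting>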


    </sect1>

    <sect1 id="sec-compilation-time">
      <title>Compilation-Time Tips</title>

      <para>We have found that compiling parsers generated by
      <application>Happy</application> can take a large amount of time and
      memory, so here are some tips on making things more manageable:</para>

      <itemizedlist>

	<listitem>
	  <para> Include as little code as possible in the module
          trailer.  This code is included verbatim in the generated
          parser, so if any of it can go in a separate module, do
          so.</para>
	</listitem>

	<listitem>
          <para> Give type signatures
	  <indexterm>
	    <primary>type</primary>
	    <secondary>signatures in grammar</secondary>
	  </indexterm>
	  for everything (see <xref
          linkend="sec-type-signatures"/>).  This is reported to improve
          things by about 50%.  If there is a type signature for every
          single non-terminal in the grammar, then <application>Happy</application>
          automatically generates type signatures for most functions
          in the parser.</para>
	</listitem>

	<listitem>
	  <para> Simplify the grammar as much as possible (this tip
          applies to just about everything).</para>
	</listitem>

	<listitem>
	  <para> Use a recent version of GHC.  Versions from 4.04
	  onwards have lower memory requirements for compiling
	  <application>Happy</application>-generated parsers.</para>
	</listitem>

	<listitem>
	  <para> Using <application>Happy</application>'s <literal>-g -a -c</literal>
	  options when generating parsers to be compiled with GHC will
	  help considerably.</para>
	</listitem>

      </itemizedlist>

    </sect1>

    <sect1 id="sec-finding-errors">
      <title>Finding Type Errors</title>

      <indexterm>
	<primary>type</primary>
	<secondary>errors, finding</secondary>
      </indexterm>

      <para>Finding type errors in grammar files is inherently
      difficult because the code for reductions is moved around before
      being placed in the parser.  We currently have no way of passing
      the original filename and line numbers to the Haskell compiler,
      so there is no alternative but to look at the parser and match
      the code to the grammar file.  An info file (generated by the
      <literal>-i</literal> option) can be helpful here.</para>

      <indexterm>
	<primary>type</primary>
	<secondary>signatures in grammar</secondary>
      </indexterm>

      <para>Type signatures sometimes help by pinning down the
      particular error to the place where the mistake is made, rather
      than half way down the file.  For each production in the grammar,
      there's a bit of code in the generated file that looks like
      this:</para>

<programlisting>
HappyAbsSyn&lt;n&gt; ( E )
</programlisting>
      <indexterm>
	<primary><literal>HappyAbsSyn</literal></primary>
      </indexterm>

      <para>where <literal>E</literal> is the Haskell expression from the
      grammar file (with <literal>&dollar;n</literal> replaced by
      <literal>happy_var_n</literal>).  If there is a type signature for this
      production, then <application>Happy</application> will have taken it into
      account when declaring the <literal>HappyAbsSyn</literal> datatype,
      and errors in <literal>E</literal> will be caught right here.  Of
      course, the error may really be caused by incorrect use of one of the
      <literal>happy_var_n</literal> variables.</para>
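
      <para>For example, a signature is attached to a non-terminal in
      the grammar file like this (a hypothetical fragment; see <xref
      linkend="sec-type-signatures"/> for details):</para>

<programlisting>
exp :: { Exp }
exp : exp '+' term         { Plus $1 $3 }
    | term                 { $1 }
</programlisting>

      <para>With the signature in place, a mistake in one of the
      actions for <literal>exp</literal> is reported at the
      corresponding <literal>HappyAbsSyn</literal> wrapper, close to
      the offending code.</para>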

      <para>(This section will contain more information as we gain
      experience with creating grammar files.  Please send us any
      helpful tips you find.)</para>

    </sect1>

    <sect1 id="sec-conflict-tips">
      <title>Conflict Tips</title>
      <indexterm>
	<primary>conflicts</primary>
      </indexterm>

      <para>Conflicts arise from ambiguities in the grammar.  That is,
      some input sequences may possess more than one parse.
      Shift/reduce conflicts are benign in the sense that they are
      easily resolved (<application>Happy</application> automatically selects the
      shift action, as this is usually the intended one).
      Reduce/reduce conflicts are more serious.  A reduce/reduce
      conflict implies that a certain sequence of tokens on the input
      can represent more than one non-terminal, and the parser is
      uncertain as to which reduction rule to use.  It will select the
      reduction rule uppermost in the grammar file, so if you really
      must have a reduce/reduce conflict you can select which rule
      will be used by putting it first in your grammar file.</para>

      <para>It is usually possible to remove conflicts from the
      grammar, but sometimes this is at the expense of clarity and
      simplicity.  Here is a cut-down example from the grammar of
      Haskell (1.2):</para>

<programlisting>
exp     : exp op exp0
        | exp0

exp0    : if exp then exp else exp
        ...
        | atom

atom    : var
        | integer
        | '(' exp ')'
        ...
</programlisting>

      <para>This grammar has a shift/reduce conflict, due to the
      following ambiguity.  In an input such as</para>

<programlisting>
if 1 then 2 else 3 + 4
</programlisting>

      <para>the grammar doesn't specify whether the parse should be</para>

<programlisting>
if 1 then 2 else (3 + 4)
</programlisting>

      <para>or</para>

<programlisting>
(if 1 then 2 else 3) + 4
</programlisting>

      <para>and the ambiguity shows up as a shift/reduce conflict on
      reading the 'op' symbol.  In this case, the first parse is the
      intended one (the 'longest parse' rule), which corresponds to
      the shift action.  Removing this conflict relies on noticing
      that the expression on the left-hand side of an infix operator
      can't be an <literal>exp0</literal> (the grammar previously said
      otherwise, but since the conflict was resolved as shift, this
      parse was not allowed).  We can reformulate the
      <literal>exp</literal> rule as:</para>

<programlisting>
exp     : atom op exp
        | exp0
</programlisting>

      <para>and this removes the conflict, but at the expense of some
      stack space while parsing (we turned a left-recursion into a
      right-recursion).  There are alternatives using left-recursion,
      but they all involve adding extra states to the parser, so most
      programmers will prefer to keep the conflict in favour of a
      clearer and more efficient parser.</para>

      <sect2 id="sec-lalr">
	<title>LALR(1) parsers</title>

	<para>There are three basic ways to build a shift-reduce
        parser.  Full LR(1) (the `L' is the direction in which the
        input is scanned, the `R' is the way in which the parse is
        built, and the `1' is the number of tokens of lookahead)
        generates a parser with many states, and is therefore large
        and slow.  SLR(1) (simple LR(1)) is a cut-down version of
        LR(1) which generates parsers with roughly one-tenth as many
        states, but lacks the power to parse many grammars (it finds
        conflicts in grammars which have none under LR(1)). </para>

	<para>LALR(1) (look-ahead LR(1)), the method used by
        <application>Happy</application> and
        <application>yacc</application>, is a tradeoff between the two.
        An LALR(1) parser has the same number of states as an SLR(1)
        parser, but it uses a more complex method to calculate the
        lookahead tokens that are valid at each point, and resolves
        many of the conflicts that SLR(1) finds.  However, there may
        still be conflicts in an LALR(1) parser that wouldn't be there
        with full LR(1).</para>

      </sect2>
    </sect1>
4255
4256    <sect1 id="sec-happy-ghci">
4257      <title>Using Happy with <application>GHCi</application></title>
4258      <indexterm><primary><application>GHCi</application></primary>
4259      </indexterm>
4260
4261      <para><application>GHCi</application>'s compilation manager
4262      doesn't understand Happy grammars, but with some creative use of
4263      macros and makefiles we can give the impression that
4264      <application>GHCi</application> is invoking Happy
4265      automatically:</para>
4266
4267      <itemizedlist>
4268	<listitem>
4269	  <para>Create a simple makefile, called
4270	  <filename>Makefile_happysrcs</filename>:</para>
4271
4272<programlisting>HAPPY = happy
4273HAPPY_OPTS =
4274
4275all: MyParser.hs
4276
4277%.hs: %.y
4278	$(HAPPY) $(HAPPY_OPTS) $&lt; -o $@</programlisting>
4279	</listitem>
4280
4281	<listitem>
4282	  <para>Create a macro in GHCi to replace the
4283          <literal>:reload</literal> command, like so (type this all
4284          on one line):</para>
4285
4286<screen>:def myreload (\_ -> System.system "make -f Makefile_happysrcs"
4287   >>= \rr -> case rr of { System.ExitSuccess -> return ":reload" ;
4288                           _ -> return "" })</screen>
4289	</listitem>
4290
4291	<listitem>
4292	  <para>Use <literal>:myreload</literal>
4293	  (<literal>:my</literal> will do) instead of
4294	  <literal>:reload</literal> (<literal>:r</literal>).</para>
4295	</listitem>
4296      </itemizedlist>
4297    </sect1>
4298
    <sect1 id="sec-monad-alex">
      <title>Basic monadic Happy use with Alex</title>
      <indexterm>
        <primary><application>Alex</application></primary>
        <secondary>monad</secondary>
      </indexterm>

      <para>
        <application>Alex</application> lexers are often used by
        <application>Happy</application> parsers, for example in
        GHC. While many of these applications are quite sophisticated,
        it is still quite useful to combine the basic
        <application>Happy</application> <literal>%monad</literal>
        directive with the <application>Alex</application>
        <literal>monad</literal> wrapper. By using monads for both,
        the resulting parser and lexer can handle errors far more
        gracefully than by throwing an exception.
      </para>

      <para>
        The most straightforward way to use a monadic
        <application>Alex</application> lexer is to simply use the
        <literal>Alex</literal> monad as the
        <application>Happy</application> monad:
      </para>

      <example><title>Lexer.x</title>
<programlisting>{
module Lexer where
}

%wrapper "monad"

tokens :-
  ...

{
data Token = ... | EOF
  deriving (Eq, Show)

alexEOF = return EOF
}</programlisting></example>
      <example><title>Parser.y</title>
<programlisting>{
module Parser where

import Lexer
}

%name pFoo
%tokentype { Token }
%error { parseError }
%monad { Alex } { >>= } { return }
%lexer { lexer } { EOF }

%token
  ...

%%
  ...

parseError :: Token -> Alex a
parseError _ = do
  ((AlexPn _ line column), _, _, _) &lt;- alexGetInput
  alexError ("parse error at line " ++ (show line) ++ ", column " ++ (show column))

lexer :: (Token -> Alex a) -> Alex a
lexer = (alexMonadScan >>=)
}</programlisting></example>

      <para>
        We can then run the finished parser in the
        <literal>Alex</literal> monad using
        <literal>runAlex</literal>, which returns an
        <literal>Either</literal> value rather than throwing an
        exception in case of a parse or lexical error:
      </para>

<programlisting>
import qualified Lexer as Lexer
import qualified Parser as Parser

parseFoo :: String -> Either String Foo
parseFoo s = Lexer.runAlex s Parser.pFoo
</programlisting>

    </sect1>
  </chapter>
  <index/>
</book>
