1% Test cases for the parser generator. This all runs in symbolic mode...
2
3
4%
5% This is where (for now) I will put documentation of the syntax I
6% will use when creating a grammer. There is a main function called
7% lalr_create_parser and that is passed a list that describes
8% a grammar. It is in the form of a sequence of productions, and the first
9% one given is taken to be the top-level target.
10%
11% Each production is in the form
12%     (LHS   ((rhs1.1 rhs1.2 ...) a1.1 a1.2 ...)
13%            ((rhs2.1 rhs2.1 ...) a2.1 a2.2 ...)
14%            ...)
15% which in regular publication style for grammars might be interpreted
16% as meaning
17%      LHS ::= rhs1.1 rhs1.2 ... { a1.1 a1.2 ... }
18%          |   rhs2.1 rhs2.2 ... { a2.1 a2.2 ... }
19%          ...
20%          ;
21%
22% Each LHS is treated as a non-terminal symbol and is specified as a simple
23% name. Note that by default the Reduce parser will be folding characters
24% within names to lower case and so it will be best to choose names for
25% non-terminals that are unambiguous even when case-folded, but I would like
26% to establish a convention that in source code they are written in capitals.
27%
28% The RHS items may be either non-terminals (identified because they are
29% present in the left hand side of some production) or terminals. Terminal
30% symbols can be specified in two different ways.
31%
32% Semantic actions must be given in Lisp, and the variables !$1, !$2, ...
33% refer to the components from the matched right hand side. The final
34% item in the semantic action is used as the result (and hence gets built
35% up into the parse tree). If an explicit action is not given the
36% behaviour is as if "(list !$1 !$2 ... !$n)" had been given, where the
37% number of items is the number of terms in the pattern -- with a special
38% case if there is only one item in which case the implicit action is
39% just "!$1".
40%
41% The default lexer has built-in recipies that decode certain sequences
42% of characters and return the special markers for !:symbol, !:number,
43% !:string, !:list for commonly used cases. In these cases the variable
44% yylval gets left set to associated data, so for instance in the case of
45% !:symbol it gets set to the particular symbol concerned.
46% The token type :list is used for Lisp or rlisp-like notation where the
47% input contains
48%     'expression
49% or  `expression
50% so for instance the input `(a b c) leads to the lexer returning !:list and
51% yylvel being set to (backquote (a b c)). This treatment is specialised for
52% handling rlisp-like syntax.
53%
54% Other terminals are indicated by writing a string. That may either
55% consist of characters that would otherwise form a symbol (ie a letter
56% followed by letters, digits and underscores) or a sequence of
57% non-alphanumeric characters. In the latter case if a sequence of three or
58% more punctuation marks make up a terminal then all the shorter prefixes
59% of it will also be grouped to form single entities. So if "<-->" is a
60% terminal then '<', '<-' and '<--' will each by parsed as single tokens, and
61% any of them that are not used as terminals will be classified as !:symbol.
62%
63% A non-default lexer needs to be hand-written and can be installed
64% for a grammar by using "set!-lexer". See examples.
65%
66% As well as terminals and non-terminals (which are written as symbols or
67% strings) it is possible to write one of
68%     (OPT s1 s2 ...)           0 or 1 instances of the sequence s1, ...
69%     (STAR s1 s2 ...)          0, 1, 2, ... instances
70%     (PLUS s1 s2 ...)          1, 2, 3, ... instances
71%     (LIST sep s1 s2 ...)      like (STAR s1 s2 ...) but with the single
72%                               item sep between each instance.
73%     (LISTPLUS sep s1 ...)     like (PLUS s2 ...) but with sep interleaved.
74%     (OR s1 s2 ...)            one or other of the tokens shown.
75%
76% When the lexer processes input it will return a numeric code that identifies
77% the type of the item seen, so in a production one might write
78%     (!:symbol ":=" EXPRESSION)
79% and as it recognises the first two tokens the lexer will return a numeric
80% code for !:symbol (and set yylval to the actual symbol as seen) and then
81% a numeric code that it allocates for ":=". In the latter case it will
82% also set yylval to the symbol !:!= in case that is useful.
83%
84% Precedence can be set using lalr_precedence. See examples lower down in this
85% file.
86
87% Some of the limitations are:
88% (0) Testing has been pretty minimal so far, with the examples in this
89%     file the main ones used. Well testing is ongoing with a grammar
90%     for SML, and so things are improving on that front.
91% (1) The lexer is hand-written and hence somewhat inflexible. It does
92%     provide a set of option that can be used to tune it for (broad)
93%     compatibility with several commonly-used languages. Probably some
94%     people would appreciate something like "lex" or "flex" to make
95%     lexer generationm automatic?
96% (2) The issue of when input should end is not properly though through,
97%     and so it is easy to end up with a syntax error reported when the
98%     parser reaches what you had wanted to be the end of your input.
99% (3) Diagnostics if you give a malformed grammar are weak.
100% (4) The input format is perhaps crude, and in particular semantic actions
101%     have to be given as raw Lisp code.
102% (5) Diagnostics when you present malformed input to the parser are
103%     not that good either...
104% (6) The internals of the code use association lists and I *believe* that
105%     calls to assoc, rassoc and related functions represent bottlenecks
106%     such that processing could be made somewhat faster.
107%
108% I have listed the above in the order that I think I need to work on them.
109
110symbolic;
111load_package lalr;
112
113% Before testing parser generation I will demonstrate the lexer..
114% If I was jumpy about the exact behaviour of the lexer I could go
115%               on tracelex;
116% to get some more tracing.
117
118lex_cleanup();
119
120lex_keywords '("begin" "<=>" "<==");
121
122% The output from this is expected to be
123
124%  Result: (2 symbol)
125%  Result: (4 200)
126%  Result: (4 3.542)
127%  Result: (3 "a string")
128%  Result: (5 (quote (quoted lisp)))
129%  Result: (5 (backquote (backquoted (!, comma) (!,!@ comma_at))))
130%  Result: (2 !+)
131%  Result: (7 !<!=!>)
132%  Result: (2 !-)
133%  Result: (2 !=)
134%  Result: (2 !>)
135%  Result: (9 !<)
136%  Result: (8 !<!=)
137%  Result: (5 begin)
138%  Result: (2 !;)
139%  Result: (2 !;)
140%
141%  nil
142
143% The row of "; ; ;" at the end provides some protection so that
144% if faults in the lexer were to cause it to read more or less than it ought
145% to then what is left over is reasonably likely to remain as valid rlisp
146% syntax and so the rest of this test file will be able to continue happily.
147
148
149begin
150  scalar tt;
151  off echo;
152  lex_init();
153  for i := 1:16 do <<
154    tt := yylex();
155    if not zerop posn() then terpri();
156    princ "Result: ";
157    print list(tt, yylval) >>;
158  on echo
159end;
160
161symbol
162200
1633.542
164"a string"
165'(quoted lisp)
166`(backquoted ,comma ,@comma_at)
167+
168<=>
169-
170=>
171<
172<=
173begin
174; ; ;
175
176
177% I will now illustrate the various lexing styles that are available:
178
179symbolic procedure demonstrate_lexer style;
180  begin
181    scalar tt, r;
182    lex_init();
183    lex_keywords '(">=");
184% This sets one of the "lexer styles".
185    lexer_style!* := style;
186    while << tt := yylex(); yylval neq '!; >> do r := (tt . yylval) . r;
187    for each x in reverse r do <<
188% CSL and PSL do not put "!" escapes in exactly the same places when
189% printing symbols with an underscore in the name - and thay also
190% disagree about where to break lines to respect linelength. The
191% function portable_print exists to provide consistent output in places
192% like here.
193      if not zerop posn() then terpri();
194      portable_print x >>
195  end;
196
197% I will read exactly the same sequence of characters using various lexer
198% styles so that token syntax, string treatment and comments show up.
199
200demonstrate_lexer lexer_style_rlisp;
201% Rlisp
202# script comment test $
203// C line
204/* C block /* no_nesting */ outside */
205(* SML (* nesting *) inside *)
206'x'
207"strings \" twice \" escapes & embedded "" quotes"
208quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b
209>=
210>>
2110x10000
212#if nil
213conditional
214#endif
215;;;
216
217demonstrate_lexer lexer_style_C;
218% Rlisp
219# script comment test $
220// C line
221/* C block /* no_nesting */ outside */
222(* SML (* nesting *) inside *)
223'x'
224"strings \" twice \" escapes & embedded "" quotes"
225quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b
226>=
227>>
2280x10000
229#if nil
230conditional
231#endif
232;;;
233
234demonstrate_lexer lexer_style_SML;
235% Rlisp
236# script comment test $
237// C line
238/* C block /* no_nesting */ outside */
239(* SML (* nesting *) inside *)
240'x'
241"strings \" twice \" escapes & embedded "" quotes"
242quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b
243>=
244>>
2450x10000
246#if nil
247conditional
248#endif
249;;;
250
251demonstrate_lexer lexer_style_script;
252% Rlisp
253# script comment test $
254// C line
255/* C block /* no_nesting */ outside */
256(* SML (* nesting *) inside *)
257'x'
258"strings \" twice \" escapes & embedded "" quotes"
259quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b
260>=
261>>
2620x10000
263#if nil
264conditional
265#endif
266;;;
267
268
269lexer_style!* := lexer_style_rlisp;
270
271on lalr_verbose;
272
273% Here I set up a sample grammar
274%    S' -> S
275%    S  -> C C        { }
276%    C  -> "c" C      { }
277%        | "d"        { }
278% This is example 4.42 from Aho, Sethi and Ullman's Red Dragon book.
279% It is example 4.54 in the more recent Purple book.
280
281% Note that this grammar will introduce "c" and "d" as keywords rather than
282% being general symbols. When I construct a subsequent grammar that will
283% undo that setting. I will omit semantic actions here so that the default
284% action of building a form of tree is used.
285
286% There seems to be an issue here in that if I use a name for a non-terminal
287% that is the same as one used for a terminal thinsg get confused. So I had
288% originally hoped and expected to write just 'c' for the name of the
289% non-terminal here are use '"c"' to denote the terminal. However when I
290% try that I find that internally the names and up clashing to rather bad
291% effect. I will look into this later since it is merely a matter of surface
292% notation!  Also in the input to semantic actions the values passed when a
293% terminal is matched may at present be the internal numeric code allocated
294% to that terminal, not a "sensible" value. Agian this is a small issue.
295
296grammar := '(
297  (s  ((cc cc)  )   % One production for S, no explicit semantic here
298  )
299  (cc (("c" cc) (list 'c !$2))   % First production for C
300      (("d")    'd           ))  % Second production for C
301% I put in a redundant (unused) clause here to illustrate that it is
302% detected and reported.
303  (redundant (()))
304  )$
305
306g := lalr_create_parser(nil, grammar)$
307
308symbolic procedure pparse g;
309  begin
310    scalar r;
311    r := yyparse g;
312    terpri();
313    princ "= ";
314    portable_print r
315  end;
316
317pparse g$
318
319c c c d c d ;
320
321pparse g$
322
323d d ;
324
325
326% Now switch off the tracing. It is useful while debugging this
327% package but is typically rather over the top for normal use.
328
329off tracelex, lalr_verbose;
330
331% Example 4.46 from the Red Dragon (4.61 in Aho, Lam, Sethi and Ullman,
332% "Compilers: principles, techniques and tools", second edition 2007).
333%
334% This is used there as an example of a grammar that is not SLR(1) but
335% that can be handled by LALR .
336
337% The semantic actions here contain print statements that will
338% print some sort of trace as the parsing progresses.
339
340symbolic procedure neatprintc x;
341 << if not zerop posn() then terpri();
342    printc x >>;
343
344g4_46 := '((s   ((l "=" r)   (neatprintc "## S => L = R")
345                             (list 'equal !$1 !$3))
346                ((r)         (neatprintc "## S => R")
347                             !$1))
348           (l   (("*" r)     (neatprintc "## L => * R")
349                             (list 'star !$2))
350                ((!:symbol)  (neatprintc "## L => symbol")
351                             !$1))
352           (r   ((l)         (neatprintc "## R => L")
353                             !$1)))$
354
355g := lalr_create_parser(nil, g4_46)$
356
357pparse g$
358
359leftsym = rightsym ;
360
361
362pparse g$
363
364****abc = *def ;
365
366% This next example is expected to be reasonably representative of
367% small grammars. It needs precedence rules to disambiguate the
368% grammar, and illustrates both left and right associativity, and
369% cases where two operators have the same precedence.
370
371gtest := '((s  ((p))
372               ((s "^" s) (list 'expt !$1 !$3))
373               ((s "**" s) (list 'expt !$1 !$3))
374               ((s "*" s) (list 'times !$1 !$3))
375               ((s "/" s) (list 'quotient !$1 !$3))
376               ((s "+" s) (list 'plus !$1 !$3))
377               ((s "-" s) (list 'difference !$1 !$3))
378               ((s "=" s) (list 'equal !$1 !$3))
379               (("-" s) (list 'minus !$2))
380               (("+" s) !$2))
381
382           (p  (("(" s ")") !$2)
383               ((!:symbol))
384               ((!:string))
385               ((!:number))))$
386
387% "^" and "**" both have the same high precedence and are right
388% associative. Next come "*" and "/" which are left associative,
389% and after that "+" and "-". Finally "=" has lowest precedence and
390% must not associate with itself, so (a=b=c) should be a syntax error.
391
392p := '(!:right ("^" "**") !:left ("*" "/") ("+" "-") !:none "=")$
393
394g := lalr_create_parser(p, gtest)$
395
396pparse g$
397a^b^c;
398
399pparse g$
400a*b+c*d;
401
402pparse g$
403a * (b/c + d/e/f) ^ 2 ^ g - "str" ;
404
405% Demonstrate various of the short-hand notations...
406
407g := lalr_create_parser(nil, '(
408 (s
409% (opt ...) means that the included material is optional.
410          (("begin" (opt "and" "also") "end")))))$
411
412pparse g$
413begin end
414;;
415
416pparse g$
417begin and also end
418;;
419
420g := lalr_create_parser(nil, '(
421 (s
422% (star ...) is for zero or mor instances of something.
423          (((star "a") "end") !$1))))$
424
425pparse g$
426end
427;;
428
429pparse g$
430a a a a a a end
431;;
432
433g := lalr_create_parser(nil, '(
434 (s
435% (plus ...) is for one or more repetitions of an item
436          (((plus "a") "end") !$1))))$
437
438pparse g$
439a end
440;;
441
442pparse g$
443a a a a a a end
444;;
445
446g := lalr_create_parser(nil, '(
447 (s
448% (list delimiter item-description) is a sequence of zero
449% or more items, and if there are several that are separated by the
450% indicated delimiter.
451          (((list ";" !:symbol) "eof") !$1))))$
452
453pparse g$
454
455several ; words ; here eof
456;;
457
458g := lalr_create_parser(nil, '(
459 (s
460% (listplus delimiter item-description) is as (list ...) however it
461% requires at least one item.
462          (((listplus ";" !:symbol) "eof") !$1))))$
463
464pparse g$
465
466several ; words ; here eof
467;;
468
469g := lalr_create_parser(nil, '(
470 (s
471% (or x y z) may be a more compact way of writing what could
472% otherwise by given as multiple productions, so for instance
473% (or "+" "-" "*" "/") would match one of the listed operators.
474          (((star (or "a" "b")) "end") !$1))))$
475
476pparse g$
477end
478;;
479
480pparse g$
481a b b a end
482;;
483
484% The next example shows all the above put together to parse what is
485% in effect a small programming language. Although it is not yet a large
486% grammar it illustrates painfull clearly how poor performange of the
487% parser generator can be if ut used what Aho, Sethi and Ullman describe as
488% the "Easy but space-consuming LALR table construction" method.
489
490p := '(!:left ("*" "/")
491              ("+" "-")
492       !:none ("<" "<=" "==" "neq" ">=" ">")
493       !:right ":=" "="
494       !:left ("then" "else" "return"))$
495
496mini_language := '(
497 (program
498          (((listplus ";" expression) "eof") !$1))
499
500 (expression
501          ((funcall))
502          ((expression "*" expression) (list 'times !$1 !$3))
503          ((expression "/" expression) (list 'quotient !$1 !$3))
504          ((expression "+" expression) (list 'plus !$1 !$3))
505          ((expression "-" expression) (list 'difference !$1 !$3))
506          ((expression "<" expression) (list 'lessp !$1 !$3))
507          ((expression "<=" expression) (list 'lesseq !$1 !$3))
508          ((expression "==" expression) (list 'equals !$1 !$3))
509          ((expression "neq" expression) (list 'neq !$1 !$3))
510          ((expression "=>" expression) (list 'geq !$1 !$3))
511          ((expression ">" expression) (list 'greaterp !$1 !$3))
512          ((expression ":=" expression) (list 'setq !$1 !$3))
513          (("fun" funcall "=" expression) (list 'fun !$2 !$4))
514          (("if" sequence "then" expression)
515             (list 'cond (list !$2 !$4)))
516          (("if" sequence "then" sequence "else" expression)
517             (list 'cond (list !$2 !$4) (list t !$6)))
518          (("go" (opt "to") !:symbol) (list 'go !$3))
519          (("goto" !:symbol) (list 'go !$2))
520          (("return" expression)))
521
522(funcall
523          ((closedexpression))
524          ((funcall closedexpression)))
525
526(closedexpression
527          ((!:symbol))
528          ((!:number))
529          (((plus !:string))) % Several strings in a row just concatenate
530          (("let" sequence "in" sequence "end") (list 'letstat !$2 !$4))
531          (("(" exprlist ")") (cons 'paren !$2))
532          (("(" sequence ")") (cons 'paren !$2))
533          (("[" exprlist "]") (cons 'bracket !$2)))
534
535(exprlist (((list "," expression))))
536
537(sequence
538          (((list ";" expression)))))$
539
540% The grammar shown here used to fail for lack of space. It should now
541% behave.
542% One issue it reveals at right now is that the processing here does not
543% keep the types of terminal symbols under control carefully enough, so the
544% numeric values 22 and 33 in the sample text get transliterated back into
545% terminal symbols that happen to have ended up allocated numeric codes
546% 22 and 33. This NEEDS fixing, but is not really an issue for the
547% code that manufactures parsing tables.
548%
549%                                                       ACN   May 2016
550
551on tracelex, lalr_verbose;
552
553g := lalr_create_parser(p, mini_language)$
554
555pparse g$
556fun f(a,b) = a + b;
557f(22,33)
558eof
559
560end;
561
562