1% Test cases for the parser generator. This all runs in symbolic mode... 2 3 4% 5% This is where (for now) I will put documentation of the syntax I 6% will use when creating a grammer. There is a main function called 7% lalr_create_parser and that is passed a list that describes 8% a grammar. It is in the form of a sequence of productions, and the first 9% one given is taken to be the top-level target. 10% 11% Each production is in the form 12% (LHS ((rhs1.1 rhs1.2 ...) a1.1 a1.2 ...) 13% ((rhs2.1 rhs2.1 ...) a2.1 a2.2 ...) 14% ...) 15% which in regular publication style for grammars might be interpreted 16% as meaning 17% LHS ::= rhs1.1 rhs1.2 ... { a1.1 a1.2 ... } 18% | rhs2.1 rhs2.2 ... { a2.1 a2.2 ... } 19% ... 20% ; 21% 22% Each LHS is treated as a non-terminal symbol and is specified as a simple 23% name. Note that by default the Reduce parser will be folding characters 24% within names to lower case and so it will be best to choose names for 25% non-terminals that are unambiguous even when case-folded, but I would like 26% to establish a convention that in source code they are written in capitals. 27% 28% The RHS items may be either non-terminals (identified because they are 29% present in the left hand side of some production) or terminals. Terminal 30% symbols can be specified in two different ways. 31% 32% Semantic actions must be given in Lisp, and the variables !$1, !$2, ... 33% refer to the components from the matched right hand side. The final 34% item in the semantic action is used as the result (and hence gets built 35% up into the parse tree). If an explicit action is not given the 36% behaviour is as if "(list !$1 !$2 ... !$n)" had been given, where the 37% number of items is the number of terms in the pattern -- with a special 38% case if there is only one item in which case the implicit action is 39% just "!$1". 40% 41% The default lexer has built-in recipies that decode certain sequences 42% of characters and return the special markers for !:symbol, !:number, 43% !:string, !:list for commonly used cases. In these cases the variable 44% yylval gets left set to associated data, so for instance in the case of 45% !:symbol it gets set to the particular symbol concerned. 46% The token type :list is used for Lisp or rlisp-like notation where the 47% input contains 48% 'expression 49% or `expression 50% so for instance the input `(a b c) leads to the lexer returning !:list and 51% yylvel being set to (backquote (a b c)). This treatment is specialised for 52% handling rlisp-like syntax. 53% 54% Other terminals are indicated by writing a string. That may either 55% consist of characters that would otherwise form a symbol (ie a letter 56% followed by letters, digits and underscores) or a sequence of 57% non-alphanumeric characters. In the latter case if a sequence of three or 58% more punctuation marks make up a terminal then all the shorter prefixes 59% of it will also be grouped to form single entities. So if "<-->" is a 60% terminal then '<', '<-' and '<--' will each by parsed as single tokens, and 61% any of them that are not used as terminals will be classified as !:symbol. 62% 63% A non-default lexer needs to be hand-written and can be installed 64% for a grammar by using "set!-lexer". See examples. 65% 66% As well as terminals and non-terminals (which are written as symbols or 67% strings) it is possible to write one of 68% (OPT s1 s2 ...) 0 or 1 instances of the sequence s1, ... 69% (STAR s1 s2 ...) 0, 1, 2, ... instances 70% (PLUS s1 s2 ...) 1, 2, 3, ... instances 71% (LIST sep s1 s2 ...) like (STAR s1 s2 ...) but with the single 72% item sep between each instance. 73% (LISTPLUS sep s1 ...) like (PLUS s2 ...) but with sep interleaved. 74% (OR s1 s2 ...) one or other of the tokens shown. 75% 76% When the lexer processes input it will return a numeric code that identifies 77% the type of the item seen, so in a production one might write 78% (!:symbol ":=" EXPRESSION) 79% and as it recognises the first two tokens the lexer will return a numeric 80% code for !:symbol (and set yylval to the actual symbol as seen) and then 81% a numeric code that it allocates for ":=". In the latter case it will 82% also set yylval to the symbol !:!= in case that is useful. 83% 84% Precedence can be set using lalr_precedence. See examples lower down in this 85% file. 86 87% Some of the limitations are: 88% (0) Testing has been pretty minimal so far, with the examples in this 89% file the main ones used. Well testing is ongoing with a grammar 90% for SML, and so things are improving on that front. 91% (1) The lexer is hand-written and hence somewhat inflexible. It does 92% provide a set of option that can be used to tune it for (broad) 93% compatibility with several commonly-used languages. Probably some 94% people would appreciate something like "lex" or "flex" to make 95% lexer generationm automatic? 96% (2) The issue of when input should end is not properly though through, 97% and so it is easy to end up with a syntax error reported when the 98% parser reaches what you had wanted to be the end of your input. 99% (3) Diagnostics if you give a malformed grammar are weak. 100% (4) The input format is perhaps crude, and in particular semantic actions 101% have to be given as raw Lisp code. 102% (5) Diagnostics when you present malformed input to the parser are 103% not that good either... 104% (6) The internals of the code use association lists and I *believe* that 105% calls to assoc, rassoc and related functions represent bottlenecks 106% such that processing could be made somewhat faster. 107% 108% I have listed the above in the order that I think I need to work on them. 109 110symbolic; 111load_package lalr; 112 113% Before testing parser generation I will demonstrate the lexer.. 114% If I was jumpy about the exact behaviour of the lexer I could go 115% on tracelex; 116% to get some more tracing. 117 118lex_cleanup(); 119 120lex_keywords '("begin" "<=>" "<=="); 121 122% The output from this is expected to be 123 124% Result: (2 symbol) 125% Result: (4 200) 126% Result: (4 3.542) 127% Result: (3 "a string") 128% Result: (5 (quote (quoted lisp))) 129% Result: (5 (backquote (backquoted (!, comma) (!,!@ comma_at)))) 130% Result: (2 !+) 131% Result: (7 !<!=!>) 132% Result: (2 !-) 133% Result: (2 !=) 134% Result: (2 !>) 135% Result: (9 !<) 136% Result: (8 !<!=) 137% Result: (5 begin) 138% Result: (2 !;) 139% Result: (2 !;) 140% 141% nil 142 143% The row of "; ; ;" at the end provides some protection so that 144% if faults in the lexer were to cause it to read more or less than it ought 145% to then what is left over is reasonably likely to remain as valid rlisp 146% syntax and so the rest of this test file will be able to continue happily. 147 148 149begin 150 scalar tt; 151 off echo; 152 lex_init(); 153 for i := 1:16 do << 154 tt := yylex(); 155 if not zerop posn() then terpri(); 156 princ "Result: "; 157 print list(tt, yylval) >>; 158 on echo 159end; 160 161symbol 162200 1633.542 164"a string" 165'(quoted lisp) 166`(backquoted ,comma ,@comma_at) 167+ 168<=> 169- 170=> 171< 172<= 173begin 174; ; ; 175 176 177% I will now illustrate the various lexing styles that are available: 178 179symbolic procedure demonstrate_lexer style; 180 begin 181 scalar tt, r; 182 lex_init(); 183 lex_keywords '(">="); 184% This sets one of the "lexer styles". 185 lexer_style!* := style; 186 while << tt := yylex(); yylval neq '!; >> do r := (tt . yylval) . r; 187 for each x in reverse r do << 188% CSL and PSL do not put "!" escapes in exactly the same places when 189% printing symbols with an underscore in the name - and thay also 190% disagree about where to break lines to respect linelength. The 191% function portable_print exists to provide consistent output in places 192% like here. 193 if not zerop posn() then terpri(); 194 portable_print x >> 195 end; 196 197% I will read exactly the same sequence of characters using various lexer 198% styles so that token syntax, string treatment and comments show up. 199 200demonstrate_lexer lexer_style_rlisp; 201% Rlisp 202# script comment test $ 203// C line 204/* C block /* no_nesting */ outside */ 205(* SML (* nesting *) inside *) 206'x' 207"strings \" twice \" escapes & embedded "" quotes" 208quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b 209>= 210>> 2110x10000 212#if nil 213conditional 214#endif 215;;; 216 217demonstrate_lexer lexer_style_C; 218% Rlisp 219# script comment test $ 220// C line 221/* C block /* no_nesting */ outside */ 222(* SML (* nesting *) inside *) 223'x' 224"strings \" twice \" escapes & embedded "" quotes" 225quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b 226>= 227>> 2280x10000 229#if nil 230conditional 231#endif 232;;; 233 234demonstrate_lexer lexer_style_SML; 235% Rlisp 236# script comment test $ 237// C line 238/* C block /* no_nesting */ outside */ 239(* SML (* nesting *) inside *) 240'x' 241"strings \" twice \" escapes & embedded "" quotes" 242quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b 243>= 244>> 2450x10000 246#if nil 247conditional 248#endif 249;;; 250 251demonstrate_lexer lexer_style_script; 252% Rlisp 253# script comment test $ 254// C line 255/* C block /* no_nesting */ outside */ 256(* SML (* nesting *) inside *) 257'x' 258"strings \" twice \" escapes & embedded "" quotes" 259quote'-'chars :=: !!!! _ _ABC mixed!Case a!-b 260>= 261>> 2620x10000 263#if nil 264conditional 265#endif 266;;; 267 268 269lexer_style!* := lexer_style_rlisp; 270 271on lalr_verbose; 272 273% Here I set up a sample grammar 274% S' -> S 275% S -> C C { } 276% C -> "c" C { } 277% | "d" { } 278% This is example 4.42 from Aho, Sethi and Ullman's Red Dragon book. 279% It is example 4.54 in the more recent Purple book. 280 281% Note that this grammar will introduce "c" and "d" as keywords rather than 282% being general symbols. When I construct a subsequent grammar that will 283% undo that setting. I will omit semantic actions here so that the default 284% action of building a form of tree is used. 285 286% There seems to be an issue here in that if I use a name for a non-terminal 287% that is the same as one used for a terminal thinsg get confused. So I had 288% originally hoped and expected to write just 'c' for the name of the 289% non-terminal here are use '"c"' to denote the terminal. However when I 290% try that I find that internally the names and up clashing to rather bad 291% effect. I will look into this later since it is merely a matter of surface 292% notation! Also in the input to semantic actions the values passed when a 293% terminal is matched may at present be the internal numeric code allocated 294% to that terminal, not a "sensible" value. Agian this is a small issue. 295 296grammar := '( 297 (s ((cc cc) ) % One production for S, no explicit semantic here 298 ) 299 (cc (("c" cc) (list 'c !$2)) % First production for C 300 (("d") 'd )) % Second production for C 301% I put in a redundant (unused) clause here to illustrate that it is 302% detected and reported. 303 (redundant (())) 304 )$ 305 306g := lalr_create_parser(nil, grammar)$ 307 308symbolic procedure pparse g; 309 begin 310 scalar r; 311 r := yyparse g; 312 terpri(); 313 princ "= "; 314 portable_print r 315 end; 316 317pparse g$ 318 319c c c d c d ; 320 321pparse g$ 322 323d d ; 324 325 326% Now switch off the tracing. It is useful while debugging this 327% package but is typically rather over the top for normal use. 328 329off tracelex, lalr_verbose; 330 331% Example 4.46 from the Red Dragon (4.61 in Aho, Lam, Sethi and Ullman, 332% "Compilers: principles, techniques and tools", second edition 2007). 333% 334% This is used there as an example of a grammar that is not SLR(1) but 335% that can be handled by LALR . 336 337% The semantic actions here contain print statements that will 338% print some sort of trace as the parsing progresses. 339 340symbolic procedure neatprintc x; 341 << if not zerop posn() then terpri(); 342 printc x >>; 343 344g4_46 := '((s ((l "=" r) (neatprintc "## S => L = R") 345 (list 'equal !$1 !$3)) 346 ((r) (neatprintc "## S => R") 347 !$1)) 348 (l (("*" r) (neatprintc "## L => * R") 349 (list 'star !$2)) 350 ((!:symbol) (neatprintc "## L => symbol") 351 !$1)) 352 (r ((l) (neatprintc "## R => L") 353 !$1)))$ 354 355g := lalr_create_parser(nil, g4_46)$ 356 357pparse g$ 358 359leftsym = rightsym ; 360 361 362pparse g$ 363 364****abc = *def ; 365 366% This next example is expected to be reasonably representative of 367% small grammars. It needs precedence rules to disambiguate the 368% grammar, and illustrates both left and right associativity, and 369% cases where two operators have the same precedence. 370 371gtest := '((s ((p)) 372 ((s "^" s) (list 'expt !$1 !$3)) 373 ((s "**" s) (list 'expt !$1 !$3)) 374 ((s "*" s) (list 'times !$1 !$3)) 375 ((s "/" s) (list 'quotient !$1 !$3)) 376 ((s "+" s) (list 'plus !$1 !$3)) 377 ((s "-" s) (list 'difference !$1 !$3)) 378 ((s "=" s) (list 'equal !$1 !$3)) 379 (("-" s) (list 'minus !$2)) 380 (("+" s) !$2)) 381 382 (p (("(" s ")") !$2) 383 ((!:symbol)) 384 ((!:string)) 385 ((!:number))))$ 386 387% "^" and "**" both have the same high precedence and are right 388% associative. Next come "*" and "/" which are left associative, 389% and after that "+" and "-". Finally "=" has lowest precedence and 390% must not associate with itself, so (a=b=c) should be a syntax error. 391 392p := '(!:right ("^" "**") !:left ("*" "/") ("+" "-") !:none "=")$ 393 394g := lalr_create_parser(p, gtest)$ 395 396pparse g$ 397a^b^c; 398 399pparse g$ 400a*b+c*d; 401 402pparse g$ 403a * (b/c + d/e/f) ^ 2 ^ g - "str" ; 404 405% Demonstrate various of the short-hand notations... 406 407g := lalr_create_parser(nil, '( 408 (s 409% (opt ...) means that the included material is optional. 410 (("begin" (opt "and" "also") "end")))))$ 411 412pparse g$ 413begin end 414;; 415 416pparse g$ 417begin and also end 418;; 419 420g := lalr_create_parser(nil, '( 421 (s 422% (star ...) is for zero or mor instances of something. 423 (((star "a") "end") !$1))))$ 424 425pparse g$ 426end 427;; 428 429pparse g$ 430a a a a a a end 431;; 432 433g := lalr_create_parser(nil, '( 434 (s 435% (plus ...) is for one or more repetitions of an item 436 (((plus "a") "end") !$1))))$ 437 438pparse g$ 439a end 440;; 441 442pparse g$ 443a a a a a a end 444;; 445 446g := lalr_create_parser(nil, '( 447 (s 448% (list delimiter item-description) is a sequence of zero 449% or more items, and if there are several that are separated by the 450% indicated delimiter. 451 (((list ";" !:symbol) "eof") !$1))))$ 452 453pparse g$ 454 455several ; words ; here eof 456;; 457 458g := lalr_create_parser(nil, '( 459 (s 460% (listplus delimiter item-description) is as (list ...) however it 461% requires at least one item. 462 (((listplus ";" !:symbol) "eof") !$1))))$ 463 464pparse g$ 465 466several ; words ; here eof 467;; 468 469g := lalr_create_parser(nil, '( 470 (s 471% (or x y z) may be a more compact way of writing what could 472% otherwise by given as multiple productions, so for instance 473% (or "+" "-" "*" "/") would match one of the listed operators. 474 (((star (or "a" "b")) "end") !$1))))$ 475 476pparse g$ 477end 478;; 479 480pparse g$ 481a b b a end 482;; 483 484% The next example shows all the above put together to parse what is 485% in effect a small programming language. Although it is not yet a large 486% grammar it illustrates painfull clearly how poor performange of the 487% parser generator can be if ut used what Aho, Sethi and Ullman describe as 488% the "Easy but space-consuming LALR table construction" method. 489 490p := '(!:left ("*" "/") 491 ("+" "-") 492 !:none ("<" "<=" "==" "neq" ">=" ">") 493 !:right ":=" "=" 494 !:left ("then" "else" "return"))$ 495 496mini_language := '( 497 (program 498 (((listplus ";" expression) "eof") !$1)) 499 500 (expression 501 ((funcall)) 502 ((expression "*" expression) (list 'times !$1 !$3)) 503 ((expression "/" expression) (list 'quotient !$1 !$3)) 504 ((expression "+" expression) (list 'plus !$1 !$3)) 505 ((expression "-" expression) (list 'difference !$1 !$3)) 506 ((expression "<" expression) (list 'lessp !$1 !$3)) 507 ((expression "<=" expression) (list 'lesseq !$1 !$3)) 508 ((expression "==" expression) (list 'equals !$1 !$3)) 509 ((expression "neq" expression) (list 'neq !$1 !$3)) 510 ((expression "=>" expression) (list 'geq !$1 !$3)) 511 ((expression ">" expression) (list 'greaterp !$1 !$3)) 512 ((expression ":=" expression) (list 'setq !$1 !$3)) 513 (("fun" funcall "=" expression) (list 'fun !$2 !$4)) 514 (("if" sequence "then" expression) 515 (list 'cond (list !$2 !$4))) 516 (("if" sequence "then" sequence "else" expression) 517 (list 'cond (list !$2 !$4) (list t !$6))) 518 (("go" (opt "to") !:symbol) (list 'go !$3)) 519 (("goto" !:symbol) (list 'go !$2)) 520 (("return" expression))) 521 522(funcall 523 ((closedexpression)) 524 ((funcall closedexpression))) 525 526(closedexpression 527 ((!:symbol)) 528 ((!:number)) 529 (((plus !:string))) % Several strings in a row just concatenate 530 (("let" sequence "in" sequence "end") (list 'letstat !$2 !$4)) 531 (("(" exprlist ")") (cons 'paren !$2)) 532 (("(" sequence ")") (cons 'paren !$2)) 533 (("[" exprlist "]") (cons 'bracket !$2))) 534 535(exprlist (((list "," expression)))) 536 537(sequence 538 (((list ";" expression)))))$ 539 540% The grammar shown here used to fail for lack of space. It should now 541% behave. 542% One issue it reveals at right now is that the processing here does not 543% keep the types of terminal symbols under control carefully enough, so the 544% numeric values 22 and 33 in the sample text get transliterated back into 545% terminal symbols that happen to have ended up allocated numeric codes 546% 22 and 33. This NEEDS fixing, but is not really an issue for the 547% code that manufactures parsing tables. 548% 549% ACN May 2016 550 551on tracelex, lalr_verbose; 552 553g := lalr_create_parser(p, mini_language)$ 554 555pparse g$ 556fun f(a,b) = a + b; 557f(22,33) 558eof 559 560end; 561 562