| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| cst/ | 24-Mar-2021 | - | 9,786 | 6,794 |
| README.md | 24-Mar-2021 | 3.4 KiB | 102 | 80 |
| ast_build.go | 24-Mar-2021 | 7.9 KiB | 282 | 214 |
| ast_print.go | 24-Mar-2021 | 6.7 KiB | 309 | 220 |
| ast_test.go | 24-Mar-2021 | 1.1 KiB | 61 | 51 |
| ast_types.go | 24-Mar-2021 | 4.9 KiB | 121 | 89 |

README.md

Parsing a Miller DSL (domain-specific language) expression goes through three representations:

* Source code, which is a string of characters
* Abstract syntax tree (AST)
* Concrete syntax tree (CST)

The job of the GOCC parser is to turn the DSL string into an AST.

The job of the CST builder is to turn the AST into a CST.

The job of the `put` and `filter` transformers is to execute the CST statements on each input record.

# Source-code representation

The source code is the text of the DSL expression: for example, the part between the single quotes in

`mlr put '$v = $i + $x * 4 + 100.7 * $y' myfile.dat`

# AST representation

Use `put -v` to display the AST:

```
$ mlr -n put -v '$v = $i + $x * 4 + 100.7 * $y'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "v"
        * Operator "+" "+"
            * Operator "+" "+"
                * DirectFieldName "md_token_field_name" "i"
                * Operator "*" "*"
                    * DirectFieldName "md_token_field_name" "x"
                    * IntLiteral "md_token_int_literal" "4"
            * Operator "*" "*"
                * FloatLiteral "md_token_float_literal" "100.7"
                * DirectFieldName "md_token_field_name" "y"
```

Note the following about the AST:

* Parentheses, commas, semicolons, line endings, and whitespace are all stripped away
* Variable names and literal values remain as leaf nodes of the AST
* Operators like `=` `+` `-` `*` `/` `**`, function names, and so on remain as non-leaf nodes of the AST
* Operator precedence is clear from the tree structure

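To make the notes above concrete, here is a minimal Go sketch of a tree with operators as non-leaf nodes and literals as leaves, printed with four spaces of indentation per level. The `Node` type and the `leaf`, `op`, and `Format` helpers are hypothetical names chosen for illustration, not the types in `ast_types.go`, and the output is simplified to one quoted token per node.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical AST node for illustration only; Miller's real
// node type lives in ast_types.go and may look different.
type Node struct {
	Type     string // e.g. "Operator", "IntLiteral"
	Token    string // e.g. "+", "4"
	Children []*Node
}

// leaf builds a node with no children (a variable name or a literal).
func leaf(nodeType, token string) *Node {
	return &Node{Type: nodeType, Token: token}
}

// op builds a non-leaf operator node.
func op(token string, children ...*Node) *Node {
	return &Node{Type: "Operator", Token: token, Children: children}
}

// Format renders the tree one node per line, indenting four spaces per level.
func Format(n *Node, depth int) string {
	var sb strings.Builder
	sb.WriteString(strings.Repeat("    ", depth))
	fmt.Fprintf(&sb, "* %s %q\n", n.Type, n.Token)
	for _, child := range n.Children {
		sb.WriteString(Format(child, depth+1))
	}
	return sb.String()
}

func main() {
	// Hand-built tree for 1 + 2 * 3: the "*" node sits below the "+" node.
	tree := op("+",
		leaf("IntLiteral", "1"),
		op("*", leaf("IntLiteral", "2"), leaf("IntLiteral", "3")))
	fmt.Print(Format(tree, 0))
}
```
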
Operator-precedence examples:

```
$ mlr -n put -v '$x = 1 + 2 * 3'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "+" "+"
            * IntLiteral "md_token_int_literal" "1"
            * Operator "*" "*"
                * IntLiteral "md_token_int_literal" "2"
                * IntLiteral "md_token_int_literal" "3"
```

```
$ mlr -n put -v '$x = 1 * 2 + 3'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "+" "+"
            * Operator "*" "*"
                * IntLiteral "md_token_int_literal" "1"
                * IntLiteral "md_token_int_literal" "2"
            * IntLiteral "md_token_int_literal" "3"
```

```
$ mlr -n put -v '$x = 1 * (2 + 3)'
RAW AST:
* StatementBlock
    * SrecDirectAssignment "=" "="
        * DirectFieldName "md_token_field_name" "x"
        * Operator "*" "*"
            * IntLiteral "md_token_int_literal" "1"
            * Operator "+" "+"
                * IntLiteral "md_token_int_literal" "2"
                * IntLiteral "md_token_int_literal" "3"
```

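The three expressions `1 + 2 * 3`, `1 * 2 + 3`, and `1 * (2 + 3)` can also be checked arithmetically. The sketch below assumes a hypothetical integer-only `Expr` node (not Miller's AST type) and evaluates children before parents, so whichever operator sits deeper in the tree is applied first.

```go
package main

import "fmt"

// Expr is a hypothetical integer-only expression node, for illustration only.
type Expr struct {
	Op          byte // '+' or '*'; 0 marks a literal
	Value       int  // used only when Op == 0
	Left, Right *Expr
}

func lit(v int) *Expr { return &Expr{Value: v} }

func bin(op byte, left, right *Expr) *Expr {
	return &Expr{Op: op, Left: left, Right: right}
}

// Eval computes children before parents, so deeper operators bind tighter.
func Eval(e *Expr) int {
	switch e.Op {
	case '+':
		return Eval(e.Left) + Eval(e.Right)
	case '*':
		return Eval(e.Left) * Eval(e.Right)
	default:
		return e.Value
	}
}

func main() {
	fmt.Println(Eval(bin('+', lit(1), bin('*', lit(2), lit(3))))) // 1 + 2 * 3   = 7
	fmt.Println(Eval(bin('+', bin('*', lit(1), lit(2)), lit(3)))) // 1 * 2 + 3   = 5
	fmt.Println(Eval(bin('*', lit(1), bin('+', lit(2), lit(3))))) // 1 * (2 + 3) = 5
}
```

The last two results happen to agree numerically even though the trees differ. No parentheses are stored anywhere, since the grouping is carried entirely by the shape of the tree.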
# CST representation

There's no `-v` display for the CST, but it's simply a reshaping of the AST,
with function pointers set up in advance to handle each type of statement
on a per-record basis.

The if/else and switch statements that decide what to do with each AST node
are executed once, at CST-build time, so they don't need to be repeated each
time the syntax tree is executed, which is once per data record.

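One way to picture the build-once, run-many idea is with closures. This is a sketch under stated assumptions rather than the code in `./cst`: `ASTNode`, `Record`, `Evaluator`, and `Build` are hypothetical names, and records are simplified to maps of floats.

```go
package main

import (
	"fmt"
	"strconv"
)

// ASTNode and Record are hypothetical stand-ins, not Miller's actual types.
type ASTNode struct {
	Type     string // "Operator", "DirectFieldName", or "FloatLiteral"
	Token    string
	Children []*ASTNode
}

type Record map[string]float64

// Evaluator plays the role of a CST node: a ready-to-call function of one record.
type Evaluator func(rec Record) float64

// Build walks the AST once. The switch on node type happens here, at build
// time; the returned closures contain no further type dispatch.
func Build(n *ASTNode) (Evaluator, error) {
	switch n.Type {
	case "FloatLiteral":
		v, err := strconv.ParseFloat(n.Token, 64)
		if err != nil {
			return nil, err
		}
		return func(Record) float64 { return v }, nil
	case "DirectFieldName":
		name := n.Token
		return func(rec Record) float64 { return rec[name] }, nil
	case "Operator":
		if len(n.Children) != 2 {
			return nil, fmt.Errorf("operator %q needs two operands", n.Token)
		}
		left, err := Build(n.Children[0])
		if err != nil {
			return nil, err
		}
		right, err := Build(n.Children[1])
		if err != nil {
			return nil, err
		}
		switch n.Token {
		case "+":
			return func(rec Record) float64 { return left(rec) + right(rec) }, nil
		case "*":
			return func(rec Record) float64 { return left(rec) * right(rec) }, nil
		}
		return nil, fmt.Errorf("unsupported operator %q", n.Token)
	}
	return nil, fmt.Errorf("unsupported node type %q", n.Type)
}

func main() {
	// Hand-built AST for: $x * 4 + $y
	ast := &ASTNode{Type: "Operator", Token: "+", Children: []*ASTNode{
		{Type: "Operator", Token: "*", Children: []*ASTNode{
			{Type: "DirectFieldName", Token: "x"},
			{Type: "FloatLiteral", Token: "4"},
		}},
		{Type: "DirectFieldName", Token: "y"},
	}}
	eval, err := Build(ast) // built once
	if err != nil {
		panic(err)
	}
	for _, rec := range []Record{{"x": 1, "y": 2}, {"x": 3, "y": 0.5}} {
		fmt.Println(eval(rec)) // run once per record
	}
}
```

In this sketch the cost of inspecting node types is paid once in `Build`, and each record only pays for the function calls themselves.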
# Source directories/files

* The AST logic is in `./ast*.go`. To avoid a Go package-dependency cycle, I didn't use a `src/dsl/ast` naming convention, although that would have been nice.
* The CST logic is in [`./cst`](./cst). Please see [cst/README.md](./cst/README.md) for more information.
