# Parser

## The AST

When it comes to the AST, Elixir is a rather specific language due to its macro
system. From the perspective of our parser, the important implication is that
seemingly invalid code can be valid syntax when used in a macro (or simply put
in a `quote` expression). For example:

```elixir
quote do
  def Bar.foo(x), definitely_not_do: 1
  %{a}
  */2
end
```

Unlike in many other languages, core constructs like `def`, `if` and `for` are
not particularly special either, since they are themselves regular functions
(or rather macros). As a result, these constructs can be used "improperly" in a
quoted expression, as shown above.

Consequently, to correctly parse all Elixir code, we need the AST to closely match
the Elixir AST. See [Elixir / Syntax reference](https://hexdocs.pm/elixir/syntax-reference.html)
for more details.

Whenever possible, we try to use more specific nodes (like binary/unary operator),
but only to the extent that we don't lose generality. To get a sense of what the
AST looks like, have a look at the tests in `test/corpus/`.

## Getting started with Tree-sitter

For a detailed introduction, see the official guide on [Creating parsers](https://tree-sitter.github.io/tree-sitter/creating-parsers).

Essentially, we define the relevant language rules in `grammar.js`, based on
which Tree-sitter generates the parser code (under `src/`). In some cases, we
write custom C++ code for tokenizing specific character sequences (in `src/scanner.cc`).

The grammar rules may often conflict with each other, meaning that a given
sequence of tokens has multiple valid interpretations given one _token_ of lookahead.
For many conflicts we always want to pick one interpretation over the other, and
we can do this by assigning different precedence and associativity to the relevant
rules, which tells the parser which way to go.

For example, given `expression1 * expression2 • *`, the next token we _see_ ahead is `*`.
The parser needs to decide whether `expression1 * expression2` is a complete binary
operator node, or whether it should await the next expression and interpret the whole
sequence as `expression1 * (expression2 * expression3)`. Since the `*` operator is
left-associative, we can use `prec.left` on the corresponding grammar rule to inform
the parser how to resolve this conflict.

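For illustration, here is a self-contained toy grammar using `prec.left` this way;
the rule names and shapes are simplified and don't match the actual `grammar.js`:

```js
// Toy grammar, for illustration only. With prec.left, the parser reduces
// `expression1 * expression2` as soon as it sees the second `*`, producing
// the left-associative grouping `(expression1 * expression2) * expression3`.
module.exports = grammar({
  name: "toy",
  rules: {
    source: ($) => $._expression,
    _expression: ($) => choice($.binary_operator, $.identifier),
    binary_operator: ($) =>
      prec.left(seq($._expression, "*", $._expression)),
    identifier: ($) => /[a-z]+/,
  },
});
```
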
However, in some cases looking one token ahead isn't enough, in which case we can
add the conflicting rules to the `conflicts` list in the grammar. Whenever the parser
stumbles upon such a conflict, it uses its GLR algorithm, basically considering both
interpretations until one leads to a parsing error. If both paths parse correctly
(there's a genuine ambiguity), we can use dynamic precedence (`prec.dynamic`) to
decide on the preferred path.

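Again as an illustrative sketch (not taken from this grammar): below, `a a` matches
both rules, the `conflicts` entry lets the parser explore both interpretations, and
dynamic precedence picks `pair` once both succeed:

```js
// Toy grammar, for illustration only: "a a" is genuinely ambiguous.
module.exports = grammar({
  name: "toy",
  // Tell the parser it may need GLR to choose between these rules.
  conflicts: ($) => [[$.pair, $.list]],
  rules: {
    source: ($) => choice($.pair, $.list),
    // When both interpretations parse successfully, the higher dynamic
    // precedence wins, so "a a" is parsed as a pair.
    pair: ($) => prec.dynamic(1, seq("a", "a")),
    list: ($) => repeat1("a"),
  },
});
```
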
## Using the CLI

### tree-sitter

```shell
# See CLI usage
npx tree-sitter -h

# Generate the parser code based on grammar.js
npx tree-sitter generate

# Run tests
npx tree-sitter test
npx tree-sitter test --filter "access syntax"

# Parse a specific file
npx tree-sitter parse tmp/test.ex
npx tree-sitter parse -x tmp/test.ex

# Parse codebase to verify syntax coverage
npx tree-sitter parse --quiet --stat 'tmp/elixir/**/*.ex*'
```

Whenever you make a change to `grammar.js`, remember to run `generate` before
verifying the result. To test custom code, create an Elixir file like
`tmp/test.ex` and use `parse` on it. The `-x` flag prints the source grouped
into AST nodes as XML.

### Additional scripts

```shell
# Format the grammar.js file
npm run format

# Run parser against the given repository
scripts/parse_repo.sh elixir-lang/elixir

# Run parser against a predefined list of popular repositories
scripts/integration_test.sh
```

## Implementation notes

This section covers some of the implementation decisions that have a more
elaborate rationale. The individual subsections are referenced in the code.

### Ref 1. External scanner for quoted content

We want to scan quoted content as a single token, but doing so requires lookahead.
Specifically, the `#` character no longer belongs to the quoted content if followed
by `{`, since that starts an interpolation. Also, inside a heredoc, tokenizing `"`
(or `'`) requires lookahead to know whether it is already part of the end delimiter.

Since we need to use an external scanner, we need to know the delimiter type.
One way to achieve this is to scan the start delimiter in the external scanner
and then store its type as scanner state. This approach requires the parser to
allocate enough memory upfront and to implement serialization/deserialization,
which we would ideally avoid. So we take a different approach! Instead of having
a single `quoted_content` token, we have a specific token for each quoted content
type, such as `_quoted_content_i_single` and `_quoted_content_i_double`. Once the
start delimiter is tokenized, we know which quoted content should be tokenized
next, and from that token we can infer the end delimiter and whether it supports
interpolation. In other words, we extract the information from the parsing state,
rather than maintaining custom state in the scanner.

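In `grammar.js` terms, this corresponds to an externals list along these lines
(an abridged sketch; only the two token names mentioned above come from the text,
and the actual list is longer):

```js
// Abridged sketch of the externals list: one quoted-content token per
// delimiter type; the `_i_` variants allow interpolation.
externals: ($) => [
  $._quoted_content_i_single, // '...' content, interpolation allowed
  $._quoted_content_i_double, // "..." content, interpolation allowed
  // ...one token for each remaining delimiter type
],
```
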
### Ref 2. External scanner for newlines

Generally, newlines may appear in the middle of expressions and we ignore them
as long as the expression is valid; that's why we list newline under extras.

When a newline follows a complete expression, most of the time it should be
treated as a terminator. However, there are specific cases where the newline is
non-breaking and treated as if it were just a space. These cases are:

  * a call followed by a newline and a `do end` block
  * an expression followed by a newline and a binary operator (such as a
    leading `|>` on the next line)

In both cases we want to tokenize the newline as non-breaking, so we use the
external scanner for lookahead.

Note that the relevant rules already specify left/right associativity, so if we
simply added `optional("\n")`, the conflicts would be resolved immediately by
those precedences, rather than with GLR.

Additionally, since comments may appear anywhere and don't change the context,
we also tokenize newlines before comments as non-breaking.

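A sketch of the idea with hypothetical token names (the actual externals in
`grammar.js` may be organised differently):

```js
// Hypothetical sketch: the external scanner looks past the newline and only
// emits the non-breaking variant before `do`, a binary operator, or a comment.
externals: ($) => [
  $._terminator,        // newline that ends the preceding expression
  $._newline_before_do, // non-breaking newline, e.g. before a `do end` block
],
```
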
### Ref 3. External scanner for unary + and -

Plus and minus are either binary or unary operators, depending on the context.
Consider the following variants:

```
a + b
a+b
a+ b
a +b
```

In the first three expressions `+` is a binary operator, while in the last one
`+` is a unary operator and the whole expression is a local call with a single
argument, that is `a(+b)`.

To correctly tokenize all cases, we use the external scanner to emit a special
empty token (`_before_unary_operator`) when the spacing matches `a +b`, which
forces the parser to pick the unary operator path.

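On the grammar side this might be consumed roughly as follows (a simplified
sketch, not the actual rule):

```js
// Sketch: `_before_unary_operator` is zero-width, so it only steers the
// parser; the scanner emits it when the spacing matches `a +b`.
externals: ($) => [$._before_unary_operator],

unary_operator: ($) =>
  seq(optional($._before_unary_operator), choice("+", "-"), $._expression),
```
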
### Ref 4. External scanner for `not in`

The `not in` operator may have arbitrary inline whitespace between `not` and `in`.

We cannot use a regular expression like `/not[ \t]+in/`, because it would also
match in expressions like `a not inn`, where it wins as the longest matching token.

A possible solution could be `seq("not", "in")` with dynamic conflict resolution,
but then we would end up with two separate tokens. Also, to properly handle
`a not inn`, we would need keyword extraction, which causes problems in our case
(https://github.com/tree-sitter/tree-sitter/issues/1404).

In the end it's easiest to use the external scanner, so that we can skip inline
whitespace and ensure the token ends right after `in`.

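Sketched on the grammar side (the external token name and the rule shape are
illustrative):

```js
// Sketch: the scanner consumes `not`, any inline whitespace, then `in`,
// succeeding only when the token ends exactly after `in`.
externals: ($) => [$._not_in],

binary_operator: ($) =>
  prec.left(seq($._expression, alias($._not_in, "not in"), $._expression)),
```
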
### Ref 5. External scanner for quoted atom start

To parse a quoted atom, we could make the `"` (or `'`) token after `:` immediate;
however, this would require adding immediate rules for single/double quoted content
and listing them in the relevant places. We could definitely do that, but using
the external scanner is actually simpler.

### Ref 6. Identifier pattern

See [Elixir / Unicode Syntax](https://hexdocs.pm/elixir/unicode-syntax.html) for
the official notes.

Tree-sitter already supports unicode properties in regular expressions; however,
character class subtraction is not supported.

For the base `<Start>` and `<Continue>` we can use `[\p{ID_Start}]` and
`[\p{ID_Continue}]` respectively, since both are supported and, according to the
[Unicode Annex #31](https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers),
they match the ranges listed in the Elixir docs.

For atoms this translates to a clean regular expression.

For variables, however, we want to exclude the uppercase (`\p{Lu}`) and titlecase
(`\p{Lt}`) categories from `\p{ID_Start}`. As already mentioned, we cannot use
group subtraction in the regular expression, so instead we need to build a
suitable group of characters on our own.

After removing the uppercase/titlecase categories from `[\p{ID_Start}]`, we obtain
the following group:

`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]`

At the time of writing, the subtracted groups actually remove only a single character:

```elixir
Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char(
  "[[[:Ll:][:Lm:][:Lo:][:Nl:][:Other_ID_Start:]] & [[:Pattern_Syntax:][:Pattern_White_Space:]]]"
)
#=> {:ok, [11823]}
```

Consequently, by dropping the subtraction we allow just one additional (uncommon)
character, which is perfectly acceptable.

It's important to note that JavaScript regular expressions don't support the
`\p{Other_ID_Start}` unicode category. Fortunately, this category is a small set
of characters introduced for
[backward compatibility](https://unicode.org/reports/tr31/#Backward_Compatibility),
so we can enumerate it manually:

```elixir
Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char("[[[:Other_ID_Start:]] - [[:Pattern_Syntax:][:Pattern_White_Space:]]]")
|> elem(1)
|> Enum.flat_map(fn
  n when is_number(n) -> [n]
  range -> range
end)
|> Enum.map(&Integer.to_string(&1, 16))
#=> ["1885", "1886", "2118", "212E", "309B", "309C"]
```

Finally, we obtain this regular expression group for the variable `<Start>`:

`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C]`

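As a sketch of how this group could be used in `grammar.js` (the leading
underscore and the trailing `?`/`!`, both allowed in Elixir variable names, are
additions beyond the derivation above, and the actual rule may differ):

```js
// Sketch: the derived group as variable <Start>, followed by <Continue>.
identifier: ($) =>
  /[_\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C][\p{ID_Continue}]*[?!]?/,
```
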
### Ref 7. Keyword token

We tokenize the whole keyword sequence, like `do: `, as a single token. Ideally
we wouldn't include the whitespace, but since we use `token` it gets included.
This is an intentionally accepted tradeoff, because using `token` significantly
simplifies the grammar and avoids conflicts.

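For illustration, a simplified version of such a rule (the real literal pattern
is broader than this one):

```js
// Sketch: the entire keyword, including the `:` and one trailing whitespace
// character, is matched as a single token.
keyword: ($) => token(seq(/[a-zA-Z_]\w*[?!]?/, ":", /[ \t\n]/)),
```
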
The alternative approach would be to define the keyword as
`seq(alias(choice(...), $._keyword_literal), $._keyword_end)`, where we list all
other tokens that make for a valid keyword literal and use a custom scanner for
`_keyword_end` to look ahead without tokenizing the whitespace. However, this
approach generates a number of conflicts, because `:` is tokenized separately
and phrases like `fun fun • do` or `fun • {}` are ambiguous (the interpretation
depends on whether `:` comes next). Resolving some of these conflicts (for
instance, for special keywords like `{}` or `%{}`) requires the external scanner.
Given the complexity this approach brings to the grammar, and consequently the
parser, we stick with the simpler approach.