# Parser

## The AST

When it comes to the AST, Elixir is a rather specific language due to its macro system.
From the perspective of our parser, the important implication is that seemingly
invalid code can be valid syntax when used in a macro (or simply put in a `quote`
expression). For example:

```elixir
quote do
  def Bar.foo(x), definitely_not_do: 1
  %{a}
  */2
end
```

As opposed to other languages, core constructs like `def`, `if` and `for` are not
particularly special either, since they are themselves regular functions (or rather macros).
As a result, these constructs can be used "improperly" in a quoted expression, as shown above.

Consequently, to correctly parse all Elixir code, we need the AST to closely match
the Elixir AST. See [Elixir / Syntax reference](https://hexdocs.pm/elixir/syntax-reference.html)
for more details.

Whenever possible, we try to use more specific nodes (like binary/unary operators), but only
to the extent that doesn't lose generality. To get a sense of what the AST looks like, have
a look at the tests in `test/corpus/`.

## Getting started with Tree-sitter

For a detailed introduction see the official guide on [Creating parsers](https://tree-sitter.github.io/tree-sitter/creating-parsers).

Essentially, we define the relevant language rules in `grammar.js`, based on which
Tree-sitter generates the parser code (under `src/`). In some cases, we write
custom C++ code for tokenizing specific character sequences (in `src/scanner.cc`).

Grammar rules may often conflict with each other, meaning that a given
sequence of tokens has multiple valid interpretations given one _token_ of lookahead.
In many conflicts we always want to pick one interpretation over the other, and we can
do this by assigning precedence and associativity to the relevant rules, which
tells the parser which way to go.
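Schematically, associativity is attached to a rule via `prec.left`/`prec.right`. The fragment below is only an illustrative sketch, not the actual grammar (the rule name and precedence level are made up):

```javascript
// Illustrative sketch (hypothetical rule name and precedence level):
// `*` is left-associative, so the rule uses prec.left; a right-associative
// operator such as `=` would use prec.right instead.
binary_operator: ($) =>
  prec.left(210, seq($._expression, "*", $._expression)),
```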
For example, given `expression1 * expression2 • *` the next token we _see_ ahead is `*`.
The parser needs to decide whether `expression1 * expression2` is a complete binary operator
node, or whether it should await the next expression and interpret it as `expression1 * (expression2 * expression3)`.
Since the `*` operator is left-associative, we can use `prec.left` on the corresponding
grammar rule to inform the parser how to resolve this conflict.

However, in some cases looking one token ahead isn't enough, in which case we can add
the conflicting rules to the `conflicts` list in the grammar. Whenever the parser stumbles
upon such a conflict it uses its GLR algorithm, essentially considering both interpretations
until one leads to a parsing error. If both paths parse correctly (there's a genuine ambiguity),
we can use dynamic precedence (`prec.dynamic`) to decide on the preferred path.

## Using the CLI

### tree-sitter

```shell
# See CLI usage
npx tree-sitter -h

# Generate the parser code based on grammar.js
npx tree-sitter generate

# Run tests
npx tree-sitter test
npx tree-sitter test --filter "access syntax"

# Parse a specific file
npx tree-sitter parse tmp/test.ex
npx tree-sitter parse -x tmp/test.ex

# Parse a codebase to verify syntax coverage
npx tree-sitter parse --quiet --stat 'tmp/elixir/**/*.ex*'
```

Whenever you make a change to `grammar.js`, remember to run `generate`
before verifying the result. To test custom code, create an Elixir file
like `tmp/test.ex` and use `parse` on it. The `-x` flag prints out the
source grouped into AST nodes as XML.
83 84### Additional scripts 85 86```shell 87# Format the grammar.js file 88npm run format 89 90# Run parser against the given repository 91scripts/parse_repo.sh elixir-lang/elixir 92 93# Run parser against a predefined list of popular repositories 94scripts/integration_test.sh 95``` 96 97## Implementation notes 98 99This section covers some of the implementation decisions that have a more 100elaborated rationale. The individual subsections are referenced in the code. 101 102### Ref 1. External scanner for quoted content 103 104We want to scan quoted content as a single token, but it requires lookahead. 105Specifically the `#` character may no longer be quoted content if followed by `{`. 106Also, inside heredoc string tokenizing `"` (or `'`) requires lookahead to know 107if it's already part of the end delimiter or not. 108 109Since we need to use external scanner, we need to know the delimiter type. 110One way to achieve this is using external scanner to scan the start delimiter 111and then storing its type on the parser stack. This approach requires the parser 112to allocate enough memory upfront and implement serialization/deserialization, 113which ideally would be avoided. To avoid this, we use a different approach! 114Instead of having a single `quoted_content` token, we have specific tokens for 115each quoted content type, such as `_quoted_content_i_single`, `_quoted_content_i_double`. 116Once the start delimiter is tokenized, we know which quoted content should be 117tokenized next, and from the token we can infer the end delimiter and whether 118it supports interpolation. In other words, we extract the information from the 119parsing state, rather than maintaining custom parser state. 120 121### Ref 2. External scanner for newlines 122 123Generally newlines may appear in the middle of expressions and we ignore them 124as long as the expression is valid, that's why we list newline under extras. 
When a newline follows a complete expression, most of the time it should be
treated as a terminator. However, there are specific cases where the newline is
non-breaking and treated as if it were just a space. These cases are:

  * a call followed by a newline and a `do end` block
  * an expression followed by a newline and a binary operator

In both cases we want to tokenize the newline as non-breaking, so we use the external
scanner for lookahead.

Note that the relevant rules already specify left/right associativity, so if we
simply added `optional("\n")` the conflicts would be resolved immediately rather
than by using GLR.

Additionally, since comments may appear anywhere and don't change the context,
we also tokenize newlines before comments as non-breaking.

### Ref 3. External scanner for unary + and -

Plus and minus are either binary or unary operators, depending on the context.
Consider the following variants:

```
a + b
a+b
a+ b
a +b
```

In the first three expressions `+` is a binary operator, while in the last one
`+` is a unary operator and `+b` is an argument in a local call.

To correctly tokenize all cases, we use the external scanner to tokenize a special empty
token (`_before_unary_operator`) when the spacing matches `a +b`, which forces the
parser to pick the unary operator path.

### Ref 4. External scanner for `not in`

The `not in` operator may have arbitrary inline whitespace between `not` and `in`.

We cannot use a regular expression like `/not[ \t]+in/`, because as the longest matching
token it would also match in expressions like `a not inn`.

A possible solution could be `seq("not", "in")` with dynamic conflict resolution, but
then we get two separate tokens. Also, to properly handle `a not inn`, we would need
keyword extraction, which causes problems in our case (https://github.com/tree-sitter/tree-sitter/issues/1404).
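The longest-match problem can be reproduced directly with a JavaScript regular expression:

```javascript
// Why /not[ \t]+in/ cannot serve as the `not in` token: the pattern also
// matches inside `a not inn`, where `in` is just a prefix of `inn`.
const notIn = /not[ \t]+in/;

const match = notIn.exec("a not inn");
console.log(match[0]); // "not in" — a false positive inside `inn`
console.log(notIn.test("a not   in b")); // true — the intended case
```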
In the end it's easiest to use the external scanner, so that we can skip inline whitespace
and ensure the token ends after `in`.

### Ref 5. External scanner for quoted atom start

For parsing a quoted atom we could make the `"` (or `'`) token after `:` immediate, however this
would require adding immediate rules for single/double quoted content and listing them
in the relevant places. We could certainly do that, but using the external scanner is actually
simpler.

### Ref 6. Identifier pattern

See [Elixir / Unicode Syntax](https://hexdocs.pm/elixir/unicode-syntax.html) for the official
notes.

Tree-sitter already supports Unicode properties in regular expressions, however character
class subtraction is not supported.

For the base `<Start>` and `<Continue>` we can use `[\p{ID_Start}]` and `[\p{ID_Continue}]`
respectively, since both are supported and, according to
[Unicode Annex #31](https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers),
they match the ranges listed in the Elixir docs.

For atoms this translates to a clean regular expression.

For variables, however, we want to exclude the uppercase (`\p{Lu}`) and titlecase (`\p{Lt}`)
categories from `\p{ID_Start}`. As already mentioned, we cannot use group subtraction
in the regular expression, so instead we need to create a suitable group of characters
on our own.
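The constraint can be checked in plain JavaScript (Node 12+), which supports Unicode property escapes but, like Tree-sitter, no class subtraction. The two regexes here are simplified stand-ins for the real `<Start>` classes:

```javascript
// Simplified stand-ins for the <Start> classes (the real variable class,
// derived in this section, additionally handles \p{Other_ID_Start}):
const atomStart = /^[\p{ID_Start}]/u; // atoms may start with any ID_Start
const variableStart = /^[\p{Ll}\p{Lm}\p{Lo}\p{Nl}]/u; // Lu/Lt excluded by hand

console.log(atomStart.test("Foo")); // true
console.log(variableStart.test("foo")); // true
console.log(variableStart.test("Foo")); // false — uppercase (\p{Lu})
console.log(variableStart.test("ǅx")); // false — titlecase (\p{Lt})
```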
After removing the uppercase/titlecase categories from `[\p{ID_Start}]`, we obtain the
following group:

`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]`

At the time of writing, the subtracted groups actually remove only a single character:

```elixir
Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char(
  "[[[:Ll:][:Lm:][:Lo:][:Nl:][:Other_ID_Start:]] & [[:Pattern_Syntax:][:Pattern_White_Space:]]]"
)
#=> {:ok, [11823]}
```

Consequently, by dropping the subtraction we allow just one additional (uncommon) character,
which is perfectly acceptable.

It's important to note that JavaScript regular expressions don't support the `\p{Other_ID_Start}`
Unicode category. Fortunately this category is a small set of characters introduced for
[backward compatibility](https://unicode.org/reports/tr31/#Backward_Compatibility), so we can
enumerate it manually:

```elixir
Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char("[[[:Other_ID_Start:]] - [[:Pattern_Syntax:][:Pattern_White_Space:]]]")
|> elem(1)
|> Enum.flat_map(fn
  n when is_number(n) -> [n]
  range -> range
end)
|> Enum.map(&Integer.to_string(&1, 16))
#=> ["1885", "1886", "2118", "212E", "309B", "309C"]
```

Finally, we obtain this regular expression group for the variable `<Start>`:

`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C]`

### Ref 7. Keyword token

We tokenize a whole keyword sequence like `do: ` as a single token.
Ideally we wouldn't include the whitespace, but since we use `token`
it gets included. This is an intentionally accepted tradeoff,
because using `token` significantly simplifies the grammar and avoids
conflicts.
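A simplified sketch of such a `token` rule (the exact pattern in `grammar.js` is more involved; this one is made up for illustration):

```javascript
// Hypothetical sketch: the identifier, the `:` and the following whitespace
// are merged into one token, so the parser never tokenizes `:` on its own.
_keyword: ($) => token(seq(/[_a-zA-Z]\w*[?!]?/, ":", /[ \t\n]/)),
```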
The alternative approach would be to define the keyword as `seq(alias(choice(...), $._keyword_literal), $._keyword_end)`,
where we list all the other tokens that make for a valid keyword literal
and use a custom scanner for `_keyword_end` to look ahead without tokenizing
the whitespace. However, this approach generates a number of conflicts,
because `:` is tokenized separately and phrases like `fun fun • do` or
`fun • {}` are ambiguous (the interpretation depends on whether `:` comes next).
Resolving some of these conflicts (for instance, for special keywords like `{}` or `%{}`)
requires the use of the external scanner. Given the complexity this approach
brings to the grammar, and consequently the parser, we stick to the simpler
approach.