1## JS syntactic quirks 2 3> *To make a labyrinth, it takes* 4> *Some good intentions, some mistakes.* 5> —A. E. Stallings, “Daedal” 6 7JavaScript is rather hard to parse. Here is an in-depth accounting of 8its syntactic quirks, with an eye toward actually implementing a parser 9from scratch. 10 11With apologies to the generous people who work on the standard. Thanks 12for doing that—better you than me. 13 14Thanks to [@bakkot](https://github.com/bakkot) and 15[@mathiasbynens](https://github.com/mathiasbynens) for pointing out 16several additional quirks. 17 18Problems are rated in terms of difficulty, from `(*)` = easy to `(***)` 19= hard. We’ll start with the easiest problems. 20 21 22### Dangling else (*) 23 24If you know what this is, you may be excused. 25 26Statements like `if (EXPR) STMT if (EXPR) STMT else STMT` 27are straight-up ambiguous in the JS formal grammar. 28The ambiguity is resolved with 29[a line of specification text](https://tc39.es/ecma262/#sec-if-statement): 30 31> Each `else` for which the choice of associated `if` is ambiguous shall 32> be associated with the nearest possible `if` that would otherwise have 33> no corresponding `else`. 34 35I love this sentence. Something about it cracks me up, I dunno... 36 37In a recursive descent parser, just doing the dumbest possible thing 38correctly implements this rule. 39 40A parser generator has to decide what to do. In Yacc, you can use 41operator precedence for this. 42 43Yacc aside: This should seem a little outrageous at first, as `else` is 44hardly an operator. It helps if you understand what Yacc is doing. In LR 45parsers, this kind of ambiguity in the grammar manifests as a 46shift-reduce conflict. In this case, when we’ve already parsed `if ( 47Expression ) Statement if ( Expression ) Statement` 48and are looking at `else`, it’s unclear to Yacc 49whether to reduce the if-statement or shift `else`. Yacc does not offer 50a feature that lets us just say "always shift `else` here"; but there 51*is* a Yacc feature that lets us resolve shift-reduce conflicts in a 52rather odd, indirect way: operator precedence. We can resolve this 53conflict by making `else` higher-precedence than the preceding symbol 54`)`. 55 56Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]" 57at the end of the IfStatement production that doesn’t have an `else`. 58 59 60### Other ambiguities and informal parts of the spec (*) 61 62Not all of the spec is as formal as it seems at first. Most of the stuff 63in this section is easy to deal with, but #4 is special. 64 651. The lexical grammar is ambiguous: when looking at the characters `<<=`, 66 there is the question of whether to parse that as one token `<<=`, two 67 tokens (`< <=` or `<< =`), or three (`< < =`). 68 69 Of course every programming language has this, and the fix is one 70 sentence of prose in the spec: 71 72 > The source text is scanned from left to right, repeatedly taking the 73 > longest possible sequence of code points as the next input element. 74 75 This is easy enough for hand-coded lexers, and for systems that are 76 designed to use separate lexical and syntactic grammars. (Other 77 parser generators may need help to avoid parsing `functionf(){}` as 78 a function.) 79 802. The above line of prose does not apply *within* input elements, in 81 components of the lexical grammar. In those cases, the same basic 82 idea ("maximum munch") is specified using lookahead restrictions at 83 the end of productions: 84 85 > *LineTerminatorSequence* :: 86 > <LF> 87 > <CR>[lookahead ≠ <LF>] 88 > <LS> 89 > <PS> 90 > <CR><LF> 91 92 The lookahead restriction prevents a CR LF sequence from being 93 parsed as two adjacent *LineTerminatorSequence*s. 94 95 This technique is used in several places, particularly in 96 [*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence). 97 983. Annex B.1.4 extends the syntax for regular expressions, making the 99 grammar ambiguous. Again, a line of prose explains how to cope: 100 101 > These changes introduce ambiguities that are broken by the 102 > ordering of grammar productions and by contextual 103 > information. When parsing using the following grammar, each 104 > alternative is considered only if previous production alternatives 105 > do not match. 106 1074. Annex B.1.2 extends the syntax of string literals to allow legacy 108 octal escape sequences, like `\033`. It says: 109 110 > The syntax and semantics of 11.8.4 is extended as follows except 111 > that this extension is not allowed for strict mode code: 112 113 ...followed by a new definition of *EscapeSequence*. 114 115 So there are two sets of productions for *EscapeSequence*, and an 116 implementation is required to implement both and dynamically switch 117 between them. 118 119 This means that `function f() { "\033"; "use strict"; }` is a 120 SyntaxError, even though the octal escape is scanned before we know 121 we're in strict mode. 122 123For another ambiguity, see "Slashes" below. 124 125 126### Unicode quirks 127 128JavaScript source is Unicode and usually follows Unicode rules for thing 129like identifiers and whitespace, but it has a few special cases: `$`, 130`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are 131legal in identifiers (the latter two only after the first character), and 132`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is 133treated as whitespace. 134 135It also allows any code point, including surrogate halves, even though the 136Unicode standard says that unpaired surrogate halves should be treated as 137encoding errors. 138 139 140### Legacy octal literals and escape sequences (*) 141 142This is more funny than difficult. 143 144In a browser, in non-strict code, every sequence of decimal digits (not 145followed by an identifier character) is a *NumericLiteral* token. 146 147If it starts with `0`, with more digits after, then it's a legacy Annex 148B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal 149number. Otherwise, hilariously, it's octal. 150 151``` 152js> [067, 068, 069, 070] 153[55, 68, 69, 56] 154``` 155 156There are also legacy octal escape sequences in strings, and these have 157their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8 158is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0` 159followed by `8` or `9` is a legacy octal escape sequence representing 160the null character. (Not to be confused with `\0` in strict code, not 161followed by a digit, which still represents the null character, but 162doesn't count as octal.) 163 164None of this is hard to implement, but figuring out what the spec says 165is hard. 166 167 168### Strict mode (*) 169 170*(entangled with: lazy compilation)* 171 172A script or function can start with this: 173 174```js 175"use strict"; 176``` 177 178This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript). 179Additionally, all classes and modules are strict mode code. 180 181Strict mode has both parse-time and run-time effects. Parse-time effects 182include: 183 184* Strict mode affects the lexical grammar: octal integer literals are 185 SyntaxErrors, octal character escapes are SyntaxErrors, and a 186 handful of words like `private` and `interface` are reserved (and 187 thus usually SyntaxErrors) in strict mode. 188 189 Like the situation with slashes, this means it is not possible to 190 implement a complete lexer for JS without also parsing—at least 191 enough to detect class boundaries, "use strict" directives in 192 functions, and function boundaries. 193 194* It’s a SyntaxError to have bindings named `eval` or `arguments` in 195 strict mode code, or to assign to `eval` or `arguments`. 196 197* It’s a SyntaxError to have two argument bindings with the same name 198 in a strict function. 199 200 Interestingly, you don’t always know if you’re in strict mode or not 201 when parsing arguments. 202 203 ```js 204 function foo(a, a) { 205 "use strict"; 206 } 207 ``` 208 209 When the implementation reaches the Use Strict Directive, it must 210 either know that `foo` has two arguments both named `a`, or switch 211 to strict mode, go back, and reparse the function from the 212 beginning. 213 214 Fortunately an Early Error rule prohibits mixing `"use strict"` with 215 more complex parameter lists, like `function foo(x = eval('')) {`. 216 217* The expression syntax “`delete` *Identifier*” and the abominable 218 *WithStatement* are banned in strict mode. 219 220 221### Conditional keywords (**) 222 223In some programming languages, you could write a lexer that has rules 224like 225 226* When you see `if`, return `Token::If`. 227 228* When you see something like `apple` or `arrow` or `target`, 229 return `Token::Identifier`. 230 231Not so in JavaScript. The input `if` matches both the terminal `if` and 232the nonterminal *IdentifierName*, both of which appear in the high-level 233grammar. The same goes for `target`. 234 235This poses a deceptively difficult problem for table-driven parsers. 236Such parsers run on a stream of token-ids, but the question of which 237token-id to use for a word like `if` or `target` is ambiguous. The 238current parser state can't fully resolve the ambiguity: there are cases 239like `class C { get` where the token `get` might match either as a 240keyword (the start of a getter) or as an *IdentifierName* (a method or 241property named `get`) in different grammatical productions. 242 243All keywords are conditional, but some are more conditional than others. 244The rules are inconsistent to a tragicomic extent. Keywords like `if` 245that date back to JavaScript 1.0 are always keywords except when used as 246property names or method names. They can't be variable names. Two 247conditional keywords (`await` and `yield`) are in the *Keyword* list; 248the rest are not. New syntax that happened to be introduced around the 249same time as strict mode was awarded keyword status in strict mode. The 250rules are scattered through the spec. All this interacts with `\u0065` 251Unicode escape sequences somehow. It’s just unbelievably confusing. 252 253(After writing this section, I 254[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694) 255to make it a little less confusing.) 256 257* Thirty-six words are always reserved: 258 259 > `break` `case` `catch` `class` `const` `continue` `debugger` 260 > `default` `delete` `do` `else` `enum` `export` `extends` `false` 261 > `finally` `for` `function` `if` `import` `in` `instanceof` `new` 262 > `null` `return` `super` `switch` `this` `throw` `true` `try` 263 > `typeof` `var` `void` `while` `with` 264 265 These tokens can't be used as names of variables or arguments. 266 They're always considered special *except* when used as property 267 names, method names, or import/export names in modules. 268 269 ```js 270 // property names 271 let obj = {if: 3, function: 4}; 272 assert(obj.if == 3); 273 274 // method names 275 class C { 276 if() {} 277 function() {} 278 } 279 280 // imports and exports 281 import {if as my_if} from "modulename"; 282 export {my_if as if}; 283 ``` 284 285* Two more words, `yield` and `await`, are in the *Keyword* list but 286 do not always act like keywords in practice. 287 288 * `yield` is a *Keyword*; but it can be used as an identifier, 289 except in generators and strict mode code. 290 291 This means that `yield - 1` is valid both inside and outside 292 generators, with different meanings. Outside a generator, it’s 293 subtraction. Inside, it yields the value **-1**. 294 295 That reminds me of the Groucho Marx line: Outside of a dog, a 296 book is a man’s best friend. Inside of a dog it’s too dark to 297 read. 298 299 * `await` is like that, but in async functions. Also it’s not a 300 valid identifier in modules. 301 302 Conditional keywords are entangled with slashes: `yield /a/g` is two 303 tokens in a generator but five tokens elsewhere. 304 305* In strict mode code, `implements`, `interface`, `package`, 306 `private`, `protected`, and `public` are reserved (via Early Errors 307 rules). 308 309 This is reflected in the message and location information for 310 certain syntax errors: 311 312 ``` 313 SyntaxError: implements is a reserved identifier: 314 class implements {} 315 ......^ 316 317 SyntaxError: implements is a reserved identifier: 318 function implements() { "use strict"; } 319 ....................................^ 320 ``` 321 322* `let` is not a *Keyword* or *ReservedWord*. Usually it can be an 323 identifier. It is special at the beginning of a statement or after 324 `for (` or `for await (`. 325 326 ```js 327 var let = [new Date]; // ok: let as identifier 328 let v = let; // ok: let as keyword, then identifier 329 let let; // SyntaxError: banned by special early error rule 330 let.length; // ok: `let .` -> ExpressionStatement 331 let[0].getYear(); // SyntaxError: `let [` -> LexicalDeclaration 332 ``` 333 334 In strict mode code, `let` is reserved. 335 336* `static` is similar. It’s a valid identifier, except in strict 337 mode. It’s only special at the beginning of a *ClassElement*. 338 339 In strict mode code, `static` is reserved. 340 341* `async` is similar, but trickier. It’s an identifier. It is special 342 only if it’s marking an `async` function, method, or arrow function 343 (the tough case, since you won’t know it’s an arrow function until 344 you see the `=>`, possibly much later). 345 346 ```js 347 function async() {} // normal function named "async" 348 349 async(); // ok, `async` is an Identifier; function call 350 async() => {}; // ok, `async` is not an Identifier; async arrow function 351 ``` 352 353* `of` is special only in one specific place in `for-of` loop syntax. 354 355 ```js 356 var of = [1, 2, 3]; 357 for (of of of) console.log(of); // logs 1, 2, 3 358 ``` 359 360 Amazingly, both of the following are valid JS code: 361 362 ```js 363 for (async of => {};;) {} 364 for (async of []) {} 365 ``` 366 367 In the first line, `async` is a keyword and `of` is an identifier; 368 in the second line it's the other way round. 369 370 Even a simplified JS grammar can't be LR(1) as long as it includes 371 the features used here! 372 373* `get` and `set` are special only in a class or an object literal, 374 and then only if followed by a PropertyName: 375 376 ```js 377 var obj1 = {get: f}; // `get` is an identifier 378 var obj2 = {get x() {}}; // `get` means getter 379 380 class C1 { get = 3; } // `get` is an identifier 381 class C2 { get x() {} } // `get` means getter 382 ``` 383 384* `target` is special only in `new.target`. 385 386* `arguments` and `eval` can't be binding names, and can't be assigned 387 to, in strict mode code. 388 389To complicate matters, there are a few grammatical contexts where both 390*IdentifierName* and *Identifier* match. For example, after `var {` 391there are two possibilities: 392 393```js 394// Longhand properties: BindingProperty -> PropertyName -> IdentifierName 395var { xy: v } = obj; // ok 396var { if: v } = obj; // ok, `if` is an IdentifierName 397 398// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier 399var { xy } = obj; // ok 400var { if } = obj; // SyntaxError: `if` is not an Identifier 401``` 402 403 404### Escape sequences in keywords 405 406*(entangled with: conditional keywords, ASI)* 407 408You can use escape sequences to write variable and property names, but 409not keywords (including contextual keywords in contexts where they act 410as keywords). 411 412So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not. 413 414And you don't necessarily know if you're lexing a contextual keyword 415until the next token: `({ g\u0065t: 0 })` is legal, but 416`({ g\u0065t x(){} })` is not. 417 418And for `let` it's even worse: `l\u0065t` by itself is a legal way to 419reference a variable named `let`, which means that 420 421```js 422let 423x 424``` 425declares a variable named `x`, while, thanks to ASI, 426 427```js 428l\u0065t 429x 430``` 431is a reference to a variable named `let` followed by a reference to a 432variable named `x`. 433 434 435### Early errors (**) 436 437*(entangled with: lazy parsing, conditional keywords, ASI)* 438 439Some early errors are basically syntactic. Others are not. 440 441This is entangled with lazy compilation: "early errors" often involve a 442retrospective look at an arbitrarily large glob of code we just parsed, 443but in Beast Mode we’re not building an AST. In fact we would like to be 444doing as little bookkeeping as possible. 445 446Even setting that aside, every early error is a special case, and it’s 447just a ton of rules that all have to be implemented by hand. 448 449Here are some examples of Early Error rules—setting aside restrictions 450that are covered adequately elsewhere: 451 452* Rules about names: 453 454 * Rules that affect the set of keywords (character sequences that 455 match *IdentifierName* but are not allowed as binding names) based 456 on whether or not we’re in strict mode code, or in a 457 *Module*. Affected identifiers include `arguments`, `eval`, `yield`, 458 `await`, `let`, `implements`, `interface`, `package`, `private`, 459 `protected`, `public`, `static`. 460 461 * One of these is a strangely worded rule which prohibits using 462 `yield` as a *BindingIdentifier*. At first blush, this seems 463 like it could be enforced in the grammar, but that approach 464 would make this a valid program, due to ASI: 465 466 ```js 467 let 468 yield 0; 469 ``` 470 471 Enforcing the same rule using an Early Error prohibits ASI here. 472 It works by exploiting the detailed inner workings of ASI case 473 1, and arranging for `0` to be "the offending token" rather than 474 `yield`. 475 476 * Lexical variable names have to be unique within a scope: 477 478 * Lexical variables (`let` and `const`) can’t be declared more 479 than once in a block, or both lexically declared and 480 declared with `var`. 481 482 * Lexically declared variables in a function body can’t have the same 483 name as argument bindings. 484 485 * A lexical variable can’t be named `let`. 486 487 * Common-sense rules dealing with unicode escape sequences in 488 identifiers. 489 490* Common-sense rules about regular expression literals. (They have to 491 actually be valid regular expressions, and unrecognized flags are 492 errors.) 493 494* The number of string parts that a template string can have is 495 limited to 2<sup>32</sup> − 1. 496 497* Invalid Unicode escape sequences, like `\7` or `\09` or `\u{3bjq`, are 498 banned in non-tagged templates (in tagged templates, they are allowed). 499 500* The *SuperCall* syntax is allowed only in derived class 501 constructors. 502 503* `const x;` without an initializer is a Syntax Error. 504 505* A direct substatement of an `if` statement, loop statement, or 506 `with` statement can’t be a labelled `function`. 507 508* Early errors are used to hook up cover grammars. 509 510 * Early errors are also used in one case to avoid having to 511 specify a very large refinement grammar when *ObjectLiteral* 512 almost covers *ObjectAssignmentPattern*: 513 [sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors). 514 515* Early errors are sometimes used to prevent parsers from needing to 516 backtrack too much. 517 518 * When parsing `async ( x = await/a/g )`, you don't know until the 519 next token if this is an async arrow or a call to a function named 520 `async`. This means you can't even tokenize properly, because in 521 the former case the thing following `x =` is two divisions and in 522 the latter case it's an *AwaitExpression* of a regular expression. 523 So an Early Error forbids having `await` in parameters at all, 524 allowing parsers to immediately throw an error if they find 525 themselves in this case. 526 527Many strict mode rules are enforced using Early Errors, but others 528affect runtime semantics. 529 530<!-- 531* I think the rules about assignment targets are related to cover 532 grammars. Not sure. 533 534 * Expressions used in assignment context (i.e. as the operands of `++` 535 and `--`, the left operand of assignment, the loop variable of a 536 `for-in/for-of` loop, or sub-targets in destructuring) must be valid 537 assignment targets. 538 539 * In destructuring assignment, `[...EXPR] = x` is an error if `EXPR` 540 is an array or object destructuring target. 541 542--> 543 544 545### Boolean parameters (**) 546 547Some nonterminals are parameterized. (Search for “parameterized 548production” in [this spec 549section](https://tc39.es/ecma262/#sec-grammar-notation).) 550 551Implemented naively (e.g. by macro expansion) in a parser generator, 552each parameter could nearly double the size of the parser. Instead, the 553parameters must be tracked at run time somehow. 554 555 556### Lookahead restrictions (**) 557 558*(entangled with: restricted productions)* 559 560TODO (I implemented this by hacking the entire LR algorithm. Most every 561part of it is touched, although in ways that seem almost obvious once 562you understand LR inside and out.) 563 564(Note: It may seem like all of the lookahead restrictions in the spec 565are really just a way of saying “this production takes precedence over 566that one”—for example, that the lookahead restriction on 567*ExpressionStatement* just means that other productions for statements 568and declarations take precedence over it. But that isn't accurate; you 569can't have an *ExpressionStatement* that starts with `{`, even if it 570doesn't parse as a *Block* or any other kind of statement.) 571 572 573### Automatic Semicolon Insertion (**) 574 575*(entangled with: restricted productions, slashes)* 576 577Most semicolons at the end of JS statements and declarations “may be 578omitted from the source text in certain situations”. This is called 579[Automatic Semicolon 580Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion), 581or ASI for short. 582 583The specification for this feature is both very-high-level and weirdly 584procedural (“When, as the source text is parsed from left to right, a 585token is encountered...”, as if the specification is telling a story 586about a browser. As far as I know, this is the only place in the spec 587where anything is assumed or implied about the internal implementation 588details of parsing.) But it would be hard to specify ASI any other way. 589 590Wrinkles: 591 5921. Whitespace is significant (including whitespace inside comments). 593 Most semicolons in the grammar are optional only at the end of a 594 line (or before `}`, or at the end of the program). 595 5962. The ending semicolon of a `do`-`while` statement is extra optional. 597 You can always omit it. 598 5993. A few semicolons are never optional, like the semicolons in `for (;;)`. 600 601 This means there’s a semicolon in the grammar that is optionally 602 optional! This one: 603 604 > *LexicalDeclaration* : *LetOrConst* *BindingList* `;` 605 606 It’s usually optional, but not if this is the *LexicalDeclaration* 607 in `for (let i = 0; i < 9; i++)`! 608 6094. Semicolons are not inserted only as a last resort to avoid 610 SyntaxErrors. That turned out to be too error-prone, so there are 611 also *restricted productions* (see below), where semicolons are more 612 aggressively inferred. 613 6145. In implementations, ASI interacts with the ambiguity of *slashes* 615 (see below). 616 617A recursive descent parser implements ASI by calling a special method 618every time it needs to parse a semicolon that might be optional. The 619special method has to peek at the next token and consume it only if it’s 620a semicolon. This would not be so bad if it weren’t for slashes. 621 622In a parser generator, ASI can be implemented using an error recovery 623mechanism. 624 625I think the [error recovery mechanism in 626yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery) 627is too imprecise—when an error happens, it discards states from the 628stack searching for a matching error-handling rule. The manual says 629“Error recovery strategies are necessarily guesses.” 630 631But here’s a slightly more careful error recovery mechanism that could 632do the job: 633 6341. For each production in the ES spec grammar where ASI could happen, e.g. 635 636 ``` 637 ImportDeclaration ::= `import` ModuleSpecifier `;` 638 { import_declaration($2); } 639 ``` 640 641 add an ASI production, like this: 642 643 ``` 644 ImportDeclaration ::= `import` ModuleSpecifier [ERROR] 645 { check_asi(); import_declaration($2); } 646 ``` 647 648 What does this mean? This production can be matched, like any other 649 production, but it's a fallback. All other productions take 650 precedence. 651 6522. While generating the parser, treat `[ERROR]` as a terminal 653 symbol. It can be included in start sets and follow sets, lookahead, 654 and so forth. 655 6563. At run time, when an error happens, synthesize an `[ERROR]` token. 657 Let that bounce through the state machine. It will cause zero or 658 more reductions. Then, it might actually match a production that 659 contains `[ERROR]`, like the ASI production above. 660 661 Otherwise, we’ll get another error—the entry in the parser table for 662 an `[ERROR]` token at this state will be an `error` entry. Then we 663 really have a syntax error. 664 665This solves most of the ASI issues: 666 667* [x] Whitespace sensitivity: That's what `check_asi()` is for. It 668 should signal an error if we're not at the end of a line. 669 670* [x] Special treatment of `do`-`while` loops: Make an error production, 671 but don't `check_asi()`. 672 673* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`: 674 Easy, don't create error productions for those. 675 676 * [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust 677 the grammar, copying *LexicalDeclaration* so that there's a 678 *LexicalDeclarationNoASI* production used only by `for` 679 statements. Not a big deal, as it turns out. 680 681* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the 682 next token, if it starts with `/`. 683 684* [ ] Restricted productions: Not solved. Read on. 685 686 687### Restricted productions (**) 688 689*(entangled with: ASI, slashes)* 690 691Line breaks aren’t allowed in certain places. For example, the following 692is not a valid program: 693 694 throw // SyntaxError 695 new Error(); 696 697For another example, this function contains two statements, not one: 698 699 function f(g) { 700 return // ASI 701 g(); 702 } 703 704The indentation is misleading; actually ASI inserts a semicolon at the 705end of the first line: `return; g();`. (This function always returns 706undefined. The second statement is never reached.) 707 708These restrictions apply even to multiline comments, so the function 709 710```js 711function f(g) { 712 return /* 713 */ g(); 714} 715``` 716contains two statements, just as the previous example did. 717 718I’m not sure why these rules exist, but it’s probably because (back in 719the Netscape days) users complained about the bizarre behavior of 720automatic semicolon insertion, and so some special do-what-I-mean hacks 721were put in. 722 723This is specified with a weird special thing in the grammar: 724 725> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;` 726 727This is called a *restricted production*, and it’s unfortunately 728necessary to go through them one by one, because there are several 729kinds. Note that the particular hack required to parse them in a 730recursive descent parser is a little bit different each time. 731 732* After `continue`, `break`, or `return`, a line break triggers ASI. 733 734 The relevant productions are all statements, and in each case 735 there’s an alternative production that ends immediately with a 736 semicolon: `continue ;` `break ;` and `return ;`. 737 738 Note that the alternative production is *not* restricted: e.g. a 739 *LineTerminator* can appear between `return` and `;`: 740 741 ```js 742 if (x) 743 return // ok 744 ; 745 else 746 f(); 747 ``` 748 749* After `throw`, a line break is a SyntaxError. 750 751* After `yield`, a line break terminates the *YieldExpression*. 752 753 Here the alternative production is simply `yield`, not `yield ;`. 754 755* In a post-increment or post-decrement expression, there can’t be a 756 line break before `++` or `--`. 757 758 The purpose of this rule is subtle. It triggers ASI and thus prevents 759 syntax errors: 760 761 ```js 762 var x = y // ok: semicolon inserted here 763 ++z; 764 ``` 765 766 Without the restricted production, `var x = y ++` would parse 767 successfully, and the “offending token” would be `z`. It would be 768 too late for ASI. 769 770 However, the restriction can of course also *cause* a SyntaxError: 771 772 ```js 773 var x = (y 774 ++); // SyntaxError 775 ``` 776 777As we said, recursive descent parsers can implement these rules with hax. 778 779In a generated parser, there are a few possible ways to implement 780them. Here are three. If you are not interested in ridiculous 781approaches, you can skip the first two. 782 783* Treat every token as actually a different token when it appears 784 after a line break: `TokenType::LeftParen` and 785 `TokenType::LeftParenAfterLineBreak`. Of course the parser 786 generator can treat these exactly the same in normal cases, and 787 automatically generate identical table entries (or whatever) except 788 in states where there’s a relevant restricted production. 789 790* Add a special LineTerminator token. Normally, the lexer skips 791 newlines and never emits this token. However, if the current state 792 has a relevant restricted production, the lexer knows this and emits 793 a LineTerminator for the first line break it sees; and the parser 794 uses that token to trigger an error or transition to another state, 795 as appropriate. 796 797* When in a state that has a relevant restricted production, change 798 states if there’s a line break before the next token. That is, 799 split each such state into two: the one we stay in when there’s not 800 a line break, and the one we jump to if there is a line break. 801 802In all cases it’ll be hard to have confidence that the resulting parser 803generator is really sound. (That is, it might not properly reject all 804ambiguous grammars.) I don’t know exactly what property of the few 805special uses in the ES grammar makes them seem benign. 806 807 808### Slashes (**) 809 810*(entangled with: ASI, restricted productions)* 811 812When you see `/` in a JS program, you don’t know if that’s a 813division operator or the start of a regular expression unless you’ve 814been paying attention up to that point. 815 816[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar) 817 818> There are several situations where the identification of lexical input 819> elements is sensitive to the syntactic grammar context that is 820> consuming the input elements. This requires multiple goal symbols for 821> the lexical grammar. 822 823You might think the lexer could treat `/` as an operator only if the 824previous token is one that can be the last token of an expression (a set 825that includes literals, identifiers, `this`, `)`, `]`, and `}`). To see 826that this does not work, consider: 827 828```js 829{} /x/ // `/` after `}` is regexp 830({} / 2) // `/` after `}` is division 831 832for (g of /(a)(b)/) {} // `/` after `of` is regexp 833var of = 6; of / 2 // `/` after `of` is division 834 835throw /x/; // `/` after `throw` is regexp 836Math.throw / 2; // `/` after `throw` is division 837 838++/x/.lastIndex; // `/` after `++` is regexp 839n++ / 2; // `/` after `++` is division 840``` 841 842So how can the spec be implemented? 843 844In a recursive descent parser, you have to tell the lexer which goal 845symbol to use every time you ask for a token. And you have to make sure, 846if you look ahead at a token, but *don’t* consume it, and fall back on 847another path that can accept a *RegularExpressionLiteral* or 848*DivPunctuator*, that you did not initially lex it incorrectly. We have 849assertions for this and it is a bit of a nightmare when we have to touch 850it (which is thankfully rare). Part of the problem is that the token 851you’re peeking ahead at might not be part of the same production at all. 852Thanks to ASI, it might be the start of the next statement, which will 853be parsed in a faraway part of the Parser. 854 855A table-driven parser has it easy here! The lexer can consult the state 856table and see which kind of token can be accepted in the current 857state. This is closer to what the spec actually says. 858 859Two minor things to watch out for: 860 861* The nonterminal *ClassTail* is used both at the end of 862 *ClassExpression*, which may be followed by `/`; and at the end of 863 *ClassDeclaration*, which may be followed by a 864 *RegularExpressionLiteral* at the start of the next 865 statement. Canonical LR creates separate states for these two uses 866 of *ClassTail*, but the LALR algorithm will unify them, creating 867 some states that have both `/` and *RegularExpressionLiteral* in the 868 follow set. In these states, determining which terminal is actually 869 allowed requires looking not only at the current state, but at the 870 current stack of states (to see one level of grammatical context). 871 872* Since this decision depends on the parser state, and automatic 873 semicolon insertion adjusts the parser state, a parser may need to 874 re-scan a token after ASI. 875 876In other kinds of generated parsers, at least the lexical goal symbol 877can be determined automatically. 878 879 880### Lazy compilation and scoping (**) 881 882*(entangled with: arrow functions)* 883 884JS engines *lazily compile* function bodies. During parsing, when the 885engine sees a `function`, it switches to a high-speed parsing mode 886(which I will call “Beast Mode”) that just skims the function and checks 887for syntax errors. Beast Mode does not compile the code. Beast Mode 888doesn’t even create AST nodes. All that will be done later, on demand, 889the first time the function is called. 890 891The point is to get through parsing *fast*, so that the script can start 892running. In browsers, `<script>` compilation usually must finish before 893we can show the user any meaningful content. Any part of it that can be 894deferred must be deferred. (Bonus: many functions are never called, so 895the work is not just deferred but avoided altogether.) 896 897Later, when a function is called, we can still defer compilation of 898nested functions inside it. 899 900So what? Seems easy enough, right? Well... 901 902Local variables in JS are optimized. To generate reasonable code for the 903script or function we are compiling, we need to know which of its local 904variables are used by nested functions, which we are *not* currently 905compiling. That is, we must know which variables *occur free in* each 906nested function. So scoping, which otherwise could be done as a separate 907phase of compilation, must be done during parsing in Beast Mode. 908 909Getting the scoping of arrow functions right while parsing is tricky, 910because it’s often not possible to know when you’re entering a scope 911until later. Consider parsing `(a`. This could be the beginning of an 912arrow function, or not; we might not know until after we reach the 913matching `)`, which could be a long way away. 914 915Annex B.3.3 adds extremely complex rules for scoping for function 916declarations which makes this especially difficult. In 917 918```js 919function f() { 920 let x = 1; 921 return function g() { 922 { 923 function x(){} 924 } 925 { 926 let x; 927 } 928 return x; 929 }; 930} 931``` 932 933the function `g` does not use the initial `let x`, but in 934 935```js 936function f() { 937 let x = 1; 938 return function g() { 939 { 940 { 941 function x(){} 942 } 943 let x; 944 } 945 return x; 946 }; 947} 948``` 949it does. 950 951[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go) 952is a good writeup of what's going on in these examples. 953 954 955### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*) 956 957*(entangled with: lazy compilation)* 958 959Suppose the parser is scanning text from start to end, when it sees a 960statement that starts with something like this: 961 962```js 963let f = (a 964``` 965 966The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a 967*ParenthesizedExpression*. Either is possible. We’ll know when either 968(1) we see something that rules out the other case: 969 970```js 971let f = (a + b // can't be ArrowParameters 972 973let f = (a, b, ...args // can't be ParenthesizedExpression 974``` 975 976or (2) we reach the matching `)` and see if the next token is `=>`. 977 978Probably (1) is a pain to implement. 979 980To keep the language nominally LR(1), the standard specifies a “cover 981grammar”, a nonterminal named 982*CoverParenthesizedExpressionAndArrowParameterList* that is a superset 983of both *ArrowParameters* and *ParenthesizedExpression*. The spec 984grammar uses the *Cover* nonterminal in both places, so technically `let 985f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically 986valid *Script*s, according to the formal grammar. But there are Early 987Error rules (that is, plain English text, not part of the formal 988grammar) that say arrow functions’ parameters must match 989*ArrowParameters*, and parenthesized expressions must match 990*ParenthesizedExpression*. 991 992So after the initial parse, the implementation must somehow check that 993the *CoverParenthesizedExpressionAndArrowParameterList* really is valid 994in context. This complicates lazy compilation because in Beast Mode we 995are not even building an AST. It’s not easy to go back and check. 996 997Something similar happens in a few other cases: the spec is written as 998though syntactic rules are applied after the fact: 999 1000* *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)` 1001 which looks like a function call and also looks like the beginning 1002 of an async arrow function. 1003 1004* *ObjectLiteral* is a cover grammar; it covers both actual object 1005 literals and the syntax `propertyName = expr`, which is not valid in 1006 object literals but allowed in destructuring assignment: 1007 1008 ```js 1009 var obj = {x = 1}; // SyntaxError, by an early error rule 1010 1011 ({x = 1} = {}); // ok, assigns 1 to x 1012 ``` 1013 1014* *LeftHandSideExpression* is used to the left of `=` in assignment, 1015 and as the operand of postfix `++` and `--`. But this is way too lax; 1016 most expressions shouldn’t be assigned to: 1017 1018 ```js 1019 1 = 0; // matches the formal grammar, SyntaxError by an early error rule 1020 1021 null++; // same 1022 1023 ["a", "b", "c"]++; // same 1024 ``` 1025 1026 1027## Conclusion 1028 1029What have we learned today? 1030 1031* Do not write a JS parser. 1032 1033* JavaScript has some syntactic horrors in it. But hey, you don't 1034 make the world's most widely used programming language by avoiding 1035 all mistakes. You do it by shipping a serviceable tool, in the right 1036 circumstances, for the right users. 1037