1## JS syntactic quirks
2
3> *To make a labyrinth, it takes*
4> *Some good intentions, some mistakes.*
5> —A. E. Stallings, “Daedal”
6
7JavaScript is rather hard to parse. Here is an in-depth accounting of
8its syntactic quirks, with an eye toward actually implementing a parser
9from scratch.
10
11With apologies to the generous people who work on the standard. Thanks
12for doing that—better you than me.
13
14Thanks to [@bakkot](https://github.com/bakkot) and
15[@mathiasbynens](https://github.com/mathiasbynens) for pointing out
16several additional quirks.
17
18Problems are rated in terms of difficulty, from `(*)` = easy to `(***)`
19= hard. We’ll start with the easiest problems.
20
21
22### Dangling else (*)
23
24If you know what this is, you may be excused.
25
26Statements like `if (EXPR) STMT if (EXPR) STMT else STMT`
27are straight-up ambiguous in the JS formal grammar.
28The ambiguity is resolved with
29[a line of specification text](https://tc39.es/ecma262/#sec-if-statement):
30
31> Each `else` for which the choice of associated `if` is ambiguous shall
32> be associated with the nearest possible `if` that would otherwise have
33> no corresponding `else`.
34
35I love this sentence. Something about it cracks me up, I dunno...
36
37In a recursive descent parser, just doing the dumbest possible thing
38correctly implements this rule.
39
40A parser generator has to decide what to do. In Yacc, you can use
41operator precedence for this.
42
43Yacc aside: This should seem a little outrageous at first, as `else` is
44hardly an operator. It helps if you understand what Yacc is doing. In LR
45parsers, this kind of ambiguity in the grammar manifests as a
46shift-reduce conflict. In this case, when we’ve already parsed `if (
47Expression ) Statement if ( Expression ) Statement`
48and are looking at `else`, it’s unclear to Yacc
49whether to reduce the if-statement or shift `else`. Yacc does not offer
50a feature that lets us just say "always shift `else` here"; but there
51*is* a Yacc feature that lets us resolve shift-reduce conflicts in a
52rather odd, indirect way: operator precedence.  We can resolve this
53conflict by making `else` higher-precedence than the preceding symbol
54`)`.
55
56Alternatively, I believe it’s equivalent to add "[lookahead ≠ `else`]"
57at the end of the IfStatement production that doesn’t have an `else`.
58
59
60### Other ambiguities and informal parts of the spec (*)
61
62Not all of the spec is as formal as it seems at first. Most of the stuff
63in this section is easy to deal with, but #4 is special.
64
651.  The lexical grammar is ambiguous: when looking at the characters `<<=`,
66    there is the question of whether to parse that as one token `<<=`, two
67    tokens (`< <=` or `<< =`), or three (`< < =`).
68
69    Of course every programming language has this, and the fix is one
70    sentence of prose in the spec:
71
72    > The source text is scanned from left to right, repeatedly taking the
73    > longest possible sequence of code points as the next input element.
74
75    This is easy enough for hand-coded lexers, and for systems that are
76    designed to use separate lexical and syntactic grammars.  (Other
77    parser generators may need help to avoid parsing `functionf(){}` as
78    a function.)
79
802.  The above line of prose does not apply *within* input elements, in
81    components of the lexical grammar. In those cases, the same basic
82    idea ("maximum munch") is specified using lookahead restrictions at
83    the end of productions:
84
85    > *LineTerminatorSequence* ::
86    > &nbsp; &nbsp; &lt;LF&gt;
87    > &nbsp; &nbsp; &lt;CR&gt;[lookahead &ne; &lt;LF&gt;]
88    > &nbsp; &nbsp; &lt;LS&gt;
89    > &nbsp; &nbsp; &lt;PS&gt;
90    > &nbsp; &nbsp; &lt;CR&gt;&lt;LF&gt;
91
92    The lookahead restriction prevents a CR LF sequence from being
93    parsed as two adjacent *LineTerminatorSequence*s.
94
95    This technique is used in several places, particularly in
96    [*NotEscapeSequences*](https://tc39.es/ecma262/#prod-NotEscapeSequence).
97
983.  Annex B.1.4 extends the syntax for regular expressions, making the
99    grammar ambiguous. Again, a line of prose explains how to cope:
100
101    > These changes introduce ambiguities that are broken by the
102    > ordering of grammar productions and by contextual
103    > information. When parsing using the following grammar, each
104    > alternative is considered only if previous production alternatives
105    > do not match.
106
1074.  Annex B.1.2 extends the syntax of string literals to allow legacy
108    octal escape sequences, like `\033`. It says:
109
110    > The syntax and semantics of 11.8.4 is extended as follows except
111    > that this extension is not allowed for strict mode code:
112
113    ...followed by a new definition of *EscapeSequence*.
114
115    So there are two sets of productions for *EscapeSequence*, and an
116    implementation is required to implement both and dynamically switch
117    between them.
118
119    This means that `function f() { "\033"; "use strict"; }` is a
120    SyntaxError, even though the octal escape is scanned before we know
121    we're in strict mode.
122
123For another ambiguity, see "Slashes" below.
124
125
126### Unicode quirks
127
128JavaScript source is Unicode and usually follows Unicode rules for thing
129like identifiers and whitespace, but it has a few special cases: `$`,
130`_`, `U+200C ZERO WIDTH NON-JOINER`, and `U+200D ZERO WIDTH JOINER` are
131legal in identifiers (the latter two only after the first character), and
132`U+FEFF ZERO WIDTH NO-BREAK SPACE` (also known as the byte-order mark) is
133treated as whitespace.
134
135It also allows any code point, including surrogate halves, even though the
136Unicode standard says that unpaired surrogate halves should be treated as
137encoding errors.
138
139
140### Legacy octal literals and escape sequences (*)
141
142This is more funny than difficult.
143
144In a browser, in non-strict code, every sequence of decimal digits (not
145followed by an identifier character) is a *NumericLiteral* token.
146
147If it starts with `0`, with more digits after, then it's a legacy Annex
148B.1.1 literal. If the token contains an `8` or a `9`, it's a decimal
149number. Otherwise, hilariously, it's octal.
150
151```
152js> [067, 068, 069, 070]
153[55, 68, 69, 56]
154```
155
156There are also legacy octal escape sequences in strings, and these have
157their own quirks. `'\07' === '\u{7}'`, but `'\08' !== '\u{8}'` since 8
158is not an octal digit. Instead `'\08' === '\0' + '8'`, because `\0`
159followed by `8` or `9` is a legacy octal escape sequence representing
160the null character. (Not to be confused with `\0` in strict code, not
161followed by a digit, which still represents the null character, but
162doesn't count as octal.)
163
164None of this is hard to implement, but figuring out what the spec says
165is hard.
166
167
168### Strict mode (*)
169
170*(entangled with: lazy compilation)*
171
172A script or function can start with this:
173
174```js
175"use strict";
176```
177
178This enables ["strict mode"](https://tc39.es/ecma262/#sec-strict-mode-of-ecmascript).
179Additionally, all classes and modules are strict mode code.
180
181Strict mode has both parse-time and run-time effects. Parse-time effects
182include:
183
184*   Strict mode affects the lexical grammar: octal integer literals are
185    SyntaxErrors, octal character escapes are SyntaxErrors, and a
186    handful of words like `private` and `interface` are reserved (and
187    thus usually SyntaxErrors) in strict mode.
188
189    Like the situation with slashes, this means it is not possible to
190    implement a complete lexer for JS without also parsing—at least
191    enough to detect class boundaries, "use strict" directives in
192    functions, and function boundaries.
193
194*   It’s a SyntaxError to have bindings named `eval` or `arguments` in
195    strict mode code, or to assign to `eval` or `arguments`.
196
197*   It’s a SyntaxError to have two argument bindings with the same name
198    in a strict function.
199
200    Interestingly, you don’t always know if you’re in strict mode or not
201    when parsing arguments.
202
203    ```js
204    function foo(a, a) {
205        "use strict";
206    }
207    ```
208
209    When the implementation reaches the Use Strict Directive, it must
210    either know that `foo` has two arguments both named `a`, or switch
211    to strict mode, go back, and reparse the function from the
212    beginning.
213
214    Fortunately an Early Error rule prohibits mixing `"use strict"` with
215    more complex parameter lists, like `function foo(x = eval('')) {`.
216
217*   The expression syntax “`delete` *Identifier*” and the abominable
218    *WithStatement* are banned in strict mode.
219
220
221### Conditional keywords (**)
222
223In some programming languages, you could write a lexer that has rules
224like
225
226*   When you see `if`, return `Token::If`.
227
228*   When you see something like `apple` or `arrow` or `target`,
229    return `Token::Identifier`.
230
231Not so in JavaScript. The input `if` matches both the terminal `if` and
232the nonterminal *IdentifierName*, both of which appear in the high-level
233grammar. The same goes for `target`.
234
235This poses a deceptively difficult problem for table-driven parsers.
236Such parsers run on a stream of token-ids, but the question of which
237token-id to use for a word like `if` or `target` is ambiguous. The
238current parser state can't fully resolve the ambiguity: there are cases
239like `class C { get` where the token `get` might match either as a
240keyword (the start of a getter) or as an *IdentifierName* (a method or
241property named `get`) in different grammatical productions.
242
243All keywords are conditional, but some are more conditional than others.
244The rules are inconsistent to a tragicomic extent. Keywords like `if`
245that date back to JavaScript 1.0 are always keywords except when used as
246property names or method names. They can't be variable names. Two
247conditional keywords (`await` and `yield`) are in the *Keyword* list;
248the rest are not. New syntax that happened to be introduced around the
249same time as strict mode was awarded keyword status in strict mode. The
250rules are scattered through the spec. All this interacts with `\u0065`
251Unicode escape sequences somehow. It’s just unbelievably confusing.
252
253(After writing this section, I
254[proposed revisions to the specification](https://github.com/tc39/ecma262/pull/1694)
255to make it a little less confusing.)
256
257*   Thirty-six words are always reserved:
258
259    > `break` `case` `catch` `class` `const` `continue` `debugger`
260    > `default` `delete` `do` `else` `enum` `export` `extends` `false`
261    > `finally` `for` `function` `if` `import` `in` `instanceof` `new`
262    > `null` `return` `super` `switch` `this` `throw` `true` `try`
263    > `typeof` `var` `void` `while` `with`
264
265    These tokens can't be used as names of variables or arguments.
266    They're always considered special *except* when used as property
267    names, method names, or import/export names in modules.
268
269    ```js
270    // property names
271    let obj = {if: 3, function: 4};
272    assert(obj.if == 3);
273
274    // method names
275    class C {
276      if() {}
277      function() {}
278    }
279
280    // imports and exports
281    import {if as my_if} from "modulename";
282    export {my_if as if};
283    ```
284
285*   Two more words, `yield` and `await`, are in the *Keyword* list but
286    do not always act like keywords in practice.
287
288    *   `yield` is a *Keyword*; but it can be used as an identifier,
289        except in generators and strict mode code.
290
291        This means that `yield - 1` is valid both inside and outside
292        generators, with different meanings. Outside a generator, it’s
293        subtraction. Inside, it yields the value **-1**.
294
295        That reminds me of the Groucho Marx line: Outside of a dog, a
296        book is a man’s best friend. Inside of a dog it’s too dark to
297        read.
298
299    *   `await` is like that, but in async functions. Also it’s not a
300        valid identifier in modules.
301
302    Conditional keywords are entangled with slashes: `yield /a/g` is two
303    tokens in a generator but five tokens elsewhere.
304
305*   In strict mode code, `implements`, `interface`, `package`,
306    `private`, `protected`, and `public` are reserved (via Early Errors
307    rules).
308
309    This is reflected in the message and location information for
310    certain syntax errors:
311
312    ```
313    SyntaxError: implements is a reserved identifier:
314    class implements {}
315    ......^
316
317    SyntaxError: implements is a reserved identifier:
318    function implements() { "use strict"; }
319    ....................................^
320    ```
321
322*   `let` is not a *Keyword* or *ReservedWord*. Usually it can be an
323    identifier. It is special at the beginning of a statement or after
324    `for (` or `for await (`.
325
326    ```js
327    var let = [new Date];   // ok: let as identifier
328    let v = let;            // ok: let as keyword, then identifier
329    let let;           // SyntaxError: banned by special early error rule
330    let.length;        // ok: `let .` -> ExpressionStatement
331    let[0].getYear();  // SyntaxError: `let [` -> LexicalDeclaration
332    ```
333
334    In strict mode code, `let` is reserved.
335
336*   `static` is similar. It’s a valid identifier, except in strict
337    mode. It’s only special at the beginning of a *ClassElement*.
338
339    In strict mode code, `static` is reserved.
340
341*   `async` is similar, but trickier. It’s an identifier. It is special
342    only if it’s marking an `async` function, method, or arrow function
343    (the tough case, since you won’t know it’s an arrow function until
344    you see the `=>`, possibly much later).
345
346    ```js
347    function async() {}   // normal function named "async"
348
349    async();        // ok, `async` is an Identifier; function call
350    async() => {};  // ok, `async` is not an Identifier; async arrow function
351    ```
352
353*   `of` is special only in one specific place in `for-of` loop syntax.
354
355    ```js
356    var of = [1, 2, 3];
357    for (of of of) console.log(of);  // logs 1, 2, 3
358    ```
359
360    Amazingly, both of the following are valid JS code:
361
362    ```js
363    for (async of => {};;) {}
364    for (async of []) {}
365    ```
366
367    In the first line, `async` is a keyword and `of` is an identifier;
368    in the second line it's the other way round.
369
370    Even a simplified JS grammar can't be LR(1) as long as it includes
371    the features used here!
372
373*   `get` and `set` are special only in a class or an object literal,
374    and then only if followed by a PropertyName:
375
376    ```js
377    var obj1 = {get: f};      // `get` is an identifier
378    var obj2 = {get x() {}};  // `get` means getter
379
380    class C1 { get = 3; }     // `get` is an identifier
381    class C2 { get x() {} }   // `get` means getter
382    ```
383
384*   `target` is special only in `new.target`.
385
386*   `arguments` and `eval` can't be binding names, and can't be assigned
387    to, in strict mode code.
388
389To complicate matters, there are a few grammatical contexts where both
390*IdentifierName* and *Identifier* match. For example, after `var {`
391there are two possibilities:
392
393```js
394// Longhand properties: BindingProperty -> PropertyName -> IdentifierName
395var { xy: v } = obj;    // ok
396var { if: v } = obj;    // ok, `if` is an IdentifierName
397
398// Shorthand properties: BindingProperty -> SingleNameBinding -> BindingIdentifier -> Identifier
399var { xy } = obj;       // ok
400var { if } = obj;       // SyntaxError: `if` is not an Identifier
401```
402
403
404### Escape sequences in keywords
405
406*(entangled with: conditional keywords, ASI)*
407
408You can use escape sequences to write variable and property names, but
409not keywords (including contextual keywords in contexts where they act
410as keywords).
411
412So `if (foo) {}` and `{ i\u0066: 0 }` are legal but `i\u0066 (foo)` is not.
413
414And you don't necessarily know if you're lexing a contextual keyword
415until the next token: `({ g\u0065t: 0 })` is legal, but
416`({ g\u0065t x(){} })` is not.
417
418And for `let` it's even worse: `l\u0065t` by itself is a legal way to
419reference a variable named `let`, which means that
420
421```js
422let
423x
424```
425declares a variable named `x`, while, thanks to ASI,
426
427```js
428l\u0065t
429x
430```
431is a reference to a variable named `let` followed by a reference to a
432variable named `x`.
433
434
435### Early errors (**)
436
437*(entangled with: lazy parsing, conditional keywords, ASI)*
438
439Some early errors are basically syntactic. Others are not.
440
441This is entangled with lazy compilation: "early errors" often involve a
442retrospective look at an arbitrarily large glob of code we just parsed,
443but in Beast Mode we’re not building an AST. In fact we would like to be
444doing as little bookkeeping as possible.
445
446Even setting that aside, every early error is a special case, and it’s
447just a ton of rules that all have to be implemented by hand.
448
449Here are some examples of Early Error rules—setting aside restrictions
450that are covered adequately elsewhere:
451
452*   Rules about names:
453
454    *   Rules that affect the set of keywords (character sequences that
455        match *IdentifierName* but are not allowed as binding names) based
456        on whether or not we’re in strict mode code, or in a
457        *Module*. Affected identifiers include `arguments`, `eval`, `yield`,
458        `await`, `let`, `implements`, `interface`, `package`, `private`,
459        `protected`, `public`, `static`.
460
461        *   One of these is a strangely worded rule which prohibits using
462            `yield` as a *BindingIdentifier*.  At first blush, this seems
463            like it could be enforced in the grammar, but that approach
464            would make this a valid program, due to ASI:
465
466            ```js
467            let
468            yield 0;
469            ```
470
471            Enforcing the same rule using an Early Error prohibits ASI here.
472            It works by exploiting the detailed inner workings of ASI case
473            1, and arranging for `0` to be "the offending token" rather than
474            `yield`.
475
476    *   Lexical variable names have to be unique within a scope:
477
478        *   Lexical variables (`let` and `const`) can’t be declared more
479            than once in a block, or both lexically declared and
480            declared with `var`.
481
482        *   Lexically declared variables in a function body can’t have the same
483            name as argument bindings.
484
485    *   A lexical variable can’t be named `let`.
486
487    *   Common-sense rules dealing with unicode escape sequences in
488        identifiers.
489
490*   Common-sense rules about regular expression literals. (They have to
491    actually be valid regular expressions, and unrecognized flags are
492    errors.)
493
494*   The number of string parts that a template string can have is
495    limited to 2<sup>32</sup> &minus; 1.
496
497*   Invalid Unicode escape sequences, like `\7` or `\09` or `\u{3bjq`, are
498    banned in non-tagged templates (in tagged templates, they are allowed).
499
500*   The *SuperCall* syntax is allowed only in derived class
501    constructors.
502
503*   `const x;` without an initializer is a Syntax Error.
504
505*   A direct substatement of an `if` statement, loop statement, or
506    `with` statement can’t be a labelled `function`.
507
508*   Early errors are used to hook up cover grammars.
509
510    *   Early errors are also used in one case to avoid having to
511        specify a very large refinement grammar when *ObjectLiteral*
512        almost covers *ObjectAssignmentPattern*:
513        [sorry, too complicated to explain](https://tc39.es/ecma262/#sec-object-initializer-static-semantics-early-errors).
514
515*   Early errors are sometimes used to prevent parsers from needing to
516    backtrack too much.
517
518    *   When parsing `async ( x = await/a/g )`, you don't know until the
519        next token if this is an async arrow or a call to a function named
520        `async`. This means you can't even tokenize properly, because in
521        the former case the thing following `x =` is two divisions and in
522        the latter case it's an *AwaitExpression* of a regular expression.
523        So an Early Error forbids having `await` in parameters at all,
524        allowing parsers to immediately throw an error if they find
525        themselves in this case.
526
527Many strict mode rules are enforced using Early Errors, but others
528affect runtime semantics.
529
530<!--
531*   I think the rules about assignment targets are related to cover
532    grammars. Not sure.
533
534    *   Expressions used in assignment context (i.e. as the operands of `++`
535        and `--`, the left operand of assignment, the loop variable of a
536        `for-in/for-of` loop, or sub-targets in destructuring) must be valid
537        assignment targets.
538
539    *   In destructuring assignment, `[...EXPR] = x` is an error if `EXPR`
540        is an array or object destructuring target.
541
542-->
543
544
545### Boolean parameters (**)
546
547Some nonterminals are parameterized. (Search for “parameterized
548production” in [this spec
549section](https://tc39.es/ecma262/#sec-grammar-notation).)
550
551Implemented naively (e.g. by macro expansion) in a parser generator,
552each parameter could nearly double the size of the parser. Instead, the
553parameters must be tracked at run time somehow.
554
555
556### Lookahead restrictions (**)
557
558*(entangled with: restricted productions)*
559
560TODO (I implemented this by hacking the entire LR algorithm. Most every
561part of it is touched, although in ways that seem almost obvious once
562you understand LR inside and out.)
563
564(Note: It may seem like all of the lookahead restrictions in the spec
565are really just a way of saying “this production takes precedence over
566that one”—for example, that the lookahead restriction on
567*ExpressionStatement* just means that other productions for statements
568and declarations take precedence over it. But that isn't accurate; you
569can't have an *ExpressionStatement* that starts with `{`, even if it
570doesn't parse as a *Block* or any other kind of statement.)
571
572
573### Automatic Semicolon Insertion (**)
574
575*(entangled with: restricted productions, slashes)*
576
577Most semicolons at the end of JS statements and declarations “may be
578omitted from the source text in certain situations”.  This is called
579[Automatic Semicolon
580Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion),
581or ASI for short.
582
583The specification for this feature is both very-high-level and weirdly
584procedural (“When, as the source text is parsed from left to right, a
585token is encountered...”, as if the specification is telling a story
586about a browser. As far as I know, this is the only place in the spec
587where anything is assumed or implied about the internal implementation
588details of parsing.) But it would be hard to specify ASI any other way.
589
590Wrinkles:
591
5921.  Whitespace is significant (including whitespace inside comments).
593    Most semicolons in the grammar are optional only at the end of a
594    line (or before `}`, or at the end of the program).
595
5962.  The ending semicolon of a `do`-`while` statement is extra optional.
597    You can always omit it.
598
5993.  A few semicolons are never optional, like the semicolons in `for (;;)`.
600
601    This means there’s a semicolon in the grammar that is optionally
602    optional! This one:
603
604    > *LexicalDeclaration* : *LetOrConst* *BindingList* `;`
605
606    It’s usually optional, but not if this is the *LexicalDeclaration*
607    in `for (let i = 0; i < 9; i++)`!
608
6094.  Semicolons are not inserted only as a last resort to avoid
610    SyntaxErrors. That turned out to be too error-prone, so there are
611    also *restricted productions* (see below), where semicolons are more
612    aggressively inferred.
613
6145.  In implementations, ASI interacts with the ambiguity of *slashes*
615    (see below).
616
617A recursive descent parser implements ASI by calling a special method
618every time it needs to parse a semicolon that might be optional.  The
619special method has to peek at the next token and consume it only if it’s
620a semicolon. This would not be so bad if it weren’t for slashes.
621
622In a parser generator, ASI can be implemented using an error recovery
623mechanism.
624
625I think the [error recovery mechanism in
626yacc/Bison](https://www.gnu.org/software/bison/manual/bison.html#Error-Recovery)
627is too imprecise—when an error happens, it discards states from the
628stack searching for a matching error-handling rule. The manual says
629“Error recovery strategies are necessarily guesses.”
630
631But here’s a slightly more careful error recovery mechanism that could
632do the job:
633
6341.  For each production in the ES spec grammar where ASI could happen, e.g.
635
636    ```
637    ImportDeclaration ::= `import` ModuleSpecifier `;`
638                                                    { import_declaration($2); }
639    ```
640
641    add an ASI production, like this:
642
643    ```
644    ImportDeclaration ::= `import` ModuleSpecifier [ERROR]
645                                       { check_asi(); import_declaration($2); }
646    ```
647
648    What does this mean? This production can be matched, like any other
649    production, but it's a fallback. All other productions take
650    precedence.
651
6522.  While generating the parser, treat `[ERROR]` as a terminal
653    symbol. It can be included in start sets and follow sets, lookahead,
654    and so forth.
655
6563.  At run time, when an error happens, synthesize an `[ERROR]` token.
657    Let that bounce through the state machine. It will cause zero or
658    more reductions. Then, it might actually match a production that
659    contains `[ERROR]`, like the ASI production above.
660
661    Otherwise, we’ll get another error—the entry in the parser table for
662    an `[ERROR]` token at this state will be an `error` entry. Then we
663    really have a syntax error.
664
665This solves most of the ASI issues:
666
667* [x] Whitespace sensitivity: That's what `check_asi()` is for. It
668    should signal an error if we're not at the end of a line.
669
670* [x] Special treatment of `do`-`while` loops: Make an error production,
671    but don't `check_asi()`.
672
673* [x] Rule banning ASI in *EmptyStatement* or `for(;;)`:
674    Easy, don't create error productions for those.
675
676    * [x] Banning ASI in `for (let x=1 \n x<9; x++)`: Manually adjust
677        the grammar, copying *LexicalDeclaration* so that there's a
678        *LexicalDeclarationNoASI* production used only by `for`
679        statements. Not a big deal, as it turns out.
680
681* [x] Slashes: Perhaps have `check_asi` reset the lexer to rescan the
682    next token, if it starts with `/`.
683
684* [ ] Restricted productions: Not solved. Read on.
685
686
687### Restricted productions (**)
688
689*(entangled with: ASI, slashes)*
690
691Line breaks aren’t allowed in certain places. For example, the following
692is not a valid program:
693
694    throw                 // SyntaxError
695      new Error();
696
697For another example, this function contains two statements, not one:
698
699    function f(g) {
700      return              // ASI
701        g();
702    }
703
704The indentation is misleading; actually ASI inserts a semicolon at the
705end of the first line: `return; g();`. (This function always returns
706undefined. The second statement is never reached.)
707
708These restrictions apply even to multiline comments, so the function
709
710```js
711function f(g) {
712  return /*
713    */ g();
714}
715```
716contains two statements, just as the previous example did.
717
718I’m not sure why these rules exist, but it’s probably because (back in
719the Netscape days) users complained about the bizarre behavior of
720automatic semicolon insertion, and so some special do-what-I-mean hacks
721were put in.
722
723This is specified with a weird special thing in the grammar:
724
725> *ReturnStatement* : `return` [no *LineTerminator* here] *Expression* `;`
726
727This is called a *restricted production*, and it’s unfortunately
728necessary to go through them one by one, because there are several
729kinds. Note that the particular hack required to parse them in a
730recursive descent parser is a little bit different each time.
731
732*   After `continue`, `break`, or `return`, a line break triggers ASI.
733
734    The relevant productions are all statements, and in each case
735    there’s an alternative production that ends immediately with a
736    semicolon: `continue ;` `break ;` and `return ;`.
737
738    Note that the alternative production is *not* restricted: e.g. a
739    *LineTerminator* can appear between `return` and `;`:
740
741    ```js
742    if (x)
743      return        // ok
744        ;
745    else
746      f();
747    ```
748
749*   After `throw`, a line break is a SyntaxError.
750
751*   After `yield`, a line break terminates the *YieldExpression*.
752
753    Here the alternative production is simply `yield`, not `yield ;`.
754
755*   In a post-increment or post-decrement expression, there can’t be a
756    line break before `++` or `--`.
757
758    The purpose of this rule is subtle. It triggers ASI and thus prevents
759    syntax errors:
760
761    ```js
762    var x = y       // ok: semicolon inserted here
763    ++z;
764    ```
765
766    Without the restricted production, `var x = y ++` would parse
767    successfully, and the “offending token” would be `z`. It would be
768    too late for ASI.
769
770    However, the restriction can of course also *cause* a SyntaxError:
771
772    ```js
773    var x = (y
774             ++);   // SyntaxError
775    ```
776
777As we said, recursive descent parsers can implement these rules with hax.
778
779In a generated parser, there are a few possible ways to implement
780them. Here are three. If you are not interested in ridiculous
781approaches, you can skip the first two.
782
783*   Treat every token as actually a different token when it appears
784    after a line break: `TokenType::LeftParen` and
785    `TokenType::LeftParenAfterLineBreak`.  Of course the parser
786    generator can treat these exactly the same in normal cases, and
787    automatically generate identical table entries (or whatever) except
788    in states where there’s a relevant restricted production.
789
790*   Add a special LineTerminator token. Normally, the lexer skips
791    newlines and never emits this token. However, if the current state
792    has a relevant restricted production, the lexer knows this and emits
793    a LineTerminator for the first line break it sees; and the parser
794    uses that token to trigger an error or transition to another state,
795    as appropriate.
796
797*   When in a state that has a relevant restricted production, change
798    states if there’s a line break before the next token.  That is,
799    split each such state into two: the one we stay in when there’s not
800    a line break, and the one we jump to if there is a line break.
801
802In all cases it’ll be hard to have confidence that the resulting parser
803generator is really sound. (That is, it might not properly reject all
804ambiguous grammars.) I don’t know exactly what property of the few
805special uses in the ES grammar makes them seem benign.
806
807
808### Slashes (**)
809
810*(entangled with: ASI, restricted productions)*
811
812When you see `/` in a JS program, you don’t know if that’s a
813division operator or the start of a regular expression unless you’ve
814been paying attention up to that point.
815
816[The spec:](https://tc39.es/ecma262/#sec-ecmascript-language-lexical-grammar)
817
818> There are several situations where the identification of lexical input
819> elements is sensitive to the syntactic grammar context that is
820> consuming the input elements. This requires multiple goal symbols for
821> the lexical grammar.
822
823You might think the lexer could treat `/` as an operator only if the
824previous token is one that can be the last token of an expression (a set
825that includes literals, identifiers, `this`, `)`, `]`, and `}`).  To see
826that this does not work, consider:
827
828```js
829{} /x/                  // `/` after `}` is regexp
830({} / 2)                // `/` after `}` is division
831
832for (g of /(a)(b)/) {}  // `/` after `of` is regexp
833var of = 6; of / 2      // `/` after `of` is division
834
835throw /x/;              // `/` after `throw` is regexp
836Math.throw / 2;         // `/` after `throw` is division
837
838++/x/.lastIndex;        // `/` after `++` is regexp
839n++ / 2;                // `/` after `++` is division
840```
841
842So how can the spec be implemented?
843
844In a recursive descent parser, you have to tell the lexer which goal
845symbol to use every time you ask for a token. And you have to make sure,
846if you look ahead at a token, but *don’t* consume it, and fall back on
847another path that can accept a *RegularExpressionLiteral* or
848*DivPunctuator*, that you did not initially lex it incorrectly. We have
849assertions for this and it is a bit of a nightmare when we have to touch
850it (which is thankfully rare). Part of the problem is that the token
851you’re peeking ahead at might not be part of the same production at all.
852Thanks to ASI, it might be the start of the next statement, which will
853be parsed in a faraway part of the Parser.
854
855A table-driven parser has it easy here! The lexer can consult the state
856table and see which kind of token can be accepted in the current
857state. This is closer to what the spec actually says.
858
859Two minor things to watch out for:
860
861*   The nonterminal *ClassTail* is used both at the end of
862    *ClassExpression*, which may be followed by `/`; and at the end of
863    *ClassDeclaration*, which may be followed by a
864    *RegularExpressionLiteral* at the start of the next
865    statement. Canonical LR creates separate states for these two uses
866    of *ClassTail*, but the LALR algorithm will unify them, creating
867    some states that have both `/` and *RegularExpressionLiteral* in the
868    follow set.  In these states, determining which terminal is actually
869    allowed requires looking not only at the current state, but at the
870    current stack of states (to see one level of grammatical context).
871
872*   Since this decision depends on the parser state, and automatic
873    semicolon insertion adjusts the parser state, a parser may need to
874    re-scan a token after ASI.
875
876In other kinds of generated parsers, at least the lexical goal symbol
877can be determined automatically.
878
879
880### Lazy compilation and scoping (**)
881
882*(entangled with: arrow functions)*
883
884JS engines *lazily compile* function bodies. During parsing, when the
885engine sees a `function`, it switches to a high-speed parsing mode
886(which I will call “Beast Mode”) that just skims the function and checks
887for syntax errors. Beast Mode does not compile the code. Beast Mode
888doesn’t even create AST nodes. All that will be done later, on demand,
889the first time the function is called.
890
891The point is to get through parsing *fast*, so that the script can start
892running. In browsers, `<script>` compilation usually must finish before
893we can show the user any meaningful content. Any part of it that can be
894deferred must be deferred. (Bonus: many functions are never called, so
895the work is not just deferred but avoided altogether.)
896
897Later, when a function is called, we can still defer compilation of
898nested functions inside it.
899
900So what? Seems easy enough, right? Well...
901
902Local variables in JS are optimized. To generate reasonable code for the
903script or function we are compiling, we need to know which of its local
904variables are used by nested functions, which we are *not* currently
905compiling. That is, we must know which variables *occur free in* each
906nested function. So scoping, which otherwise could be done as a separate
907phase of compilation, must be done during parsing in Beast Mode.
908
909Getting the scoping of arrow functions right while parsing is tricky,
910because it’s often not possible to know when you’re entering a scope
911until later. Consider parsing `(a`. This could be the beginning of an
912arrow function, or not; we might not know until after we reach the
913matching `)`, which could be a long way away.
914
915Annex B.3.3 adds extremely complex rules for scoping for function
916declarations which makes this especially difficult. In
917
918```js
919function f() {
920    let x = 1;
921    return function g() {
922        {
923            function x(){}
924        }
925        {
926            let x;
927        }
928        return x;
929    };
930}
931```
932
933the function `g` does not use the initial `let x`, but in
934
935```js
936function f() {
937    let x = 1;
938    return function g() {
939        {
940            {
941                function x(){}
942            }
943            let x;
944        }
945        return x;
946    };
947}
948```
949it does.
950
951[Here](https://dev.to/rkirsling/tales-from-ecma-s-crypt-annex-b-3-3-56go)
952is a good writeup of what's going on in these examples.
953
954
955### Arrow functions, assignment, destructuring, and cover grammars (\*\*\*)
956
957*(entangled with: lazy compilation)*
958
959Suppose the parser is scanning text from start to end, when it sees a
960statement that starts with something like this:
961
962```js
963let f = (a
964```
965
966The parser doesn’t know yet if `(a ...` is *ArrowParameters* or just a
967*ParenthesizedExpression*. Either is possible. We’ll know when either
968(1) we see something that rules out the other case:
969
970```js
971let f = (a + b           // can't be ArrowParameters
972
973let f = (a, b, ...args   // can't be ParenthesizedExpression
974```
975
976or (2) we reach the matching `)` and see if the next token is `=>`.
977
978Probably (1) is a pain to implement.
979
980To keep the language nominally LR(1), the standard specifies a “cover
981grammar”, a nonterminal named
982*CoverParenthesizedExpressionAndArrowParameterList* that is a superset
983of both *ArrowParameters* and *ParenthesizedExpression*.  The spec
984grammar uses the *Cover* nonterminal in both places, so technically `let
985f = (a + b) => 7;` and `let f = (a, b, ...args);` are both syntactically
986valid *Script*s, according to the formal grammar. But there are Early
987Error rules (that is, plain English text, not part of the formal
988grammar) that say arrow functions’ parameters must match
989*ArrowParameters*, and parenthesized expressions must match
990*ParenthesizedExpression*.
991
992So after the initial parse, the implementation must somehow check that
993the *CoverParenthesizedExpressionAndArrowParameterList* really is valid
994in context. This complicates lazy compilation because in Beast Mode we
995are not even building an AST. It’s not easy to go back and check.
996
997Something similar happens in a few other cases: the spec is written as
998though syntactic rules are applied after the fact:
999
1000*   *CoverCallExpressionAndAsyncArrowHead* covers the syntax `async(x)`
1001    which looks like a function call and also looks like the beginning
1002    of an async arrow function.
1003
1004*   *ObjectLiteral* is a cover grammar; it covers both actual object
1005    literals and the syntax `propertyName = expr`, which is not valid in
1006    object literals but allowed in destructuring assignment:
1007
1008    ```js
1009    var obj = {x = 1};  // SyntaxError, by an early error rule
1010
1011    ({x = 1} = {});   // ok, assigns 1 to x
1012    ```
1013
1014*   *LeftHandSideExpression* is used to the left of `=` in assignment,
1015    and as the operand of postfix `++` and `--`. But this is way too lax;
1016    most expressions shouldn’t be assigned to:
1017
1018    ```js
1019    1 = 0;  // matches the formal grammar, SyntaxError by an early error rule
1020
1021    null++;   // same
1022
1023    ["a", "b", "c"]++;  // same
1024    ```
1025
1026
1027## Conclusion
1028
1029What have we learned today?
1030
1031*   Do not write a JS parser.
1032
1033*   JavaScript has some syntactic horrors in it. But hey, you don't
1034    make the world's most widely used programming language by avoiding
1035    all mistakes. You do it by shipping a serviceable tool, in the right
1036    circumstances, for the right users.
1037