1# Appendix: Macro Follow-Set Ambiguity Formal Specification 2 3This page documents the formal specification of the follow rules for [Macros 4By Example]. They were originally specified in [RFC 550], from which the bulk 5of this text is copied, and expanded upon in subsequent RFCs. 6 7## Definitions & Conventions 8 9 - `macro`: anything invokable as `foo!(...)` in source code. 10 - `MBE`: macro-by-example, a macro defined by `macro_rules`. 11 - `matcher`: the left-hand-side of a rule in a `macro_rules` invocation, or a 12 subportion thereof. 13 - `macro parser`: the bit of code in the Rust parser that will parse the 14 input using a grammar derived from all of the matchers. 15 - `fragment`: The class of Rust syntax that a given matcher will accept (or 16 "match"). 17 - `repetition` : a fragment that follows a regular repeating pattern 18 - `NT`: non-terminal, the various "meta-variables" or repetition matchers 19 that can appear in a matcher, specified in MBE syntax with a leading `$` 20 character. 21 - `simple NT`: a "meta-variable" non-terminal (further discussion below). 22 - `complex NT`: a repetition matching non-terminal, specified via repetition 23 operators (`*`, `+`, `?`). 24 - `token`: an atomic element of a matcher; i.e. identifiers, operators, 25 open/close delimiters, *and* simple NT's. 26 - `token tree`: a tree structure formed from tokens (the leaves), complex 27 NT's, and finite sequences of token trees. 28 - `delimiter token`: a token that is meant to divide the end of one fragment 29 and the start of the next fragment. 30 - `separator token`: an optional delimiter token in an complex NT that 31 separates each pair of elements in the matched repetition. 32 - `separated complex NT`: a complex NT that has its own separator token. 33 - `delimited sequence`: a sequence of token trees with appropriate open- and 34 close-delimiters at the start and end of the sequence. 35 - `empty fragment`: The class of invisible Rust syntax that separates tokens, 36 i.e. whitespace, or (in some lexical contexts), the empty token sequence. 37 - `fragment specifier`: The identifier in a simple NT that specifies which 38 fragment the NT accepts. 39 - `language`: a context-free language. 40 41Example: 42 43```rust,compile_fail 44macro_rules! i_am_an_mbe { 45 (start $foo:expr $($i:ident),* end) => ($foo) 46} 47``` 48 49`(start $foo:expr $($i:ident),* end)` is a matcher. The whole matcher is a 50delimited sequence (with open- and close-delimiters `(` and `)`), and `$foo` 51and `$i` are simple NT's with `expr` and `ident` as their respective fragment 52specifiers. 53 54`$(i:ident),*` is *also* an NT; it is a complex NT that matches a 55comma-separated repetition of identifiers. The `,` is the separator token for 56the complex NT; it occurs in between each pair of elements (if any) of the 57matched fragment. 58 59Another example of a complex NT is `$(hi $e:expr ;)+`, which matches any 60fragment of the form `hi <expr>; hi <expr>; ...` where `hi <expr>;` occurs at 61least once. Note that this complex NT does not have a dedicated separator 62token. 63 64(Note that Rust's parser ensures that delimited sequences always occur with 65proper nesting of token tree structure and correct matching of open- and 66close-delimiters.) 67 68We will tend to use the variable "M" to stand for a matcher, variables "t" and 69"u" for arbitrary individual tokens, and the variables "tt" and "uu" for 70arbitrary token trees. (The use of "tt" does present potential ambiguity with 71its additional role as a fragment specifier; but it will be clear from context 72which interpretation is meant.) 73 74"SEP" will range over separator tokens, "OP" over the repetition operators 75`*`, `+`, and `?`, "OPEN"/"CLOSE" over matching token pairs surrounding a 76delimited sequence (e.g. `[` and `]`). 77 78Greek letters "α" "β" "γ" "δ" stand for potentially empty token-tree sequences. 79(However, the Greek letter "ε" (epsilon) has a special role in the presentation 80and does not stand for a token-tree sequence.) 81 82 * This Greek letter convention is usually just employed when the presence of 83 a sequence is a technical detail; in particular, when we wish to *emphasize* 84 that we are operating on a sequence of token-trees, we will use the notation 85 "tt ..." for the sequence, not a Greek letter. 86 87Note that a matcher is merely a token tree. A "simple NT", as mentioned above, 88is an meta-variable NT; thus it is a non-repetition. For example, `$foo:ty` is 89a simple NT but `$($foo:ty)+` is a complex NT. 90 91Note also that in the context of this formalism, the term "token" generally 92*includes* simple NTs. 93 94Finally, it is useful for the reader to keep in mind that according to the 95definitions of this formalism, no simple NT matches the empty fragment, and 96likewise no token matches the empty fragment of Rust syntax. (Thus, the *only* 97NT that can match the empty fragment is a complex NT.) This is not actually 98true, because the `vis` matcher can match an empty fragment. Thus, for the 99purposes of the formalism, we will treat `$v:vis` as actually being 100`$($v:vis)?`, with a requirement that the matcher match an empty fragment. 101 102### The Matcher Invariants 103 104To be valid, a matcher must meet the following three invariants. The definitions 105of FIRST and FOLLOW are described later. 106 1071. For any two successive token tree sequences in a matcher `M` (i.e. `M = ... 108 tt uu ...`) with `uu ...` nonempty, we must have FOLLOW(`... tt`) ∪ {ε} ⊇ 109 FIRST(`uu ...`). 1101. For any separated complex NT in a matcher, `M = ... $(tt ...) SEP OP ...`, 111 we must have `SEP` ∈ FOLLOW(`tt ...`). 1121. For an unseparated complex NT in a matcher, `M = ... $(tt ...) OP ...`, if 113 OP = `*` or `+`, we must have FOLLOW(`tt ...`) ⊇ FIRST(`tt ...`). 114 115The first invariant says that whatever actual token that comes after a matcher, 116if any, must be somewhere in the predetermined follow set. This ensures that a 117legal macro definition will continue to assign the same determination as to 118where `... tt` ends and `uu ...` begins, even as new syntactic forms are added 119to the language. 120 121The second invariant says that a separated complex NT must use a separator token 122that is part of the predetermined follow set for the internal contents of the 123NT. This ensures that a legal macro definition will continue to parse an input 124fragment into the same delimited sequence of `tt ...`'s, even as new syntactic 125forms are added to the language. 126 127The third invariant says that when we have a complex NT that can match two or 128more copies of the same thing with no separation in between, it must be 129permissible for them to be placed next to each other as per the first invariant. 130This invariant also requires they be nonempty, which eliminates a possible 131ambiguity. 132 133**NOTE: The third invariant is currently unenforced due to historical oversight 134and significant reliance on the behaviour. It is currently undecided what to do 135about this going forward. Macros that do not respect the behaviour may become 136invalid in a future edition of Rust. See the [tracking issue].** 137 138### FIRST and FOLLOW, informally 139 140A given matcher M maps to three sets: FIRST(M), LAST(M) and FOLLOW(M). 141 142Each of the three sets is made up of tokens. FIRST(M) and LAST(M) may also 143contain a distinguished non-token element ε ("epsilon"), which indicates that M 144can match the empty fragment. (But FOLLOW(M) is always just a set of tokens.) 145 146Informally: 147 148 * FIRST(M): collects the tokens potentially used first when matching a 149 fragment to M. 150 151 * LAST(M): collects the tokens potentially used last when matching a fragment 152 to M. 153 154 * FOLLOW(M): the set of tokens allowed to follow immediately after some 155 fragment matched by M. 156 157 In other words: t ∈ FOLLOW(M) if and only if there exists (potentially 158 empty) token sequences α, β, γ, δ where: 159 160 * M matches β, 161 162 * t matches γ, and 163 164 * The concatenation α β γ δ is a parseable Rust program. 165 166We use the shorthand ANYTOKEN to denote the set of all tokens (including simple 167NTs). For example, if any token is legal after a matcher M, then FOLLOW(M) = 168ANYTOKEN. 169 170(To review one's understanding of the above informal descriptions, the reader 171at this point may want to jump ahead to the [examples of 172FIRST/LAST](#examples-of-first-and-last) before reading their formal 173definitions.) 174 175### FIRST, LAST 176 177Below are formal inductive definitions for FIRST and LAST. 178 179"A ∪ B" denotes set union, "A ∩ B" denotes set intersection, and "A \ B" 180denotes set difference (i.e. all elements of A that are not present in B). 181 182#### FIRST 183 184FIRST(M) is defined by case analysis on the sequence M and the structure of its 185first token-tree (if any): 186 187 * if M is the empty sequence, then FIRST(M) = { ε }, 188 189 * if M starts with a token t, then FIRST(M) = { t }, 190 191 (Note: this covers the case where M starts with a delimited token-tree 192 sequence, `M = OPEN tt ... CLOSE ...`, in which case `t = OPEN` and thus 193 FIRST(M) = { `OPEN` }.) 194 195 (Note: this critically relies on the property that no simple NT matches the 196 empty fragment.) 197 198 * Otherwise, M is a token-tree sequence starting with a complex NT: `M = $( tt 199 ... ) OP α`, or `M = $( tt ... ) SEP OP α`, (where `α` is the (potentially 200 empty) sequence of token trees for the rest of the matcher). 201 202 * Let SEP\_SET(M) = { SEP } if SEP is present and ε ∈ FIRST(`tt ...`); 203 otherwise SEP\_SET(M) = {}. 204 205 * Let ALPHA\_SET(M) = FIRST(`α`) if OP = `*` or `?` and ALPHA\_SET(M) = {} if 206 OP = `+`. 207 * FIRST(M) = (FIRST(`tt ...`) \\ {ε}) ∪ SEP\_SET(M) ∪ ALPHA\_SET(M). 208 209The definition for complex NTs deserves some justification. SEP\_SET(M) defines 210the possibility that the separator could be a valid first token for M, which 211happens when there is a separator defined and the repeated fragment could be 212empty. ALPHA\_SET(M) defines the possibility that the complex NT could be empty, 213meaning that M's valid first tokens are those of the following token-tree 214sequences `α`. This occurs when either `*` or `?` is used, in which case there 215could be zero repetitions. In theory, this could also occur if `+` was used with 216a potentially-empty repeating fragment, but this is forbidden by the third 217invariant. 218 219From there, clearly FIRST(M) can include any token from SEP\_SET(M) or 220ALPHA\_SET(M), and if the complex NT match is nonempty, then any token starting 221FIRST(`tt ...`) could work too. The last piece to consider is ε. SEP\_SET(M) and 222FIRST(`tt ...`) \ {ε} cannot contain ε, but ALPHA\_SET(M) could. Hence, this 223definition allows M to accept ε if and only if ε ∈ ALPHA\_SET(M) does. This is 224correct because for M to accept ε in the complex NT case, both the complex NT 225and α must accept it. If OP = `+`, meaning that the complex NT cannot be empty, 226then by definition ε ∉ ALPHA\_SET(M). Otherwise, the complex NT can accept zero 227repetitions, and then ALPHA\_SET(M) = FOLLOW(`α`). So this definition is correct 228with respect to \varepsilon as well. 229 230#### LAST 231 232LAST(M), defined by case analysis on M itself (a sequence of token-trees): 233 234 * if M is the empty sequence, then LAST(M) = { ε } 235 236 * if M is a singleton token t, then LAST(M) = { t } 237 238 * if M is the singleton complex NT repeating zero or more times, `M = $( tt 239 ... ) *`, or `M = $( tt ... ) SEP *` 240 241 * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. 242 243 * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set 244 245 * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt 246 ...`) ∪ {ε}. 247 248 * if M is the singleton complex NT repeating one or more times, `M = $( tt ... 249 ) +`, or `M = $( tt ... ) SEP +` 250 251 * Let sep_set = { SEP } if SEP present; otherwise sep_set = {}. 252 253 * if ε ∈ LAST(`tt ...`) then LAST(M) = LAST(`tt ...`) ∪ sep_set 254 255 * otherwise, the sequence `tt ...` must be non-empty; LAST(M) = LAST(`tt 256 ...`) 257 258 * if M is the singleton complex NT repeating zero or one time, `M = $( tt ...) 259 ?`, then LAST(M) = LAST(`tt ...`) ∪ {ε}. 260 261 * if M is a delimited token-tree sequence `OPEN tt ... CLOSE`, then LAST(M) = 262 { `CLOSE` }. 263 264 * if M is a non-empty sequence of token-trees `tt uu ...`, 265 266 * If ε ∈ LAST(`uu ...`), then LAST(M) = LAST(`tt`) ∪ (LAST(`uu ...`) \ { ε }). 267 268 * Otherwise, the sequence `uu ...` must be non-empty; then LAST(M) = 269 LAST(`uu ...`). 270 271### Examples of FIRST and LAST 272 273Below are some examples of FIRST and LAST. 274(Note in particular how the special ε element is introduced and 275eliminated based on the interaction between the pieces of the input.) 276 277Our first example is presented in a tree structure to elaborate on how 278the analysis of the matcher composes. (Some of the simpler subtrees 279have been elided.) 280 281```text 282INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g 283 ~~~~~~~~ ~~~~~~~ ~ 284 | | | 285FIRST: { $d:ident } { $e:expr } { h } 286 287 288INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ 289 ~~~~~~~~~~~~~~~~~~ ~~~~~~~ ~~~ 290 | | | 291FIRST: { $d:ident } { h, ε } { f } 292 293INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g 294 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~ ~~~~~~~~~ ~ 295 | | | | 296FIRST: { $d:ident, ε } { h, ε, ; } { f } { g } 297 298 299INPUT: $( $d:ident $e:expr );* $( $( h )* );* $( f ; )+ g 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 301 | 302FIRST: { $d:ident, h, ;, f } 303``` 304 305Thus: 306 307 * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $( f ;)+ g`) = { `$d:ident`, `h`, `;`, `f` } 308 309Note however that: 310 311 * FIRST(`$($d:ident $e:expr );* $( $(h)* );* $($( f ;)+ g)*`) = { `$d:ident`, `h`, `;`, `f`, ε } 312 313Here are similar examples but now for LAST. 314 315 * LAST(`$d:ident $e:expr`) = { `$e:expr` } 316 * LAST(`$( $d:ident $e:expr );*`) = { `$e:expr`, ε } 317 * LAST(`$( $d:ident $e:expr );* $(h)*`) = { `$e:expr`, ε, `h` } 318 * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+`) = { `;` } 319 * LAST(`$( $d:ident $e:expr );* $(h)* $( f ;)+ g`) = { `g` } 320 321### FOLLOW(M) 322 323Finally, the definition for FOLLOW(M) is built up as follows. pat, expr, etc. 324represent simple nonterminals with the given fragment specifier. 325 326 * FOLLOW(pat) = {`=>`, `,`, `=`, `|`, `if`, `in`}`. 327 328 * FOLLOW(expr) = FOLLOW(stmt) = {`=>`, `,`, `;`}`. 329 330 * FOLLOW(ty) = FOLLOW(path) = {`{`, `[`, `,`, `=>`, `:`, `=`, `>`, `>>`, `;`, 331 `|`, `as`, `where`, block nonterminals}. 332 333 * FOLLOW(vis) = {`,`l any keyword or identifier except a non-raw `priv`; any 334 token that can begin a type; ident, ty, and path nonterminals}. 335 336 * FOLLOW(t) = ANYTOKEN for any other simple token, including block, ident, 337 tt, item, lifetime, literal and meta simple nonterminals, and all terminals. 338 339 * FOLLOW(M), for any other M, is defined as the intersection, as t ranges over 340 (LAST(M) \ {ε}), of FOLLOW(t). 341 342The tokens that can begin a type are, as of this writing, {`(`, `[`, `!`, `*`, 343`&`, `&&`, `?`, lifetimes, `>`, `>>`, `::`, any non-keyword identifier, `super`, 344`self`, `Self`, `extern`, `crate`, `$crate`, `_`, `for`, `impl`, `fn`, `unsafe`, 345`typeof`, `dyn`}, although this list may not be complete because people won't 346always remember to update the appendix when new ones are added. 347 348Examples of FOLLOW for complex M: 349 350 * FOLLOW(`$( $d:ident $e:expr )*`) = FOLLOW(`$e:expr`) 351 * FOLLOW(`$( $d:ident $e:expr )* $(;)*`) = FOLLOW(`$e:expr`) ∩ ANYTOKEN = FOLLOW(`$e:expr`) 352 * FOLLOW(`$( $d:ident $e:expr )* $(;)* $( f |)+`) = ANYTOKEN 353 354### Examples of valid and invalid matchers 355 356With the above specification in hand, we can present arguments for 357why particular matchers are legal and others are not. 358 359 * `($ty:ty < foo ,)` : illegal, because FIRST(`< foo ,`) = { `<` } ⊈ FOLLOW(`ty`) 360 361 * `($ty:ty , foo <)` : legal, because FIRST(`, foo <`) = { `,` } is ⊆ FOLLOW(`ty`). 362 363 * `($pa:pat $pb:pat $ty:ty ,)` : illegal, because FIRST(`$pb:pat $ty:ty ,`) = { `$pb:pat` } ⊈ FOLLOW(`pat`), and also FIRST(`$ty:ty ,`) = { `$ty:ty` } ⊈ FOLLOW(`pat`). 364 365 * `( $($a:tt $b:tt)* ; )` : legal, because FIRST(`$b:tt`) = { `$b:tt` } is ⊆ FOLLOW(`tt`) = ANYTOKEN, as is FIRST(`;`) = { `;` }. 366 367 * `( $($t:tt),* , $(t:tt),* )` : legal, (though any attempt to actually use this macro will signal a local ambiguity error during expansion). 368 369 * `($ty:ty $(; not sep)* -)` : illegal, because FIRST(`$(; not sep)* -`) = { `;`, `-` } is not in FOLLOW(`ty`). 370 371 * `($($ty:ty)-+)` : illegal, because separator `-` is not in FOLLOW(`ty`). 372 373 * `($($e:expr)*)` : illegal, because expr NTs are not in FOLLOW(expr NT). 374 375[Macros by Example]: macros-by-example.md 376[RFC 550]: https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md 377[tracking issue]: https://github.com/rust-lang/rust/issues/56575 378