1# Grammar Reference
2
3## Definitions
4
5A **grammar** is a list of rules and terminals, that together define a language.
6
7Terminals define the alphabet of the language, while rules define its structure.
8
9In Lark, a terminal may be a string, a regular expression, or a concatenation of these and other terminals.
10
11Each rule is a list of terminals and rules, whose location and nesting define the structure of the resulting parse-tree.
12
13A **parsing algorithm** is an algorithm that takes a grammar definition and a sequence of symbols (members of the alphabet), and matches the entirety of the sequence by searching for a structure that is allowed by the grammar.
14
15### General Syntax and notes
16
17Grammars in Lark are based on [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) syntax, with several enhancements.
18
19EBNF is basically a short-hand for common BNF patterns.
20
21Optionals are expanded:
22
23```ebnf
24  a b? c    ->    (a c | a b c)
25```
26
27Repetition is extracted into a recursion:
28
29```ebnf
30  a: b*    ->    a: _b_tag
31                 _b_tag: (_b_tag b)?
32```
33
34And so on.
35
36Lark grammars are composed of a list of definitions and directives, each on its own line. A definition is either a named rule, or a named terminal, with the following syntax, respectively:
37
38```c
39  rule: <EBNF EXPRESSION>
40      | etc.
41
42  TERM: <EBNF EXPRESSION>   // Rules aren't allowed
43```
44
45
46**Comments** start with `//` and last to the end of the line (C++ style)
47
48Lark begins the parse with the rule 'start', unless specified otherwise in the options.
49
50Names of rules are always in lowercase, while names of terminals are always in uppercase. This distinction has practical effects, for the shape of the generated parse-tree, and the automatic construction of the lexer (aka tokenizer, or scanner).
51
52
53## Terminals
54
55Terminals are used to match text into symbols. They can be defined as a combination of literals and other terminals.
56
57**Syntax:**
58
59```html
60<NAME> [. <priority>] : <literals-and-or-terminals>
61```
62
63Terminal names must be uppercase.
64
65Literals can be one of:
66
67* `"string"`
68* `/regular expression+/`
69* `"case-insensitive string"i`
70* `/re with flags/imulx`
71* Literal range: `"a".."z"`, `"1".."9"`, etc.
72
73Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.
74
75Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
76
77### Templates
78
79Templates are expanded when preprocessing the grammar.
80
81Definition syntax:
82
83```ebnf
84  my_template{param1, param2, ...}: <EBNF EXPRESSION>
85```
86
87Use syntax:
88
89```ebnf
90some_rule: my_template{arg1, arg2, ...}
91```
92
93Example:
94```ebnf
95_separated{x, sep}: x (sep x)*  // Define a sequence of 'x sep x sep x ...'
96
97num_list: "[" _separated{NUMBER, ","} "]"   // Will match "[1, 2, 3]" etc.
98```
99
100### Priority
101
102Terminals can be assigned priority only when using a lexer (future versions may support Earley's dynamic lexing).
103
104Priority can be either positive or negative. If not specified for a terminal, it defaults to 1.
105
106Highest priority terminals are always matched first.
107
108### Regexp Flags
109
110You can use flags on regexps and strings. For example:
111
112```perl
113SELECT: "select"i     //# Will ignore case, and match SELECT or Select, etc.
114MULTILINE_TEXT: /.+/s
115SIGNED_INTEGER: /
116    [+-]?  # the sign
117    (0|[1-9][0-9]*)  # the digits
118 /x
119```
120
121Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
122
123Regexps/strings of different flags can only be concatenated in Python 3.6+
124
125#### Notes for when using a lexer:
126
127When using a lexer (standard or contextual), it is the grammar-author's responsibility to make sure the literals don't collide, or that if they do, they are matched in the desired order. Literals are matched according to the following precedence:
128
1291. Highest priority first (priority is specified as: TERM.number: ...)
1302. Length of match (for regexps, the longest theoretical match is used)
1313. Length of literal / pattern definition
1324. Name
133
134**Examples:**
135```perl
136IF: "if"
137INTEGER : /[0-9]+/
138INTEGER2 : ("0".."9")+          //# Same as INTEGER
139DECIMAL.2: INTEGER? "." INTEGER  //# Will be matched before INTEGER
140WHITESPACE: (" " | /\t/ )+
141SQL_SELECT: "select"i
142```
143
144### Regular expressions & Ambiguity
145
146Each terminal is eventually compiled to a regular expression. All the operators and references inside it are mapped to their respective expressions.
147
148For example, in the following grammar, `A1` and `A2`, are equivalent:
149```perl
150A1: "a" | "b"
151A2: /a|b/
152```
153
154This means that inside terminals, Lark cannot detect or resolve ambiguity, even when using Earley.
155
156For example, for this grammar:
157```perl
158start           : (A | B)+
159A               : "a" | "ab"
160B               : "b"
161```
162We get only one possible derivation, instead of two:
163
164```bash
165>>> p = Lark(g, ambiguity="explicit")
166>>> p.parse("ab")
167Tree('start', [Token('A', 'ab')])
168```
169
170This is happening because Python's regex engine always returns the best matching option. There is no way to access the alternatives.
171
172If you find yourself in this situation, the recommended solution is to use rules instead.
173
174Example:
175
176```python
177>>> p = Lark("""start: (a | b)+
178...             !a: "a" | "ab"
179...             !b: "b"
180...             """, ambiguity="explicit")
181>>> print(p.parse("ab").pretty())
182_ambig
183  start
184    a   ab
185  start
186    a   a
187    b   b
188```
189
190
191## Rules
192
193**Syntax:**
194```html
195<name> : <items-to-match>  [-> <alias> ]
196       | ...
197```
198
199Names of rules and aliases are always in lowercase.
200
201Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).
202
203An alias is a name for the specific rule alternative. It affects tree construction.
204
205
206Each item is one of:
207
208* `rule`
209* `TERMINAL`
210* `"string literal"` or `/regexp literal/`
211* `(item item ..)` - Group items
212* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
213* `item?` - Zero or one instances of item ("maybe")
214* `item*` - Zero or more instances of item
215* `item+` - One or more instances of item
216* `item ~ n` - Exactly *n* instances of item
217* `item ~ n..m` - Between *n* to *m* instances of item (not recommended for wide ranges, due to performance issues)
218
219**Examples:**
220```perl
221hello_world: "hello" "world"
222mul: (mul "*")? number     //# Left-recursion is allowed and encouraged!
223expr: expr operator expr
224    | value               //# Multi-line, belongs to expr
225
226four_words: word ~ 4
227```
228
229### Priority
230
231Rules can be assigned priority only when using Earley (future versions may support LALR as well).
232
233Priority can be either positive or negative. In not specified for a terminal, it's assumed to be 1 (i.e. the default).
234
235<a name="dirs"></a>
236## Directives
237
238### %ignore
239
240All occurrences of the terminal will be ignored, and won't be part of the parse.
241
242Using the `%ignore` directive results in a cleaner grammar.
243
244It's especially important for the LALR(1) algorithm, because adding whitespace (or comments, or other extraneous elements) explicitly in the grammar, harms its predictive abilities, which are based on a lookahead of 1.
245
246**Syntax:**
247```html
248%ignore <TERMINAL>
249```
250**Examples:**
251```perl
252%ignore " "
253
254COMMENT: "#" /[^\n]/*
255%ignore COMMENT
256```
257### %import
258
259Allows one to import terminals and rules from lark grammars.
260
261When importing rules, all their dependencies will be imported into a namespace, to avoid collisions. It's not possible to override their dependencies (e.g. like you would when inheriting a class).
262
263**Syntax:**
264```html
265%import <module>.<TERMINAL>
266%import <module>.<rule>
267%import <module>.<TERMINAL> -> <NEWTERMINAL>
268%import <module>.<rule> -> <newrule>
269%import <module> (<TERM1>, <TERM2>, <rule1>, <rule2>)
270```
271
272If the module path is absolute, Lark will attempt to load it from the built-in directory (which currently contains `common.lark`, `python.lark`, and `unicode.lark`).
273
274If the module path is relative, such as `.path.to.file`, Lark will attempt to load it from the current working directory. Grammars must have the `.lark` extension.
275
276The rule or terminal can be imported under another name with the `->` syntax.
277
278**Example:**
279```perl
280%import common.NUMBER
281
282%import .terminals_file (A, B, C)
283
284%import .rules_file.rulea -> ruleb
285```
286
287Note that `%ignore` directives cannot be imported. Imported rules will abide by the `%ignore` directives declared in the main grammar.
288
289### %declare
290
291Declare a terminal without defining it. Useful for plugins.
292
293### %override
294
295Override a rule or terminals, affecting all references to it, even in imported grammars.
296
297Useful for implementing an inheritance pattern when importing grammars.
298
299**Example:**
300```perl
301%import my_grammar (start, number, NUMBER)
302
303// Add hex support to my_grammar
304%override number: NUMBER | /0x\w+/
305```
306
307### %extend
308
309Extend the definition of a rule or terminal, e.g. add a new option on what it can match, like when separated with `|`.
310
311Useful for splitting up a definition of a complex rule with many different options over multiple files.
312
313Can also be used to implement a plugin system where a core grammar is extended by others.
314
315
316**Example:**
317```perl
318%import my_grammar (start, NUMBER)
319
320// Add hex support to my_grammar
321%extend NUMBER: /0x\w+/
322```
323
324For both `%extend` and `%override`, there is not requirement for a rule/terminal to come from another file, but that is probably the most common usecase