1# Rules and Grammars
2
3Writing a PEGTL grammar means implementing custom parsing rules.
4
5Implementing custom parsing rules can be done either by
6
7* combining existing rules and combinators into new rules through inheritance, or
8
9* implementing a rule from scratch, i.e. writing a class with certain properties.
10
11## Contents
12
13* [Combining Existing Rules](#combining-existing-rules)
14* [Toy S-Expression Grammar](#toy-s-expression-grammar)
15* [Creating New Rules](#creating-new-rules)
16  * [Simple Rules](#simple-rules)
17  * [Complex Rules](#complex-rules)
18
19## Combining Existing Rules
20
21Combining existing rules is by far the more frequent way of creating new rules.
22
23Here is an example that shows how existing rules are combined into a new rule through inheritance:
24
25```c++
26using namespace tao::pegtl;
27
28struct integer
29   : seq<
30        opt< one< '+', '-' > >,  // ('+'/'-')?
31        plus< digit >            // digit+
32     > {};
33```
34
35It defines a new rule named `integer` that is a sequence of two parts, an optional character that can be one of `+` or `-`, followed by a non-empty repetition of a digit.
36Using inheritance in this way incurs no run-time penalty.
37
38See the [Rule Reference](Rule-Reference.md) for a complete list of all rules and combinators included with the PEGTL.
39
40Recursion, or cycles in the grammar, can be implemented after a forward-declaration of one or more rules.
41
42```c++
43struct number
44   : tao::pegtl::plus< tao::pegtl::digit > {};
45
46struct addition;  // Forward declaration to break the cyclic dependency.
47
48struct bracket
49   : tao::pegtl::if_must< tao::pegtl::one< '(' >, addition, tao::pegtl::one< ')' > > {};
50
51struct atomic
52   : tao::pegtl::sor< number, bracket > {};
53
54struct addition
55   : tao::pegtl::list< atomic, tao::pegtl::one< '+' > > {};
56```
57
58When defining a large set of grammar rules in this way it can be advisable to include a `using namespace tao::pegtl;`-definition at the beginning in order to prevent the frequent repetition of the `tao::pegtl::` namespace qualifier.
59This `using`-definition is often combined with the practice of confining a PEGTL grammar to a single translation unit, in which case there is no `namespace`-pollution, and the compile time is kept low by including the PEGTL only in the translation unit with the grammar.
60
61A grammar is nothing else than a collection of rules.
62In theory, as long as a grammar does not contain cycles, complete grammars could be implemented as a single, large rule.
63In practice, this is not advisable as it greatly reduces the readability and testability of the grammar, in addition to being quite unmaintainable.
64
65## Toy S-Expression Grammar
66
67To give another example of what a small real-world grammar might look like, below is the grammar for a toy-version of S-expressions.
68It only supports proper lists, symbols, comments and numbers.
69Numbers are non-empty sequences of ASCII digits.
70
71The rule named `file` is the intended top-level rule of the grammar, i.e. the rule that is supplied as template argument to [the `parse()` function](Inputs-and-Parsing.md#parse-function) in order to start a parsing run with this grammar.
72
73```c++
74struct hash_comment
75   : tao::pegtl::until< tao::pegtl::eolf > {};
76
77struct list;
78
79struct list_comment
80   : tao::pegtl::if_must< tao::pegtl::at< tao::pegtl::one< '(' > >, tao::pegtl::disable< list > > {};
81
82struct read_include
83   : tao::pegtl::seq< tao::pegtl::one< ' ' >, tao::pegtl::one< '"' >, tao::pegtl::plus< tao::pegtl::not_one< '"' > >, tao::pegtl::one< '"' > > {};
84
85struct hash_include
86   : tao::pegtl::if_must< tao::pegtl::string< 'i', 'n', 'c', 'l', 'u', 'd', 'e' >, read_include > {};
87
88struct hashed
89   : tao::pegtl::if_must< tao::pegtl::one< '#' >, tao::pegtl::sor< hash_include, list_comment, hash_comment > > {};
90
91struct number
92   : tao::pegtl::plus< tao::pegtl::digit > {};
93
94struct symbol
95   : tao::pegtl::identifier {};
96
97struct atom
98   : tao::pegtl::sor< number, symbol > {};
99
100struct anything;
101
102struct list
103   : tao::pegtl::if_must< tao::pegtl::one< '(' >, tao::pegtl::until< tao::pegtl::one< ')' >, anything > > {};
104
105struct normal
106   : tao::pegtl::sor< atom, list > {};
107
108struct anything
109   : tao::pegtl::sor< tao::pegtl::space, hashed, normal > {};
110
111struct main
112   : tao::pegtl::until< tao::pegtl::eof, tao::pegtl::must< anything > > {};
113```
114
115In order to let a parsing run do more than verify whether an input conforms to the grammar, it is necessary to attach user-defined *actions* to some grammar rules, as explained in [Actions and States](Actions-and-States.md).
116
117## Creating New Rules
118
119Sometimes a grammar requires a parsing rule that can not be readily created as combination of the existing rules.
120In these cases a custom grammar rule, i.e. a class with a static member function called `match()` that has to adhere to one of two possible interfaces or prototypes, can be implemented from scratch.
121
122When implementing a custom rule class, it is important to remember that the input passed to `match()` represents the *remainder* of the complete input.
123At the beginning of a parsing run, the input represents the complete data-to-be-parsed.
124During the parsing run, many rules *consume* the data that matched from the input.
125Consuming data from an input advances the pointer to the data that the input's `begin()` member function returns, and decrements the size by the same amount.
126
127The PEGTL makes one **important** assumption about all parsing rules.
128If a call to `match()` returns with `false`, then the rule **must not** have consumed input (for [complex rules](#complex-rules): only when the `rewind_mode` is `required`).
129For performance reasons this assumption is neither ensured nor verified by the PEGTL.
130
131### Simple Rules
132
133In the simplified rule, `match()` is called with a single argument, the input.
134It returns a `bool` to indicate success or (local) failure.
135Rules with the simplified interface are called without the states as arguments.
136
137```c++
138struct simple_rule
139{
140   template< typename ParseInput >
141   static bool match( ParseInput& in ) { ... }
142};
143```
144
145Here is an excerpt from the included example program `src/example/pegtl/modulus_match.cpp` that shows a simple custom rule.
146The - slightly artificial - rule `my_rule` uses three important `input` functions,
147
1481. first `empty()` to check whether the input is not empty,
149
1502. then `current()` to access the data and check whether the remainder of the first remaining input character `C` happens to satisfy `C % M == R`,
151
1523. and finally `bump()` to consume one `char` from the input if the two above conditions are satisfied.
153
154Note how the return value reflects the result of the checks, and how input is only consumed when the return value is `true`.
155The remainder of the program checks that all characters of `argv[ 1 ]` are equal to 0 when divided by 3.
156
157```c++
158using namespace TAO_PEGTL_NAMESPACE;
159
160namespace modulus
161{
162   template< unsigned M, unsigned R = 0 >
163   struct my_rule
164   {
165      static_assert( M > 1, "Modulus must be greater than 1" );
166      static_assert( R < M, "Remainder must be less than modulus" );
167
168      template< typename ParseInput >
169      static bool match( ParseInput& in )
170      {
171         if( !in.empty() ) {
172            if( ( ( *in.current() ) % M ) == R ) {
173               in.bump( 1 );
174               return true;
175            }
176         }
177         return false;
178      }
179   };
180
181   struct grammar
182      : until< eolf, must< my_rule< 3 > > >
183   {};
184
185}  // namespace modulus
186
187int main( int argc, char** argv )
188{
189   if( argc > 1 ) {
190      argv_input in( argv, 1 );
191      parse< modulus::grammar >( in );
192   }
193   return 0;
194}
195```
196
197### Complex Rules
198
199The complex calling convention gives a rule's `match()` member function access to "everything", i.e. some modes, the action- and control class, and all state arguments.
200All of these parameters are required for custom rules that need to themselves call other rules for matching.
201
202The signature of `match()` in a complex rule takes the following form.
203
204```c++
205struct complex_rule
206{
207   using rule_t = complex_rule;
208   using subs_t = tao::pegtl::empty_list;  // Or tao::pegtl::rule_list< sub_rules_of_complex_rule... >.
209
210   template< tao::pegtl::apply_mode A,
211             tao::pegtl::rewind_mode M,
212             template< typename... > class Action,
213             template< typename... > class Control,
214             typename ParseInput,
215             typename... States >
216   static bool match( ParseInput& in, States&&... )
217   { ... }
218};
219```
220
221#### Modes
222
223The `apply_mode` can take the value `apply_mode::action` or `apply_mode::nothing`, depending on whether actions are currently enabled or disabled.
224Most custom parsing rules will either ignore, or pass on the `apply_mode` unchanged; usually only the control interprets the `apply_mode`.
225
226The `rewind_mode` can take the value `rewind_mode::active`, `rewind_mode::required` or `rewind_mode::dontcare`.
227When `M` is `rewind_mode::required`, the custom rule's `match()`-implementation **must**, on local failure, rewind the input to where it (the input) was when `match()` was called.
228
229When `M` is **not** `rewind_mode::required`, it is not necessary to perform rewinding as either some other rule further up the call stack is already taking care of it (`rewind_mode::active`), or rewinding is not necessary (`rewind_mode::dontcare`).
230For example within a `must<>`-rule (which converts local failure, a return value of `false` from `match()`, to global failure, an exception) the `rewind_mode` is `dontcare`.
231
232The following implementation of the `seq`-rule's `match()` shows how to correctly handle the `rewind_mode`.
233The input's `mark()` member function uses the `rewind_mode` to choose which input marker to return, either one that takes care of rewinding when required, or a dummy object that does nothing.
234In the first case, `next_rewind_mode` is set to `active`, otherwise it is equal to `M`, just as required for the next rules called by the current one.
235The return value of `match()` is then passed through the input marker `m` so that, if the return value is `false` and the marker is not the dummy, it can rewind the input `in`.
236
237```c++
238template< typename... Rules >
239struct seq
240{
241    template< apply_mode A,
242              rewind_mode M,
243              template< typename... > class Action,
244              template< typename... > class Control,
245              typename ParseInput,
246              typename... States >
247    static bool match( ParseInput& in, States&&... st )
248    {
249       auto m = in.template mark< M >();
250       using m_t = decltype( m );
251       return m( rule_conjunction< Rules... >::template
252                 match< A, m_t::next_rewind_mode, Action, Control >( in, st... ) );
253    }
254};
255```
256
257#### Example
258
259The following excerpt from the included example program `src/example/pegtl/dynamic_match.cpp` shows a complex custom rule that itself makes use of a state argument.
260This is necessary to cleanly implement dynamic matching, i.e. where a (set of) string(s) that a rule is intended to match depends on some run-time data structure rather than some compile-time type (the latter of which includes all template arguments).
261
262The aim is to parse a kind of *long string literal*, an arbitrary string literal that does not require escaping of any special characters, as is common in many scripting languages.
263In order to allow for arbitrary content without escaping it has to be possible to choose a string sequence that is not part of the string literal as delimiter.
264
265For this example we adopt the convention that a long string literal begins with `"[foo["` and ends with `"]foo]"` where `"foo"` is any non-empty string that does not contain a `"["` (quotation marks always excluded).
266
267Please note that the following code snippets are not in actual source code order.
268
269First we define a rule for the opening of a long string literal as explained above.
270
271```c++
272namespace dynamic
273{
274   struct long_literal_id
275      : tao::pegtl::plus< tao::pegtl::not_one< '[' > > {};
276
277   struct long_literal_open
278      : tao::pegtl::seq< tao::pegtl::one< '[' >,
279                         long_literal_id,
280                         tao::pegtl::one< '[' > > {};
281```
282
283Then we implement an action class with a specialisation for what is the `"foo"`-part of the long string literal's opening sequence.
284The action stores the matched string that corresponds to `"foo"` in a string variable that is passed as state argument.
285
286```c++
287   template< typename Rule >
288   struct action {};
289
290   template<>
291   struct action< long_literal_id >
292   {
293      template< typename ActionInput >
294      static void apply( const ActionInput& in,
295                         std::string& id,
296                         const std::string& )
297      {
298         id = in.string();
299      }
300   };
301```
302
303The rule for the closing sequence is similar to the opening, with closing instead of opening brackets, and with a custom rule to check for the `"foo"`-part.
304
305```c++
306   struct long_literal_close
307      : tao::pegtl::seq< tao::pegtl::one< ']' >,
308                         long_literal_mark,
309                         tao::pegtl::one< ']' > > {};
310```
311
312The custom rule itself
313
3141. first checks whether the input contains enough bytes to match the string stored by the action,
315
3162. then checks whether the input bytes match the stored string, and
317
3183. finally calls `bump()` to consume the correct number of bytes from the input when both checks succeed.
319
320```c++
321   struct long_literal_mark
322   {
323      template< tao::pegtl::apply_mode A,
324                tao::pegtl::rewind_mode M,
325                template< typename... > class Action,
326                template< typename... > class Control,
327                typename ParseInput >
328      static bool match( ParseInput& in,
329                         const std::string& id,
330                         const std::string& )
331      {
332         if( in.size( id.size() ) >= id.size() ) {
333            if( std::memcmp( in.begin(), id.data(), id.size() ) == 0 ) {
334               in.bump( id.size() );
335               return true;
336            }
337         }
338         return false;
339      }
340   };
341```
342
343The grammar is completed with another two rules for putting everything together, and an action that stores the body of the long string literal in a second state argument.
344In this case the rule `long_literal_body` is redundant, however real-world examples frequently contain a rule like `tao::pegtl::any` multiple times, and so it is necessary to give it another name in order to attach different actions to different uses of the same rule.
345
346```c++
347   struct long_literal_body
348      : tao::pegtl::any {};
349
350   struct grammar
351      : tao::pegtl::if_must< long_literal_open,
352                             tao::pegtl::until< long_literal_close,
353                                                long_literal_body >,
354                             tao::pegtl::eof > {};
355
356   template<> struct action< long_literal_body >
357   {
358      template< typename ActionInput >
359      static void apply( const ActionInput& in,
360                         const std::string&,
361                         std::string& body )
362      {
363         body += in.string();
364      }
365   };
366
367}  // namespace dynamic
368```
369
370Given the main function...
371
372```c++
373int main( int argc, char* argv[] )
374{
375   if( argc > 1 ) {
376      std::string id;
377      std::string body;
378
379      tao::pegtl::argv_input in( argv, 1 );
380      tao::pegtl::parse< dynamic::grammar, dynamic::action >( in, id, body );
381
382      std::cout << "long literal id was: " << id << std::endl;
383      std::cout << "long literal body was: " << body << std::endl;
384   }
385   return 0;
386}
387```
388
389...we can see the grammar in action in the shell:
390
391```sh
392$ build/src/example/pegtl/dynamic_match '[foo["[bla]"]foo]'
393long literal id was: foo
394long literal body was: "[bla]"
395
396$ build/src/example/pegtl/dynamic_match '["fraggle"["[foo["]"fraggle"]'
397long literal id was: "fraggle"
398long literal body was: "[foo["
399```
400
401Copyright (c) 2014-2020 Dr. Colin Hirsch and Daniel Frey
402