1# Rules and Grammars 2 3Writing a PEGTL grammar means implementing custom parsing rules. 4 5Implementing custom parsing rules can be done either by 6 7* combining existing rules and combinators into new rules through inheritance, or 8 9* implementing a rule from scratch, i.e. writing a class with certain properties. 10 11## Contents 12 13* [Combining Existing Rules](#combining-existing-rules) 14* [Toy S-Expression Grammar](#toy-s-expression-grammar) 15* [Creating New Rules](#creating-new-rules) 16 * [Simple Rules](#simple-rules) 17 * [Complex Rules](#complex-rules) 18 19## Combining Existing Rules 20 21Combining existing rules is by far the more frequent way of creating new rules. 22 23Here is an example that shows how existing rules are combined into a new rule through inheritance: 24 25```c++ 26using namespace tao::pegtl; 27 28struct integer 29 : seq< 30 opt< one< '+', '-' > >, // ('+'/'-')? 31 plus< digit > // digit+ 32 > {}; 33``` 34 35It defines a new rule named `integer` that is a sequence of two parts, an optional character that can be one of `+` or `-`, followed by a non-empty repetition of a digit. 36Using inheritance in this way incurs no run-time penalty. 37 38See the [Rule Reference](Rule-Reference.md) for a complete list of all rules and combinators included with the PEGTL. 39 40Recursion, or cycles in the grammar, can be implemented after a forward-declaration of one or more rules. 41 42```c++ 43struct number 44 : tao::pegtl::plus< tao::pegtl::digit > {}; 45 46struct addition; // Forward declaration to break the cyclic dependency. 47 48struct bracket 49 : tao::pegtl::if_must< tao::pegtl::one< '(' >, addition, tao::pegtl::one< ')' > > {}; 50 51struct atomic 52 : tao::pegtl::sor< number, bracket > {}; 53 54struct addition 55 : tao::pegtl::list< atomic, tao::pegtl::one< '+' > > {}; 56``` 57 58When defining a large set of grammar rules in this way it can be advisable to include a `using namespace tao::pegtl;`-definition at the beginning in order to prevent the frequent repetition of the `tao::pegtl::` namespace qualifier. 59This `using`-definition is often combined with the practice of confining a PEGTL grammar to a single translation unit, in which case there is no `namespace`-pollution, and the compile time is kept low by including the PEGTL only in the translation unit with the grammar. 60 61A grammar is nothing else than a collection of rules. 62In theory, as long as a grammar does not contain cycles, complete grammars could be implemented as a single, large rule. 63In practice, this is not advisable as it greatly reduces the readability and testability of the grammar, in addition to being quite unmaintainable. 64 65## Toy S-Expression Grammar 66 67To give another example of what a small real-world grammar might look like, below is the grammar for a toy-version of S-expressions. 68It only supports proper lists, symbols, comments and numbers. 69Numbers are non-empty sequences of ASCII digits. 70 71The rule named `file` is the intended top-level rule of the grammar, i.e. the rule that is supplied as template argument to [the `parse()` function](Inputs-and-Parsing.md#parse-function) in order to start a parsing run with this grammar. 72 73```c++ 74struct hash_comment 75 : tao::pegtl::until< tao::pegtl::eolf > {}; 76 77struct list; 78 79struct list_comment 80 : tao::pegtl::if_must< tao::pegtl::at< tao::pegtl::one< '(' > >, tao::pegtl::disable< list > > {}; 81 82struct read_include 83 : tao::pegtl::seq< tao::pegtl::one< ' ' >, tao::pegtl::one< '"' >, tao::pegtl::plus< tao::pegtl::not_one< '"' > >, tao::pegtl::one< '"' > > {}; 84 85struct hash_include 86 : tao::pegtl::if_must< tao::pegtl::string< 'i', 'n', 'c', 'l', 'u', 'd', 'e' >, read_include > {}; 87 88struct hashed 89 : tao::pegtl::if_must< tao::pegtl::one< '#' >, tao::pegtl::sor< hash_include, list_comment, hash_comment > > {}; 90 91struct number 92 : tao::pegtl::plus< tao::pegtl::digit > {}; 93 94struct symbol 95 : tao::pegtl::identifier {}; 96 97struct atom 98 : tao::pegtl::sor< number, symbol > {}; 99 100struct anything; 101 102struct list 103 : tao::pegtl::if_must< tao::pegtl::one< '(' >, tao::pegtl::until< tao::pegtl::one< ')' >, anything > > {}; 104 105struct normal 106 : tao::pegtl::sor< atom, list > {}; 107 108struct anything 109 : tao::pegtl::sor< tao::pegtl::space, hashed, normal > {}; 110 111struct main 112 : tao::pegtl::until< tao::pegtl::eof, tao::pegtl::must< anything > > {}; 113``` 114 115In order to let a parsing run do more than verify whether an input conforms to the grammar, it is necessary to attach user-defined *actions* to some grammar rules, as explained in [Actions and States](Actions-and-States.md). 116 117## Creating New Rules 118 119Sometimes a grammar requires a parsing rule that can not be readily created as combination of the existing rules. 120In these cases a custom grammar rule, i.e. a class with a static member function called `match()` that has to adhere to one of two possible interfaces or prototypes, can be implemented from scratch. 121 122When implementing a custom rule class, it is important to remember that the input passed to `match()` represents the *remainder* of the complete input. 123At the beginning of a parsing run, the input represents the complete data-to-be-parsed. 124During the parsing run, many rules *consume* the data that matched from the input. 125Consuming data from an input advances the pointer to the data that the input's `begin()` member function returns, and decrements the size by the same amount. 126 127The PEGTL makes one **important** assumption about all parsing rules. 128If a call to `match()` returns with `false`, then the rule **must not** have consumed input (for [complex rules](#complex-rules): only when the `rewind_mode` is `required`). 129For performance reasons this assumption is neither ensured nor verified by the PEGTL. 130 131### Simple Rules 132 133In the simplified rule, `match()` is called with a single argument, the input. 134It returns a `bool` to indicate success or (local) failure. 135Rules with the simplified interface are called without the states as arguments. 136 137```c++ 138struct simple_rule 139{ 140 template< typename ParseInput > 141 static bool match( ParseInput& in ) { ... } 142}; 143``` 144 145Here is an excerpt from the included example program `src/example/pegtl/modulus_match.cpp` that shows a simple custom rule. 146The - slightly artificial - rule `my_rule` uses three important `input` functions, 147 1481. first `empty()` to check whether the input is not empty, 149 1502. then `current()` to access the data and check whether the remainder of the first remaining input character `C` happens to satisfy `C % M == R`, 151 1523. and finally `bump()` to consume one `char` from the input if the two above conditions are satisfied. 153 154Note how the return value reflects the result of the checks, and how input is only consumed when the return value is `true`. 155The remainder of the program checks that all characters of `argv[ 1 ]` are equal to 0 when divided by 3. 156 157```c++ 158using namespace TAO_PEGTL_NAMESPACE; 159 160namespace modulus 161{ 162 template< unsigned M, unsigned R = 0 > 163 struct my_rule 164 { 165 static_assert( M > 1, "Modulus must be greater than 1" ); 166 static_assert( R < M, "Remainder must be less than modulus" ); 167 168 template< typename ParseInput > 169 static bool match( ParseInput& in ) 170 { 171 if( !in.empty() ) { 172 if( ( ( *in.current() ) % M ) == R ) { 173 in.bump( 1 ); 174 return true; 175 } 176 } 177 return false; 178 } 179 }; 180 181 struct grammar 182 : until< eolf, must< my_rule< 3 > > > 183 {}; 184 185} // namespace modulus 186 187int main( int argc, char** argv ) 188{ 189 if( argc > 1 ) { 190 argv_input in( argv, 1 ); 191 parse< modulus::grammar >( in ); 192 } 193 return 0; 194} 195``` 196 197### Complex Rules 198 199The complex calling convention gives a rule's `match()` member function access to "everything", i.e. some modes, the action- and control class, and all state arguments. 200All of these parameters are required for custom rules that need to themselves call other rules for matching. 201 202The signature of `match()` in a complex rule takes the following form. 203 204```c++ 205struct complex_rule 206{ 207 using rule_t = complex_rule; 208 using subs_t = tao::pegtl::empty_list; // Or tao::pegtl::rule_list< sub_rules_of_complex_rule... >. 209 210 template< tao::pegtl::apply_mode A, 211 tao::pegtl::rewind_mode M, 212 template< typename... > class Action, 213 template< typename... > class Control, 214 typename ParseInput, 215 typename... States > 216 static bool match( ParseInput& in, States&&... ) 217 { ... } 218}; 219``` 220 221#### Modes 222 223The `apply_mode` can take the value `apply_mode::action` or `apply_mode::nothing`, depending on whether actions are currently enabled or disabled. 224Most custom parsing rules will either ignore, or pass on the `apply_mode` unchanged; usually only the control interprets the `apply_mode`. 225 226The `rewind_mode` can take the value `rewind_mode::active`, `rewind_mode::required` or `rewind_mode::dontcare`. 227When `M` is `rewind_mode::required`, the custom rule's `match()`-implementation **must**, on local failure, rewind the input to where it (the input) was when `match()` was called. 228 229When `M` is **not** `rewind_mode::required`, it is not necessary to perform rewinding as either some other rule further up the call stack is already taking care of it (`rewind_mode::active`), or rewinding is not necessary (`rewind_mode::dontcare`). 230For example within a `must<>`-rule (which converts local failure, a return value of `false` from `match()`, to global failure, an exception) the `rewind_mode` is `dontcare`. 231 232The following implementation of the `seq`-rule's `match()` shows how to correctly handle the `rewind_mode`. 233The input's `mark()` member function uses the `rewind_mode` to choose which input marker to return, either one that takes care of rewinding when required, or a dummy object that does nothing. 234In the first case, `next_rewind_mode` is set to `active`, otherwise it is equal to `M`, just as required for the next rules called by the current one. 235The return value of `match()` is then passed through the input marker `m` so that, if the return value is `false` and the marker is not the dummy, it can rewind the input `in`. 236 237```c++ 238template< typename... Rules > 239struct seq 240{ 241 template< apply_mode A, 242 rewind_mode M, 243 template< typename... > class Action, 244 template< typename... > class Control, 245 typename ParseInput, 246 typename... States > 247 static bool match( ParseInput& in, States&&... st ) 248 { 249 auto m = in.template mark< M >(); 250 using m_t = decltype( m ); 251 return m( rule_conjunction< Rules... >::template 252 match< A, m_t::next_rewind_mode, Action, Control >( in, st... ) ); 253 } 254}; 255``` 256 257#### Example 258 259The following excerpt from the included example program `src/example/pegtl/dynamic_match.cpp` shows a complex custom rule that itself makes use of a state argument. 260This is necessary to cleanly implement dynamic matching, i.e. where a (set of) string(s) that a rule is intended to match depends on some run-time data structure rather than some compile-time type (the latter of which includes all template arguments). 261 262The aim is to parse a kind of *long string literal*, an arbitrary string literal that does not require escaping of any special characters, as is common in many scripting languages. 263In order to allow for arbitrary content without escaping it has to be possible to choose a string sequence that is not part of the string literal as delimiter. 264 265For this example we adopt the convention that a long string literal begins with `"[foo["` and ends with `"]foo]"` where `"foo"` is any non-empty string that does not contain a `"["` (quotation marks always excluded). 266 267Please note that the following code snippets are not in actual source code order. 268 269First we define a rule for the opening of a long string literal as explained above. 270 271```c++ 272namespace dynamic 273{ 274 struct long_literal_id 275 : tao::pegtl::plus< tao::pegtl::not_one< '[' > > {}; 276 277 struct long_literal_open 278 : tao::pegtl::seq< tao::pegtl::one< '[' >, 279 long_literal_id, 280 tao::pegtl::one< '[' > > {}; 281``` 282 283Then we implement an action class with a specialisation for what is the `"foo"`-part of the long string literal's opening sequence. 284The action stores the matched string that corresponds to `"foo"` in a string variable that is passed as state argument. 285 286```c++ 287 template< typename Rule > 288 struct action {}; 289 290 template<> 291 struct action< long_literal_id > 292 { 293 template< typename ActionInput > 294 static void apply( const ActionInput& in, 295 std::string& id, 296 const std::string& ) 297 { 298 id = in.string(); 299 } 300 }; 301``` 302 303The rule for the closing sequence is similar to the opening, with closing instead of opening brackets, and with a custom rule to check for the `"foo"`-part. 304 305```c++ 306 struct long_literal_close 307 : tao::pegtl::seq< tao::pegtl::one< ']' >, 308 long_literal_mark, 309 tao::pegtl::one< ']' > > {}; 310``` 311 312The custom rule itself 313 3141. first checks whether the input contains enough bytes to match the string stored by the action, 315 3162. then checks whether the input bytes match the stored string, and 317 3183. finally calls `bump()` to consume the correct number of bytes from the input when both checks succeed. 319 320```c++ 321 struct long_literal_mark 322 { 323 template< tao::pegtl::apply_mode A, 324 tao::pegtl::rewind_mode M, 325 template< typename... > class Action, 326 template< typename... > class Control, 327 typename ParseInput > 328 static bool match( ParseInput& in, 329 const std::string& id, 330 const std::string& ) 331 { 332 if( in.size( id.size() ) >= id.size() ) { 333 if( std::memcmp( in.begin(), id.data(), id.size() ) == 0 ) { 334 in.bump( id.size() ); 335 return true; 336 } 337 } 338 return false; 339 } 340 }; 341``` 342 343The grammar is completed with another two rules for putting everything together, and an action that stores the body of the long string literal in a second state argument. 344In this case the rule `long_literal_body` is redundant, however real-world examples frequently contain a rule like `tao::pegtl::any` multiple times, and so it is necessary to give it another name in order to attach different actions to different uses of the same rule. 345 346```c++ 347 struct long_literal_body 348 : tao::pegtl::any {}; 349 350 struct grammar 351 : tao::pegtl::if_must< long_literal_open, 352 tao::pegtl::until< long_literal_close, 353 long_literal_body >, 354 tao::pegtl::eof > {}; 355 356 template<> struct action< long_literal_body > 357 { 358 template< typename ActionInput > 359 static void apply( const ActionInput& in, 360 const std::string&, 361 std::string& body ) 362 { 363 body += in.string(); 364 } 365 }; 366 367} // namespace dynamic 368``` 369 370Given the main function... 371 372```c++ 373int main( int argc, char* argv[] ) 374{ 375 if( argc > 1 ) { 376 std::string id; 377 std::string body; 378 379 tao::pegtl::argv_input in( argv, 1 ); 380 tao::pegtl::parse< dynamic::grammar, dynamic::action >( in, id, body ); 381 382 std::cout << "long literal id was: " << id << std::endl; 383 std::cout << "long literal body was: " << body << std::endl; 384 } 385 return 0; 386} 387``` 388 389...we can see the grammar in action in the shell: 390 391```sh 392$ build/src/example/pegtl/dynamic_match '[foo["[bla]"]foo]' 393long literal id was: foo 394long literal body was: "[bla]" 395 396$ build/src/example/pegtl/dynamic_match '["fraggle"["[foo["]"fraggle"]' 397long literal id was: "fraggle" 398long literal body was: "[foo[" 399``` 400 401Copyright (c) 2014-2020 Dr. Colin Hirsch and Daniel Frey 402