1[/============================================================================== 2 Copyright (C) 2001-2011 Joel de Guzman 3 Copyright (C) 2001-2011 Hartmut Kaiser 4 5 Distributed under the Boost Software License, Version 1.0. (See accompanying 6 file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 7===============================================================================/] 8 9[section Introduction] 10 11Boost Spirit is an object-oriented, recursive-descent parser and 12output generation library for C++. It allows you to write grammars and 13format descriptions using a format similar to Extended Backus Naur 14Form (EBNF)[footnote [@http://www.cl.cam.ac.uk/%7Emgk25/iso-14977.pdf 15ISO-EBNF]] directly in C++. These inline grammar 16specifications can mix freely with other C++ code and, thanks to the 17generative power of C++ templates, are immediately executable. In 18retrospect, conventional compiler-compilers or parser-generators have 19to perform an additional translation step from the source EBNF code to 20C or C++ code. 21 22The syntax and semantics of the libraries' API directly form domain-specific 23embedded languages (DSEL). In fact, Spirit exposes 3 different DSELs to the 24user: 25 26* one for creating parser grammars, 27* one for the specification of the required tokens to be used for parsing, 28* and one for the description of the required output formats. 29 30Since the target input grammars and output formats are written entirely in C++ 31we do not need any separate tools to compile, preprocess or integrate those 32into the build process. __spirit__ allows seamless integration of the parsing 33and output generation process with other C++ code. This often allows for 34simpler and more efficient code. 35 36Both the created parsers and generators are fully attributed, which allows you 37to easily build and handle hierarchical data structures in memory. These data 38structures resemble the structure of the input data and can directly be used 39to generate arbitrarily-formatted output. 40 41The [link spirit.spiritstructure figure] below depicts the overall structure 42of the Boost Spirit library. The library consists of 4 major parts: 43 44* __classic__: This is the almost-unchanged code base taken from the 45 former Boost Spirit V1.8 distribution. It has been moved into the namespace 46 boost::spirit::classic. A special compatibility layer has been added to 47 ensure complete compatibility with existing code using Spirit V1.8. 48* __qi__: This is the parser library allowing you to build recursive 49 descent parsers. The exposed domain-specific language can be used to describe 50 the grammars to implement, and the rules for storing the parsed information. 51* __lex__: This is the library usable to create tokenizers (lexers). The 52 domain-specific language exposed by __lex__ allows you to define regular 53 expressions used to match tokens (create token definitions), associate these 54 regular expressions with code to be executed whenever they are matched, and 55 to add the token definitions to the lexical analyzer. 56* __karma__: This is the generator library allowing you to create code for 57 recursive descent, data type-driven output formatting. The exposed 58 domain-specific language is almost equivalent to the parser description language 59 used in __qi__, except that it is used to describe the required output 60 format to generate from a given data structure. 61 62[fig spiritstructure.png..The overall structure of the Boost Spirit library..spirit.spiritstructure] 63 64 65The three components, __qi__, __karma__ and __lex__, are designed to be used 66either stand alone, or together. The general methodology is to use the token 67sequence generated by __lex__ as the input for a parser generated by __qi__. 68On the opposite side of the equation, the hierarchical data structures generated 69by __qi__ are used for the output generators created using __karma__. 70However, there is nothing to stop you from using any of these components all 71by themselves. 72 73The [link spirit.spiritkarmaflow figure] below shows the typical data flow of 74some input being converted to some internal representation. After some 75(optional) transformation these data are converted back into some different, 76external representation. The picture highlights Spirit's place in this data 77transformation flow. 78 79[fig spiritkarmaflow.png..The place of __qi__ and __karma__ in a data transformation flow of a typical application..spirit.spiritkarmaflow] 80 81[heading A Quick Overview of Parsing with __qi__] 82 83__qi__ is Spirit's sublibrary dealing with generating parsers based on a given 84target grammar (essentially a format description of the input data to read). 85 86A simple EBNF grammar snippet: 87 88 group ::= '(' expression ')' 89 factor ::= integer | group 90 term ::= factor (('*' factor) | ('/' factor))* 91 expression ::= term (('+' term) | ('-' term))* 92 93is approximated using facilities of Spirit's /Qi/ sublibrary as seen in this 94code snippet: 95 96 group = '(' >> expression >> ')'; 97 factor = integer | group; 98 term = factor >> *(('*' >> factor) | ('/' >> factor)); 99 expression = term >> *(('+' >> term) | ('-' >> term)); 100 101Through the magic of expression templates, this is perfectly valid and 102executable C++ code. The production rule `expression` is, in fact, an object that 103has a member function `parse` that does the work given a source code written in 104the grammar that we have just declared. Yes, it's a calculator. We shall 105simplify for now by skipping the type declarations and the definition of the 106rule `integer` invoked by `factor`. Now, the production rule `expression` in our 107grammar specification, traditionally called the `start` symbol, can recognize 108inputs such as: 109 110 12345 111 -12345 112 +12345 113 1 + 2 114 1 * 2 115 1/2 + 3/4 116 1 + 2 + 3 + 4 117 1 * 2 * 3 * 4 118 (1 + 2) * (3 + 4) 119 (-1 + 2) * (3 + -4) 120 1 + ((6 * 200) - 20) / 6 121 (1 + (2 + (3 + (4 + 5)))) 122 123Certainly we have modified the original EBNF syntax. This is done to 124conform to C++ syntax rules. Most notably we see the abundance of 125shift >> operators. Since there are no 'empty' operators in C++, it is 126simply not possible to write something like: 127 128 a b 129 130as seen in math syntax, for example, to mean multiplication or, in our case, 131as seen in EBNF syntax to mean sequencing (b should follow a). __qi__ 132uses the shift `>>` operator instead for this purpose. We take the `>>` operator, 133with arrows pointing to the right, to mean "is followed by". Thus we write: 134 135 a >> b 136 137The alternative operator `|` and the parentheses `()` remain as is. The 138assignment operator `=` is used in place of EBNF's `::=`. Last but not least, 139the Kleene star `*`, which in this case is a postfix operator in EBNF becomes a 140prefix. Instead of: 141 142 a* //... in EBNF syntax, 143 144we write: 145 146 *a //... in Spirit. 147 148since there are no postfix stars, `*`, in C/C++. Finally, we terminate each 149rule with the ubiquitous semi-colon, `;`. 150 151 152[heading A Quick Overview of Output Generation with __karma__] 153 154Spirit not only allows you to describe the structure of the input, it also enables 155the specification of the output format for your data in a similar way, and based 156on a single syntax and compatible semantics. 157 158Let's assume we need to generate a textual representation from a simple data 159structure such as a `std::vector<int>`. Conventional code probably would look like: 160 161 std::vector<int> v (initialize_and_fill()); 162 std::vector<int>::iterator end = v.end(); 163 for (std::vector<int>::iterator it = v.begin(); it != end; ++it) 164 std::cout << *it << std::endl; 165 166which is not very flexible and quite difficult to maintain when it comes to 167changing the required output format. Spirit's sublibrary /Karma/ allows you to 168specify output formats for arbitrary data structures in a very flexible way. 169The following snippet is the /Karma/ format description used to create the 170same output as the traditional code above: 171 172 *(int_ << eol) 173 174Here are some more examples of format descriptions for different output 175representations of the same `std::vector<int>`: 176 177[table Different output formats for `std::vector<int>` 178 [ [Format] [Example] [Description] ] 179 [ [`'[' << *(int_ << ',') << ']'`] [`[1,8,10,]`] [Comma separated list of integers] ] 180 [ [`*('(' << int_ << ')' << ',')`] [`(1),(8),(10),`] [Comma separated list of integers in parenthesis] ] 181 [ [`*hex`] [`18a`] [A list of hexadecimal numbers] ] 182 [ [`*(double_ << ',')`] [`1.0,8.0,10.0,`] [A list of floating point numbers] ] 183] 184 185We will see later in this documentation how it is possible to avoid printing 186the trailing `','`. 187 188Overall, the syntax is similar to __qi__ with the exception that we use the `<<` 189operator for output concatenation. This should be easy to understand as it 190follows the conventions used in the Standard's I/O streams. 191 192Another important feature of __karma__ allows you to fully decouple the data 193type from the output format. You can use the same output format with different 194data types as long as these conform conceptually. The next table gives some 195related examples. 196 197[table Different data types usable with the output format `*(int_ << eol)` 198 [ [Data type] [Description] ] 199 [ [`int i[4]`] [C style arrays] ] 200 [ [`std::vector<int>`] [Standard vector] ] 201 [ [`std::list<int>`] [Standard list] ] 202 [ [`boost::array<long, 20>`] [Boost array] ] 203] 204 205[endsect] 206