1<!--===- documentation/Parsing.md 2 3 Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. 4 See https://llvm.org/LICENSE.txt for license information. 5 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception 6 7--> 8 9The F18 Parser 10============== 11This program source code implements a parser for the Fortran programming 12language. 13 14The draft ISO standard for Fortran 2018 dated July 2017 was used as the 15primary definition of the language. The parser also accepts many features 16from previous versions of the standard that are no longer part of the Fortran 172018 language. 18 19It also accepts many features that have never been part of any version 20of the standard Fortran language but have been supported by previous 21implementations and are known or suspected to remain in use. As a 22general principle, we want to recognize and implement any such feature 23so long as it does not conflict with requirements of the current standard 24for Fortran. 25 26The parser is implemented in standard ISO C++ and requires the 2017 27edition of the language and library. The parser constitutes a reentrant 28library with no mutable or constructed static data. Best modern C++ 29programming practices are observed to ensure that the ownership of 30dynamic memory is clear, that value rather than object semantics are 31defined for the data structures, that most functions are free from 32invisible side effects, and that the strictest available type checking 33is enforced by the C++ compiler when the Fortran parser is built. 34Class inheritance is rare and dynamic polymorphism is avoided in favor 35of modern discriminated unions. To the furthest reasonable extent, the 36parser has been implemented in a declarative fashion that corresponds 37closely to the text of the Fortran language standard. 38 39The several major modules of the Fortran parser are composed into a 40top-level Parsing class, by means of which one may drive the parsing of a 41source file and receive its parse tree and error messages. The interfaces 42of the Parsing class correspond to the two major passes of the parser, 43which are described below. 44 45Prescanning and Preprocessing 46----------------------------- 47The first pass is performed by an instance of the Prescanner class, 48with help from an instance of Preprocessor. 49 50The prescanner generates the "cooked character stream", implemented 51by a CookedSource class instance, in which: 52* line ends have been normalized to single ASCII LF characters (UNIX newlines) 53* all `INCLUDE` files have been expanded 54* all continued Fortran source lines have been unified 55* all comments and insignificant spaces have been removed 56* fixed form right margins have been clipped 57* extra blank card columns have been inserted into character literals 58 and Hollerith constants 59* preprocessing directives have been implemented 60* preprocessing macro invocations have been expanded 61* legacy `D` lines in fixed form source have been omitted or included 62* except for the payload in character literals, Hollerith constants, 63 and character and Hollerith edit descriptors, all letters have been 64 normalized to lower case 65* all original non-ASCII characters in Hollerith constants have been 66 decoded and re-encoded into UTF-8 67 68Lines in the cooked character stream can be of arbitrary length. 69 70The purpose of the cooked character stream is to enable the implementation 71of a parser whose sole concern is the recognition of the Fortran language 72from productions that closely correspond to the grammar that is presented 73in the Fortran standard, without having to deal with the complexity of 74all of the source-level concerns in the preceding list. 75 76The implementation of the preprocessor interacts with the prescanner by 77means of _token sequences_. These are partitionings of input lines into 78contiguous virtual blocks of characters, and are the only place in this 79Fortran compiler in which we have reified a tokenization of the program 80source; the parser proper does not have a tokenizer. The prescanner 81builds these token sequences out of source lines and supplies them 82to the preprocessor, which interprets directives and expands macro 83invocations. The token sequences returned by the preprocessor are then 84marshaled to constitute the cooked character stream that is the output of 85the prescanner. 86 87The preprocessor and prescanner can both instantiate new temporary 88instances of the Prescanner class to locate, open, and process any 89include files. 90 91The tight interaction and mutual design of the prescanner and preprocessor 92enable a principled implementation of preprocessing for the Fortran 93language that implements a reasonable facsimile of the C language 94preprocessor that is fully aware of Fortran's source forms, line 95continuation mechanisms, case insensitivity, token syntax, &c. 96 97The preprocessor always runs. There's no good reason for it not to. 98 99The content of the cooked character stream is available and useful 100for debugging, being as it is a simple value forwarded from the first major 101pass of the compiler to the second. 102 103Source Provenance 104----------------- 105The prescanner constructs a chronicle of every file that is read by the 106parser, viz. the original source file and all others that it directly 107or indirectly includes. One copy of the content of each of these files 108is mapped or read into the address space of the parser. Memory mapping 109is used initially, but files with DOS line breaks or a missing terminal 110newline are immediately normalized in a buffer when necessary. 111 112The virtual input stream, which marshals every appearance of every file 113and every expansion of every macro invocation, is not materialized as 114an actual stream of bytes. There is, however, a mapping from each byte 115position in this virtual input stream back to whence it came (maintained 116by an instance of the AllSources class). Offsets into this virtual input 117stream constitute values of the Provenance class. Provenance values, 118and contiguous ranges thereof, are used to describe and delimit source 119positions for messaging. 120 121Further, every byte in the cooked character stream supplied by the 122prescanner to the parser can be inexpensively mapped to its provenance. 123Simple `const char *` pointers to characters in the cooked character 124stream, or to contiguous ranges thereof, are used as source position 125indicators within the parser and in the parse tree. 126 127Messages 128-------- 129Message texts, and snprintf-like formatting strings for constructing 130messages, are instantiated in the various components of the parser with 131C++ user defined character literals tagged with `_err_en_US` and `_en_US` 132(signifying fatality and language, with the default being the dialect of 133English used in the United States) so that they may be easily identified 134for localization. As described above, messages are associated with 135source code positions by means of provenance values. 136 137The Parse Tree 138-------------- 139Each of the ca. 450 numbered requirement productions in the standard 140Fortran language grammar, as well as the productions implied by legacy 141extensions and preserved obsolescent features, maps to a distinct class 142in the parse tree so as to maximize the efficacy of static type checking 143by the C++ compiler. 144 145A transcription of the Fortran grammar appears with production requirement 146numbers in the commentary before these class definitions, so that one 147may easily refer to the standard (or to the parse tree definitions while 148reading that document). 149 150Three paradigms collectively implement most of the parse tree classes: 151* *wrappers*, in which a single data member `v` has been encapsulated 152 in a new type 153* *tuples* (or product types), in which several values of arbitrary 154 types have been encapsulated in a single data member `t` whose type 155 is an instance of `std::tuple<>` 156* *discriminated unions* (or sum types), in which one value whose type is 157 a dynamic selection from a set of distinct types is saved in a data 158 member `u` whose type is an instance of `std::variant<>` 159 160The use of these patterns is a design convenience, and exceptions to them 161are not uncommon wherever it made better sense to write custom definitions. 162 163Parse tree entities should be viewed as values, not objects; their 164addresses should not be abused for purposes of identification. They are 165assembled with C++ move semantics during parse tree construction. 166Their default and copy constructors are deliberately deleted in their 167declarations. 168 169The std::list<> data type is used in the parse tree to reliably store pointers 170to other relevant entries in the tree. Since the tree lists are moved and 171spliced at certain points std::list<> provides the necessary guarantee of the 172stability of pointers into these lists. 173 174There is a general purpose library by means of which parse trees may 175be traversed. 176 177Parsing 178------- 179This compiler attempts to recognize the entire cooked character stream 180(see above) as a Fortran program. It records the reductions made during 181a successful recognition as a parse tree value. The recognized grammar 182is that of a whole source file, not just of its possible statements, 183so the parser has no global state that tracks the subprogram hierarchy 184or the structure of their nested block constructs. The parser performs 185no semantic analysis along the way, deferring all of that work to the 186next pass of the compiler. 187 188The resulting parse tree therefore necessarily contains ambiguous parses 189that cannot be resolved without recourse to a symbol table. Most notably, 190leading assignments to array elements can be misrecognized as statement 191function definitions, and array element references can be misrecognized 192as function calls. The semantic analysis phase of the compiler performs 193local rewrites of the parse tree once it can be disambiguated by symbols 194and types. 195 196Formally speaking, this parser is based on recursive descent with 197localized backtracking (specifically, it will not backtrack into a 198successful reduction to try its other alternatives). It is not generated 199as a table or code from a specification of the Fortran grammar; rather, it 200_is_ the grammar, as declaratively respecified in C++ constant expressions 201using a small collection of basic token recognition objects and a library 202of "parser combinator" template functions that compose them to form more 203complicated recognizers and their correspondences to the construction 204of parse tree values. 205 206Unparsing 207--------- 208Parse trees can be converted back into free form Fortran source code. 209This formatter is not really a classical "pretty printer", but is 210more of a data structure dump whose output is suitable for compilation 211by another compiler. It is also used for testing the parser, since a 212reparse of an unparsed parse tree should be an identity function apart from 213source provenance. 214