1<!--
2Copyright (c) 2018-2019, NVIDIA CORPORATION.  All rights reserved.
3-->
4
5The F18 Parser
6==============
7This program source code implements a parser for the Fortran programming
8language.
9
10The draft ISO standard for Fortran 2018 dated July 2017 was used as the
11primary definition of the language.  The parser also accepts many features
12from previous versions of the standard that are no longer part of the Fortran
132018 language.
14
15It also accepts many features that have never been part of any version
16of the standard Fortran language but have been supported by previous
17implementations and are known or suspected to remain in use.  As a
18general principle, we want to recognize and implement any such feature
19so long as it does not conflict with requirements of the current standard
20for Fortran.
21
22The parser is implemented in standard ISO C++ and requires the 2017
23edition of the language and library.  The parser constitutes a reentrant
24library with no mutable or constructed static data.  Best modern C++
25programming practices are observed to ensure that the ownership of
26dynamic memory is clear, that value rather than object semantics are
27defined for the data structures, that most functions are free from
28invisible side effects, and that the strictest available type checking
29is enforced by the C++ compiler when the Fortran parser is built.
30Class inheritance is rare and dynamic polymorphism is avoided in favor
31of modern discriminated unions.  To the furthest reasonable extent, the
32parser has been implemented in a declarative fashion that corresponds
33closely to the text of the Fortran language standard.
34
35The several major modules of the Fortran parser are composed into a
36top-level Parsing class, by means of which one may drive the parsing of a
37source file and receive its parse tree and error messages.  The interfaces
38of the Parsing class correspond to the two major passes of the parser,
39which are described below.
40
41Prescanning and Preprocessing
42-----------------------------
43The first pass is performed by an instance of the Prescanner class,
44with help from an instance of Preprocessor.
45
46The prescanner generates the "cooked character stream", implemented
47by a CookedSource class instance, in which:
48* line ends have been normalized to single ASCII LF characters (UNIX newlines)
49* all `INCLUDE` files have been expanded
50* all continued Fortran source lines have been unified
51* all comments and insignificant spaces have been removed
52* fixed form right margins have been clipped
53* extra blank card columns have been inserted into character literals
54  and Hollerith constants
55* preprocessing directives have been implemented
56* preprocessing macro invocations have been expanded
57* legacy `D` lines in fixed form source have been omitted or included
58* except for the payload in character literals, Hollerith constants,
59  and character and Hollerith edit descriptors, all letters have been
60  normalized to lower case
61* all original non-ASCII characters in Hollerith constants have been
62  decoded and re-encoded into UTF-8
63
64Lines in the cooked character stream can be of arbitrary length.
65
66The purpose of the cooked character stream is to enable the implementation
67of a parser whose sole concern is the recognition of the Fortran language
68from productions that closely correspond to the grammar that is presented
69in the Fortran standard, without having to deal with the complexity of
70all of the source-level concerns in the preceding list.
71
72The implementation of the preprocessor interacts with the prescanner by
73means of _token sequences_.  These are partitionings of input lines into
74contiguous virtual blocks of characters, and are the only place in this
75Fortran compiler in which we have reified a tokenization of the program
76source; the parser proper does not have a tokenizer.  The prescanner
77builds these token sequences out of source lines and supplies them
78to the preprocessor, which interprets directives and expands macro
79invocations.  The token sequences returned by the preprocessor are then
80marshaled to constitute the cooked character stream that is the output of
81the prescanner.
82
83The preprocessor and prescanner can both instantiate new temporary
84instances of the Prescanner class to locate, open, and process any
85include files.
86
87The tight interaction and mutual design of the prescanner and preprocessor
88enable a principled implementation of preprocessing for the Fortran
89language that implements a reasonable facsimile of the C language
90preprocessor that is fully aware of Fortran's source forms, line
91continuation mechanisms, case insensitivity, token syntax, &c.
92
93The preprocessor always runs.  There's no good reason for it not to.
94
95The content of the cooked character stream is available and useful
96for debugging, being as it is a simple value forwarded from the first major
97pass of the compiler to the second.
98
99Source Provenance
100-----------------
101The prescanner constructs a chronicle of every file that is read by the
102parser, viz. the original source file and all others that it directly
103or indirectly includes.  One copy of the content of each of these files
104is mapped or read into the address space of the parser.  Memory mapping
105is used initially, but files with DOS line breaks or a missing terminal
106newline are immediately normalized in a buffer when necessary.
107
108The virtual input stream, which marshals every appearance of every file
109and every expansion of every macro invocation, is not materialized as
110an actual stream of bytes.  There is, however, a mapping from each byte
111position in this virtual input stream back to whence it came (maintained
112by an instance of the AllSources class).  Offsets into this virtual input
113stream constitute values of the Provenance class.  Provenance values,
114and contiguous ranges thereof, are used to describe and delimit source
115positions for messaging.
116
117Further, every byte in the cooked character stream supplied by the
118prescanner to the parser can be inexpensively mapped to its provenance.
119Simple `const char *` pointers to characters in the cooked character
120stream, or to contiguous ranges thereof, are used as source position
121indicators within the parser and in the parse tree.
122
123Messages
124--------
125Message texts, and snprintf-like formatting strings for constructing
126messages, are instantiated in the various components of the parser with
127C++ user defined character literals tagged with `_err_en_US` and `_en_US`
128(signifying fatality and language, with the default being the dialect of
129English used in the United States) so that they may be easily identified
130for localization.  As described above, messages are associated with
131source code positions by means of provenance values.
132
133The Parse Tree
134--------------
135Each of the ca. 450 numbered requirement productions in the standard
136Fortran language grammar, as well as the productions implied by legacy
137extensions and preserved obsolescent features, maps to a distinct class
138in the parse tree so as to maximize the efficacy of static type checking
139by the C++ compiler.
140
141A transcription of the Fortran grammar appears with production requirement
142numbers in the commentary before these class definitions, so that one
143may easily refer to the standard (or to the parse tree definitions while
144reading that document).
145
146Three paradigms collectively implement most of the parse tree classes:
147* *wrappers*, in which a single data member `v` has been encapsulated
148  in a new type
149* *tuples* (or product types), in which several values of arbitrary
150  types have been encapsulated in a single data member `t` whose type
151  is an instance of `std::tuple<>`
152* *discriminated unions* (or sum types), in which one value whose type is
153  a dynamic selection from a set of distinct types is saved in a data
154  member `u` whose type is an instance of `std::variant<>`
155
156The use of these patterns is a design convenience, and exceptions to them
157are not uncommon wherever it made better sense to write custom definitions.
158
159Parse tree entities should be viewed as values, not objects; their
160addresses should not be abused for purposes of identification.  They are
161assembled with C++ move semantics during parse tree construction.
162Their default and copy constructors are deliberately deleted in their
163declarations.
164
165There is a general purpose library by means of which parse trees may
166be traversed.
167
168Parsing
169-------
170This compiler attempts to recognize the entire cooked character stream
171(see above) as a Fortran program.  It records the reductions made during
172a successful recognition as a parse tree value.  The recognized grammar
173is that of a whole source file, not just of its possible statements,
174so the parser has no global state that tracks the subprogram hierarchy
175or the structure of their nested block constructs.  The parser performs
176no semantic analysis along the way, deferring all of that work to the
177next pass of the compiler.
178
179The resulting parse tree therefore necessarily contains ambiguous parses
180that cannot be resolved without recourse to a symbol table.  Most notably,
181leading assignments to array elements can be misrecognized as statement
182function definitions, and array element references can be misrecognized
183as function calls.  The semantic analysis phase of the compiler performs
184local rewrites of the parse tree once it can be disambiguated by symbols
185and types.
186
187Formally speaking, this parser is based on recursive descent with
188localized backtracking (specifically, it will not backtrack into a
189successful reduction to try its other alternatives).  It is not generated
190as a table or code from a specification of the Fortran grammar; rather, it
191_is_ the grammar, as declaratively respecified in C++ constant expressions
192using a small collection of basic token recognition objects and a library
193of "parser combinator" template functions that compose them to form more
194complicated recognizers and their correspondences to the construction
195of parse tree values.
196
197Unparsing
198---------
199Parse trees can be converted back into free form Fortran source code.
200This formatter is not really a classical "pretty printer", but is
201more of a data structure dump whose output is suitable for compilation
202by another compiler.  It is also used for testing the parser, since a
203reparse of an unparsed parse tree should be an identity function apart from
204source provenance.
205