This is cppinternals.info, produced by makeinfo version 6.5 from
cppinternals.texi.

INFO-DIR-SECTION Software development
START-INFO-DIR-ENTRY
* Cpplib: (cppinternals).      Cpplib internals.
END-INFO-DIR-ENTRY

This file documents the internals of the GNU C Preprocessor.

   Copyright (C) 2000-2020 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the entire resulting derived work is distributed under the terms of
a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions.


File: cppinternals.info, Node: Top, Next: Conventions, Up: (dir)

The GNU C Preprocessor Internals
********************************

* Menu:

* Conventions::
* Lexer::
* Hash Nodes::
* Macro Expansion::
* Token Spacing::
* Line Numbering::
* Guard Macros::
* Files::
* Concept Index::

1 Cpplib--the GNU C Preprocessor
********************************

The GNU C preprocessor is implemented as a library, "cpplib", so it can
be easily shared between a stand-alone preprocessor, and a preprocessor
integrated with the C, C++ and Objective-C front ends. It is also
available for use by other programs, though this is not recommended as
its exposed interface has not yet reached a point of reasonable
stability.

   The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.
It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

   This brief manual documents the internals of cpplib, and explains
some of the tricky issues. It is intended that, along with the comments
in the source code, a reasonably competent C programmer should be able
to figure out what the code is doing, and why things have been
implemented the way they have.

* Menu:

* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.


File: cppinternals.info, Node: Conventions, Next: Lexer, Prev: Top, Up: Top

Conventions
***********

cpplib has two interfaces--one is exposed internally only, and the other
is for both internal and external use.

   The convention is that functions and types that are exposed to
multiple files internally are prefixed with '_cpp_', and are to be found
in the file 'internal.h'. Functions and types exposed to external
clients are in 'cpplib.h', and prefixed with 'cpp_'. For historical
reasons this is no longer quite true, but we should strive to stick to
it.

   We are striving to reduce the information exposed in 'cpplib.h' to
the bare minimum necessary, and then to keep it there. This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.


File: cppinternals.info, Node: Lexer, Next: Hash Nodes, Prev: Conventions, Up: Top

The Lexer
*********

Overview
========

The lexer is contained in the file 'lex.c'. It is a hand-coded lexer,
and not implemented as a state machine. It can understand C, C++ and
Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.

   It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, 'cpp_get_token', takes care of
lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
'cpp_spell_token' and 'cpp_token_len'. These functions are useful when
generating diagnostics, and for emitting the preprocessed output.

Lexing a token
==============

Lexing of an individual token is handled by '_cpp_lex_direct' and its
subroutines. In its current form the code is quite complicated, with
read ahead characters and such-like, since it strives to not step back
in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

   The job of '_cpp_lex_direct' is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.
It necessarily has a minor role to play in memory management
of lexed lines. I discuss these issues in a separate section (*note
Lexing a line::).

   The lexer places the token it lexes into storage pointed to by the
variable 'cur_token', and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
'line' and 'col' values of the token just before the location that
'cur_token' points to, and use that location to report the diagnostic.

   The lexer does not consider whitespace to be a token in its own
right. If whitespace (other than a new line) precedes a token, it sets
the 'PREV_WHITE' bit in the token's flags. Each token has its 'line'
and 'col' variables set to the line and column of the first character of
the token. This line number is the line number in the translation unit,
and can be converted to a source (file, line) pair using the line map
code.

   The first token on a logical, i.e. unescaped, line has the flag 'BOL'
set for beginning-of-line. This flag is intended for internal use, both
to distinguish a '#' that begins a directive from one that doesn't, and
to generate a call-back to clients that want to be notified about the
start of every non-directive line with tokens on it. Clients cannot
reliably determine this for themselves: the first token might be a
macro, and the tokens of a macro expansion do not have the 'BOL' flag
set. The macro expansion may even be empty, and the next token on the
line certainly won't have the 'BOL' flag set.

   New lines are treated specially; exactly how the lexer handles them
is context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.
Therefore, if the state variable
'in_directive' is set, the lexer returns a 'CPP_EOF' token, which is
normally used to indicate end-of-file, to indicate end-of-directive. In
a directive a 'CPP_EOF' token never means end-of-file. Conveniently, if
the caller was 'collect_args', it already handles 'CPP_EOF' as if it
were end-of-file, and reports an error about an unterminated macro
argument list.

   The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringized. The state variable
'parsing_args' is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
'PREV_WHITE' flag of a token if it meets a new line when 'parsing_args'
is set to 2. It doesn't set it if it meets a new line when
'parsing_args' is 1, since then code like

     #define foo() bar
     foo
     baz

would be output with an erroneous space before 'baz':

     foo
      baz

   This is a good example of the subtlety of getting token spacing
correct in the preprocessor; there are plenty of tests in the testsuite
for corner cases like this.

   The lexer is written to treat each of '\r', '\n', '\r\n' and '\n\r'
as a single new line indicator. This allows it to transparently
preprocess MS-DOS, Macintosh and Unix files without their needing to
pass through a special filter beforehand.

   We also decided to treat a backslash, either '\' or the trigraph
'??/', separated from one of the above newline indicators by non-comment
whitespace only, as intending to escape the newline.
It tends to be a
typing mistake, and cannot reasonably be mistaken for anything else in
any of the C-family grammars. Since handling it this way is not
strictly conforming to the ISO standard, the library issues a warning
wherever it encounters it.

   Handling newlines like this is made simpler by doing it in one place
only. The function 'handle_newline' takes care of all newline
characters, and 'skip_escaped_newlines' takes care of arbitrarily long
sequences of escaped newlines, deferring to 'handle_newline' to handle
the newlines themselves.

   The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and
unfortunately there is a trigraph representation for a backslash, so it
is possible for the trigraph '??/' to introduce an escaped newline.

   Escaped newlines are tedious because theoretically they can occur
anywhere--between the '+' and '=' of the '+=' token, within the
characters of an identifier, and even between the '*' and '/' that
terminates a comment. Moreover, you cannot be sure there is just
one--there might be an arbitrarily long sequence of them.

   So, for example, the routine that lexes a number, 'parse_number',
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the '\' introducing
an escaped newline, or the '?' introducing the trigraph sequence that
represents the '\' of an escaped newline. If it encounters a '?' or
'\', it calls 'skip_escaped_newlines' to skip over any potential escaped
newlines before checking whether the number has been finished.
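
   The skipping described above can be illustrated with a minimal
sketch. This is not the code in 'lex.c': the names
'skip_escaped_newlines_sketch', 'newline_len' and 'backslash_len' are
hypothetical, the input is assumed to be a NUL-terminated string, and
the whitespace-before-newline extension and its warning are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the newline indicator at P, which may be
   "\n", "\r", "\r\n" or "\n\r"; 0 if P is not at a newline. */
static size_t newline_len(const char *p)
{
    if (p[0] != '\n' && p[0] != '\r')
        return 0;
    /* A two-character indicator pairs the two *different* characters. */
    if ((p[1] == '\n' || p[1] == '\r') && p[1] != p[0])
        return 2;
    return 1;
}

/* Hypothetical helper: length of a backslash at P, spelt either as a plain
   backslash (1) or as the question-question-slash trigraph (3); 0 if none. */
static size_t backslash_len(const char *p)
{
    if (p[0] == '\\')
        return 1;
    if (p[0] == '?' && p[1] == '?' && p[2] == '/')
        return 3;
    return 0;
}

/* Skip an arbitrarily long run of escaped newlines starting at P and
   return a pointer to the first character that is not part of one. */
const char *skip_escaped_newlines_sketch(const char *p)
{
    assert(p != NULL);
    for (;;) {
        size_t b = backslash_len(p);
        size_t n = b ? newline_len(p + b) : 0;
        if (n == 0)
            return p;          /* not an escaped newline: stop here */
        p += b + n;            /* step over backslash and newline */
    }
}
```

   A number lexer built on this would call the helper whenever it meets
a '?' or '\' and only then decide whether the number has ended. In test
strings the trigraph is best written as "?\?/" so that the compiler does
not translate it itself.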

   Similarly, code in the main body of '_cpp_lex_direct' cannot simply
check for a '=' after a '+' character to determine whether it has a '+='
token; it needs to be prepared for an escaped newline of some sort.
Such cases use the function 'get_effective_char', which returns the
first character after any intervening escaped newlines.

   The lexer needs to keep track of the correct column position,
including counting tabs as specified by the '-ftabstop=' option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

   Some identifiers, such as '__VA_ARGS__' and poisoned identifiers, may
be invalid and require a diagnostic. However, if they appear in a macro
expansion we don't want to complain with each use of the macro. It is
therefore best to catch them during the lexing stage, in
'parse_identifier'. In both cases, whether a diagnostic is needed or
not is dependent upon the lexer's state. For example, we don't want to
issue a diagnostic for re-poisoning a poisoned identifier, or for using
'__VA_ARGS__' in the expansion of a variable-argument macro. Therefore
'parse_identifier' makes use of state flags to determine whether a
diagnostic is appropriate. Since we change state on a per-token basis,
and don't lex whole lines at a time, this is not a problem.

   Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a '<' would be lexed as a single token.
After a '#include' directive, though, everything from the '<' as far as
the nearest '>' character should be lexed as a single token. Note that
we don't allow the terminators of header names to be escaped; the first
'"' or '>' terminates the header name.
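
   The header-name rule can be sketched as follows. This is only an
illustration under stated assumptions, not cpplib's routine: the
function name 'header_name_len_sketch' is hypothetical, the caller is
assumed to have already decided (from its state flags) that a header
name is expected, and the input is a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: P points at the opening '<' or '"' of a header
   name.  Return the length of the header-name token including both
   delimiters, or 0 if P does not start a header name or the name is not
   terminated before the end of the line.  No escapes are honoured, so
   the first matching terminator always ends the name. */
size_t header_name_len_sketch(const char *p)
{
    assert(p != NULL);

    char close;
    if (*p == '<')
        close = '>';
    else if (*p == '"')
        close = '"';
    else
        return 0;

    for (size_t i = 1; p[i] != '\0' && p[i] != '\n'; i++)
        if (p[i] == close)
            return i + 1;      /* include the terminator itself */

    return 0;                  /* unterminated header name */
}
```

   Note how a backslash before a '"' has no effect: in the input
"a\"b.h" (with surrounding quotes) the name ends at the second quote
character.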

   Interpretation of some character sequences depends upon whether we
are lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, '::' is a single token in C++, but in C it is two
separate ':' tokens and almost certainly a syntax error. Such cases are
handled by '_cpp_lex_direct' based upon command-line flags stored in the
'cpp_options' structure.

   Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with '_cpp_pop_buffer'. The
storage holding the spellings of such tokens remains until the client
program calls 'cpp_destroy', probably at the end of the translation unit.

Lexing a line
=============

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

   Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
'#pragma' directive and not recognizing it as a registered pragma, it
wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full '#pragma' token stream. The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.
The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

   The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a "token run", and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do _not_ lex a line of tokens at
once; if we did that 'parse_identifier' would not have state flags
available to warn about invalid identifiers (*note Invalid
identifiers::).

   In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the '#pragma' handler can jump around in the
directive's tokens if necessary.

   Two issues remain: what about tokens that arise from macro
expansions, and what happens when we have a long line that overflows the
token run?

   Since we promise clients that we preserve the validity of pointers
that we have already returned for tokens that appeared earlier in the
line, we cannot reallocate the run. Instead, on overflow it is expanded
by chaining a new token run on to the end of the existing one.

   The tokens forming a macro's replacement list are collected by the
'#define' handler, and placed in storage that is only freed by
'cpp_destroy'. So if a macro is expanded in the line of tokens, the
pointers to the tokens of its expansion that are returned will always
remain valid.
However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like '__LINE__', and the '#' and '##' operators for stringizing
and token pasting. I handled this by allocating space for these tokens
from the lexer's token run chain. This means they automatically receive
the same lifetime guarantees as lexed tokens, and we don't need to
concern ourselves with freeing them.

   Lexing into a line of tokens solves some of the token memory
management issues, but not all. The opening parenthesis after a
function-like macro name might lie on a different line, and the front
ends definitely want the ability to look ahead past the end of the
current line. So cpplib only moves back to the start of the token run
at the end of a line if the variable 'keep_tokens' is zero.
Line-buffering is quite natural for the preprocessor, and as a result
the only time cpplib needs to increment this variable is whilst looking
for the opening parenthesis to, and reading the arguments of, a
function-like macro. In the near future cpplib will export an interface
to increment and decrement this variable, so that clients can share full
control over the lifetime of token pointers too.

   The routine '_cpp_lex_token' handles moving to new token runs,
calling '_cpp_lex_direct' to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the 'BOL' flag, which might indicate a directive
that needs to be handled, or require a start-of-line call-back to be
made. '_cpp_lex_token' also handles skipping over tokens in failed
conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.
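
   The token run scheme of this section can be sketched in a few lines.
The sketch below is an assumption-laden illustration, not cpplib's data
structures: 'struct token', 'struct token_run', 'struct lexer',
'RUN_SIZE' and the three functions are hypothetical names, and only the
variable name 'keep_tokens' is taken from the text above.

```c
#include <assert.h>
#include <stdlib.h>

enum { RUN_SIZE = 4 };          /* deliberately tiny to force overflow */

struct token { int value; };    /* stand-in for a real token */

struct token_run {
    struct token_run *next;     /* run chained on after an overflow */
    struct token base[RUN_SIZE];
};

struct lexer {
    struct token_run first;     /* start of the run chain */
    struct token_run *cur_run;  /* run currently being filled */
    size_t used;                /* slots used in cur_run */
    int keep_tokens;            /* nonzero: do not rewind at end of line */
};

void lexer_init(struct lexer *lx)
{
    lx->first.next = NULL;
    lx->cur_run = &lx->first;
    lx->used = 0;
    lx->keep_tokens = 0;
}

/* Return storage for the next token.  On overflow a new run is chained
   on (or an old one reused); existing runs are never reallocated, so
   pointers handed out earlier on the line stay valid. */
struct token *lexer_next_slot(struct lexer *lx)
{
    if (lx->used == RUN_SIZE) {
        if (lx->cur_run->next == NULL) {
            struct token_run *run = malloc(sizeof *run);
            assert(run != NULL);
            run->next = NULL;
            lx->cur_run->next = run;
        }
        lx->cur_run = lx->cur_run->next;
        lx->used = 0;
    }
    return &lx->cur_run->base[lx->used++];
}

/* At an unescaped newline, start overwriting from the beginning of the
   chain unless someone has asked for tokens to be kept. */
void lexer_end_of_line(struct lexer *lx)
{
    if (lx->keep_tokens == 0) {
        lx->cur_run = &lx->first;
        lx->used = 0;
    }
}
```

   The design choice worth noticing is that overflow never moves
existing tokens; growing a single array with 'realloc' would be simpler
but could invalidate every pointer already returned for the line.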


File: cppinternals.info, Node: Hash Nodes, Next: Macro Expansion, Prev: Lexer, Up: Top

Hash Nodes
**********

When cpplib encounters an "identifier", it generates a hash code for it
and stores it in the hash table. By "identifier" we mean tokens with
type 'CPP_NAME'; this includes identifiers in the usual C sense, as well
as keywords, directive names, macro names and so on. For example, all
of 'pragma', 'int', 'foo' and '__GNUC__' are identifiers and hashed when
lexed.

   Each node in the hash table contains various information about the
identifier it represents, such as its length and type. At any one
time, each identifier falls into exactly one of three categories:

   * Macros

     These have been declared to be macros, either on the command line
     or with '#define'. A few, such as '__TIME__', are built-ins entered
     in the hash table during initialization. The hash node for a
     normal macro points to a structure with more information about the
     macro, such as whether it is function-like, how many arguments it
     takes, and its expansion. Built-in macros are flagged as special,
     and instead contain an enum indicating which of the various
     built-in macros it is.

   * Assertions

     Assertions are in a separate namespace to macros. To enforce this,
     cpp actually prepends a '#' character before hashing and entering
     it in the hash table. An assertion's node points to a chain of
     answers to that assertion.

   * Void

     Everything else falls into this category--an identifier that is not
     currently a macro, or a macro that has since been undefined with
     '#undef'.

     When preprocessing C++, this category also includes the named
     operators, such as 'xor'. In expressions these behave like the
     operators they represent, but in contexts where the spelling of a
     token matters they are spelt differently.
     This spelling
     distinction is relevant when they are operands of the stringizing
     and pasting macro operators '#' and '##'. Named operator hash
     nodes are flagged, both to catch the spelling distinction and to
     prevent them from being defined as macros.

   The same identifiers share the same hash node. Since each identifier
token, after lexing, contains a pointer to its hash node, this is used
to provide rapid lookup of various information. For example, when
parsing a '#define' statement, CPP flags each argument's identifier hash
node with the index of that argument. This makes duplicated argument
checking an O(1) operation for each argument. Similarly, for each
identifier in the macro's expansion, lookup to see if it is an argument,
and which argument it is, is also an O(1) operation. Further, each
directive name, such as 'endif', has an associated directive enum stored
in its hash node, so that directive lookup is also O(1).


File: cppinternals.info, Node: Macro Expansion, Next: Token Spacing, Prev: Hash Nodes, Up: Top

Macro Expansion Algorithm
*************************

Macro expansion is a tricky operation, fraught with nasty corner cases
and situations that render what you thought was a nifty way to optimize
the preprocessor's expansion algorithm wrong in quite subtle ways.

   I strongly recommend you have a good grasp of how the C and C++
standards require macros to be expanded before diving into this section,
let alone the code! If you don't have a clear mental picture of how
things like nested macro expansion, stringizing and token pasting are
supposed to work, damage to your sanity can quickly result.

Internal representation of macros
=================================

The preprocessor stores macro expansions in tokenized form.
This saves
repeated lexing passes during expansion, at the cost of a small increase
in memory consumption on average. The tokens are stored contiguously in
memory, so a pointer to the first one and a token count is all you need
to get the replacement list of a macro.

   If the macro is a function-like macro the preprocessor also stores
its parameters, in the form of an ordered list of pointers to the hash
table entry of each parameter's identifier. Further, in the macro's
stored expansion each occurrence of a parameter is replaced with a
special token of type 'CPP_MACRO_ARG'. Each such token holds the index
of the parameter it represents in the parameter list, which allows rapid
replacement of parameters with their arguments during expansion.
Despite this optimization it is still necessary to store the original
parameters to the macro, both for dumping with e.g., '-dD', and to warn
about non-trivial macro redefinitions when the parameter names have
changed.

Macro expansion overview
========================

The preprocessor maintains a "context stack", implemented as a linked
list of 'cpp_context' structures, which together represent the macro
expansion state at any one time. The 'struct cpp_reader' member
variable 'context' points to the current top of this stack. The top
normally holds the unexpanded replacement list of the innermost macro
under expansion, except when cpplib is about to pre-expand an argument,
in which case it holds that argument's unexpanded tokens.

   When there are no macros under expansion, cpplib is in "base
context". All contexts other than the base context contain a contiguous
list of tokens delimited by a starting and ending token. When not in
base context, cpplib obtains the next token from the list of the top
context.
If there are no tokens left in the list, it pops that context
off the stack, and subsequent ones if necessary, until an unexhausted
context is found or it returns to base context. In base context, cpplib
reads tokens directly from the lexer.

   If it encounters an identifier that is both a macro and enabled for
expansion, cpplib prepares to push a new context for that macro on the
stack by calling the routine 'enter_macro_context'. When this routine
returns, the new context will contain the unexpanded tokens of the
replacement list of that macro. In the case of function-like macros,
'enter_macro_context' also replaces any parameters in the replacement
list, stored as 'CPP_MACRO_ARG' tokens, with the appropriate macro
argument. If the standard requires that the parameter be replaced with
its expanded argument, the argument will have been fully macro expanded
first.

   'enter_macro_context' also handles special macros like '__LINE__'.
Although these macros expand to a single token which cannot contain any
further macros, for reasons of token spacing (*note Token Spacing::) and
simplicity of implementation, cpplib handles these special macros by
pushing a context containing just that one token.

   The final thing that 'enter_macro_context' does before returning is
to mark the macro disabled for expansion (except for special macros like
'__TIME__'). The macro is re-enabled when its context is later popped
from the context stack, as described above. This strict ordering
ensures that a macro is disabled whilst its expansion is being scanned,
but that it is _not_ disabled whilst any arguments to it are being
expanded.
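
   The push, read and pop cycle described in this overview can be
sketched as below. Everything here is a simplified assumption rather
than cpplib's code: tokens are plain 'int' values, the base context is
represented by a null pointer instead of a real context, and the names
'struct macro', 'struct context', 'struct reader', 'push_context' and
'next_context_token' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct macro {
    const char *name;
    int disabled;               /* nonzero while its expansion is scanned */
};

struct context {
    struct context *prev;       /* next context down; NULL means base */
    const int *cur, *end;       /* unread tokens are [cur, end) */
    struct macro *macro;        /* macro whose replacement list this is */
};

struct reader {
    struct context *context;    /* top of stack; NULL in base context */
};

/* Push the replacement list TOKS of macro M and disable M, as the last
   step of entering a macro context. */
void push_context(struct reader *r, struct context *c, struct macro *m,
                  const int *toks, size_t n)
{
    assert(r != NULL && c != NULL && m != NULL);
    c->prev = r->context;
    c->cur = toks;
    c->end = toks + n;
    c->macro = m;
    m->disabled = 1;
    r->context = c;
}

/* Fetch the next token from the context stack.  Exhausted contexts are
   popped, and their macros re-enabled, only when a further token is
   requested.  Returns 1 and stores the token in *OUT, or 0 when back in
   base context, where the caller would go to the lexer instead. */
int next_context_token(struct reader *r, int *out)
{
    assert(r != NULL && out != NULL);
    while (r->context != NULL) {
        struct context *c = r->context;
        if (c->cur < c->end) {
            *out = *c->cur++;
            return 1;
        }
        c->macro->disabled = 0; /* popping re-enables the macro */
        r->context = c->prev;
    }
    return 0;
}
```

   With a two-token replacement list, the macro is still disabled after
the second token has been returned; it becomes enabled again only on the
third call, which pops the context and reports base context.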

Scanning the replacement list for macros to expand
==================================================

The C standard states that, after any parameters have been replaced with
their possibly-expanded arguments, the replacement list is scanned for
nested macros. Further, any identifiers in the replacement list that
are not expanded during this scan are never again eligible for expansion
in the future, if the reason they were not expanded is that the macro in
question was disabled.

   Clearly this latter condition can only apply to tokens resulting from
argument pre-expansion. Other tokens never have an opportunity to be
re-tested for expansion. It is possible for identifiers that are
function-like macros to not expand initially but to expand during a
later scan. This occurs when the identifier is the last token of an
argument (and therefore originally followed by a comma or a closing
parenthesis in its macro's argument list), and when it replaces its
parameter in the macro's replacement list, the subsequent token happens
to be an opening parenthesis (itself possibly the first token of an
argument).

   It is important to note that when cpplib reads the last token of a
given context, that context still remains on the stack. Only when
looking for the _next_ token do we pop it off the stack and drop to a
lower context. This makes backing up by one token easy, but more
importantly ensures that the macro corresponding to the current context
is still disabled when we are considering the last token of its
replacement list for expansion (or indeed expanding it). As an example,
which illustrates many of the points above, consider

     #define foo(x) bar x
     foo(foo) (2)

which fully expands to 'bar foo (2)'.
During pre-expansion of the
argument, 'foo' does not expand even though the macro is enabled, since
it has no following parenthesis [pre-expansion of an argument only uses
tokens from that argument; it cannot take tokens from whatever follows
the macro invocation]. This still leaves the argument token 'foo'
eligible for future expansion. Then, when re-scanning after argument
replacement, the token 'foo' is rejected for expansion, and marked
ineligible for future expansion, since the macro is now disabled. It is
disabled because the replacement list 'bar foo' of the macro is still on
the context stack.

   If instead the algorithm looked for an opening parenthesis first and
then tested whether the macro were disabled, it would be subtly wrong.
In the example above, the replacement list of 'foo' would be popped in
the process of finding the parenthesis, re-enabling 'foo' and expanding
it a second time.

Looking for a function-like macro's opening parenthesis
=======================================================

Function-like macros only expand when immediately followed by a
parenthesis. To do this cpplib needs to temporarily disable macros and
read the next token. Unfortunately, because of spacing issues (*note
Token Spacing::), there can be fake padding tokens in-between, and if
the next real token is not a parenthesis cpplib needs to be able to back
up that one token as well as retain the information in any intervening
padding tokens.

   Backing up more than one token when macros are involved is not
permitted by cpplib, because in general it might involve issues like
restoring popped contexts onto the context stack, which are too hard.
Instead, searching for the parenthesis is handled by a special function,
'funlike_invocation_p', which remembers padding information as it reads
tokens.
If the next real token is not an opening parenthesis, it backs
up that one token, and then pushes an extra context just containing the
padding information if necessary.

Marking tokens ineligible for future expansion
==============================================

As discussed above, cpplib needs a way of marking tokens as
unexpandable. Since the tokens cpplib handles are read-only once they
have been lexed, it instead makes a copy of the token and adds the flag
'NO_EXPAND' to the copy.

   For efficiency and to simplify memory management by avoiding having
to remember to free these tokens, they are allocated as temporary tokens
from the lexer's current token run (*note Lexing a line::) using the
function '_cpp_temp_token'. The tokens are then re-used once the
current line of tokens has been read in.

   This might sound unsafe. However, token runs are not re-used at the
end of a line if it happens to be in the middle of a macro argument
list, and cpplib only wants to back up more than one lexer token in
situations where no macro expansion is involved, so the optimization is
safe.


File: cppinternals.info, Node: Token Spacing, Next: Line Numbering, Prev: Macro Expansion, Up: Top

Token Spacing
*************

First, consider an issue that only concerns the stand-alone
preprocessor: there needs to be a guarantee that re-reading its
preprocessed output results in an identical token stream. Without
taking special measures, this might not be the case because of macro
substitution. For example:

     #define PLUS +
     #define EMPTY
     #define f(x) =x=
     +PLUS -EMPTY- PLUS+ f(=)
             ==> + + - - + + = = =
     _not_
             ==> ++ -- ++ ===

   One solution would be to simply insert a space between all adjacent
tokens.
However, we would like to keep space insertion to a minimum,
both for aesthetic reasons and because it causes problems for people who
still try to abuse the preprocessor for things like Fortran source and
Makefiles.

   For now, just notice that when tokens are added to (or removed from,
as shown by the 'EMPTY' example) the original lexed token stream, we
need to check for accidental token pasting. We call this "paste
avoidance". Token addition and removal can only occur because of macro
expansion, but accidental pasting can occur in many places: both before
and after each macro replacement, each argument replacement, and
additionally each token created by the '#' and '##' operators.

   Look at how the preprocessor gets whitespace output correct normally.
The 'cpp_token' structure contains a flags byte, and one of those flags
is 'PREV_WHITE'. This is flagged by the lexer, and indicates that the
token was preceded by whitespace of some form other than a new line.
The stand-alone preprocessor can use this flag to decide whether to
insert a space between tokens in the output.

   Now consider the result of the following macro expansion:

     #define add(x, y, z) x + y +z;
     sum = add (1,2, 3);
         ==> sum = 1 + 2 +3;

   The interesting thing here is that the tokens '1' and '2' are output
with a preceding space, and '3' is output without a preceding space, but
when lexed none of these tokens had that property. Careful
consideration reveals that '1' gets its preceding whitespace from the
space preceding 'add' in the macro invocation, _not_ from the
replacement list. '2' gets its whitespace from the space preceding the
parameter 'y' in the macro replacement list, and '3' has no preceding
space because parameter 'z' has none in the replacement list.
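
   The use of 'PREV_WHITE' by an output routine can be sketched as
follows. This is only an illustration: 'struct toy_token',
'toy_print_tokens' and the bit value chosen for 'PREV_WHITE' are
hypothetical stand-ins, not cpplib's real definitions, and the flags are
assumed to already describe the spacing wanted in the output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed bit value, for illustration only.  */
#define PREV_WHITE 0x01

/* Hypothetical, much reduced stand-in for a preprocessing token.  */
struct toy_token
{
  const char *spelling;
  unsigned char flags;
};

/* Spell COUNT tokens into OUT (capacity SIZE), writing one space before
   each token that carries PREV_WHITE.  Returns the number of characters
   written, not counting the terminating NUL.  */
static size_t
toy_print_tokens (const struct toy_token *tokens, size_t count,
                  char *out, size_t size)
{
  size_t used = 0;

  assert (size > 0);
  for (size_t i = 0; i < count; i++)
    {
      size_t len = strlen (tokens[i].spelling);
      size_t space = (tokens[i].flags & PREV_WHITE) ? 1 : 0;

      /* Leave room for the terminating NUL.  */
      assert (used + space + len < size);
      if (space)
        out[used++] = ' ';
      memcpy (out + used, tokens[i].spelling, len);
      used += len;
    }
  out[used] = '\0';
  return used;
}
```

   Feeding it the eight tokens of the expanded 'add' example, flagged
the way the output requires, reproduces the spacing shown above. The
sketch says nothing about where those flags come from.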

   Once lexed, tokens are effectively fixed and cannot be altered, since
pointers to them might be held in many places, in particular by
in-progress macro expansions. So instead of modifying the two tokens
above, the preprocessor inserts a special token, which I call a "padding
token", into the token stream to indicate that spacing of the subsequent
token is special. The preprocessor inserts padding tokens in front of
every macro expansion and expanded macro argument. These point to a
"source token" from which the subsequent real token should inherit its
spacing. In the above example, the source tokens are 'add' in the macro
invocation, and 'y' and 'z' in the macro replacement list, respectively.

   It is quite easy to get multiple padding tokens in a row, for example
if a macro's first replacement token expands straight into another
macro.

     #define foo bar
     #define bar baz
     [foo]
         ==> [baz]

   Here, two padding tokens are generated, whose sources are the 'foo'
token between the brackets and the 'bar' token from foo's replacement
list, respectively. Clearly the first padding token is the one to use,
so the output code should contain a rule that the first padding token in
a sequence is the one that matters.

   But what happens when we leave a macro expansion? Adjusting the
above example slightly:

     #define foo bar
     #define bar EMPTY baz
     #define EMPTY
     [foo] EMPTY;
         ==> [ baz] ;

   As shown, now there should be a space before 'baz' and the semicolon
in the output.

   The rules we decided above fail for 'baz': we generate three padding
tokens, one per macro invocation, before the token 'baz'. We would then
have it take its spacing from the first of these, which carries source
token 'foo' with no leading space.
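
   The "first padding token wins" rule, and the way it goes wrong for
'baz', can be sketched as below. The types and the function
'toy_space_before' are hypothetical and only model the rule as stated so
far; they are not taken from cpplib, and the sketch assumes every
padding token has a non-'NULL' source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bit value, for illustration only.  */
#define PREV_WHITE 0x01

enum toy_type { TOY_REAL, TOY_PADDING };

/* Hypothetical token: real tokens use FLAGS, padding tokens use SOURCE.  */
struct toy_token
{
  enum toy_type type;
  unsigned char flags;
  const struct toy_token *source;
};

/* Naive rule: if the real token at index AT is preceded by a run of
   padding tokens, the first padding token in that run decides the
   spacing; otherwise the token's own PREV_WHITE flag does.  */
static bool
toy_space_before (const struct toy_token *stream, size_t at)
{
  size_t first = at;

  assert (stream[at].type == TOY_REAL);
  while (first > 0 && stream[first - 1].type == TOY_PADDING)
    first--;

  if (first == at)
    return (stream[at].flags & PREV_WHITE) != 0;

  /* This naive version does not handle NULL sources.  */
  assert (stream[first].source != NULL);
  return (stream[first].source->flags & PREV_WHITE) != 0;
}
```

   With a stream of '[', three padding tokens whose sources are 'foo',
'bar' and 'EMPTY', and then 'baz', the function reports that no space
is needed, because 'foo' has no leading space. That is exactly the
wrong answer described above.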

   It is vital that cpplib get spacing correct in these examples since
any of these macro expansions could be stringized, where spacing
matters.

   So, this demonstrates that not just entering macro and argument
expansions, but leaving them requires special handling too. I made
cpplib insert a padding token with a 'NULL' source token when leaving
macro expansions, as well as after each replaced argument in a macro's
replacement list. It also inserts appropriate padding tokens on either
side of tokens created by the '#' and '##' operators. I expanded the
rule so that, if we see a padding token with a 'NULL' source token,
_and_ the source token remembered from an earlier padding token has no
leading space, then we behave as if we have seen no padding tokens at
all. A quick check shows this rule will then get the above example
correct as well.

   Now a relationship with paste avoidance is apparent: we have to be
careful about paste avoidance in exactly the same locations we have
padding tokens in order to get white space correct. This makes
implementation of paste avoidance easy: wherever the stand-alone
preprocessor is fixing up spacing because of padding tokens, and it
turns out that no space is needed, it has to take the extra step to
check that a space is not needed after all to avoid an accidental paste.
The function 'cpp_avoid_paste' advises whether a space is required
between two consecutive tokens. To avoid excessive spacing, it tries
hard to only require a space if one is likely to be necessary, but for
reasons of efficiency it is slightly conservative and might recommend a
space where one is not strictly needed.


File: cppinternals.info, Node: Line Numbering, Next: Guard Macros, Prev: Token Spacing, Up: Top

Line numbering
**************

Just which line number anyway?
==============================

There are three reasonable requirements a cpplib client might have for
the line number of a token passed to it:

   * The source line it was lexed on.
   * The line it is output on. This can be different to the line it was
     lexed on if, for example, there are intervening escaped newlines or
     C-style comments. For example:

          foo /* A long
          comment */ bar \
          baz
          =>
          foo bar baz

   * If the token results from a macro expansion, the line of the macro
     name, or possibly the line of the closing parenthesis in the case
     of function-like macro expansion.

   The 'cpp_token' structure contains 'line' and 'col' members. The
lexer fills these in with the line and column of the first character of
the token. Consequently, but maybe unexpectedly, a token from the
replacement list of a macro expansion carries the location of the token
within the '#define' directive, because cpplib expands a macro by
returning pointers to the tokens in its replacement list. The current
implementation of cpplib assigns tokens created from built-in macros and
the '#' and '##' operators the location of the most recently lexed
token. This is because they are allocated from the lexer's token
runs, and because of the way the diagnostic routines infer the
appropriate location to report.

   The diagnostic routines in cpplib display the location of the most
recently _lexed_ token, unless they are passed a specific line and
column to report. For diagnostics regarding tokens that arise from
macro expansions, it might also be helpful for the user to see the
original location in the macro definition that the token came from.
Since that is exactly the information each token carries, such an
enhancement could be made relatively easily in future.
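
   The remark that replacement-list tokens keep the location of their
'#define' follows directly from expansion by pointer, as this small
sketch tries to show. 'struct toy_token', 'struct toy_macro' and
'toy_expand' are hypothetical names invented for the illustration; only
the 'line' and 'col' member names come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token carrying the location it was lexed at.  */
struct toy_token
{
  const char *spelling;
  unsigned int line;
  unsigned int col;
};

/* Hypothetical macro: a name plus its lexed replacement list.  */
struct toy_macro
{
  const char *name;
  const struct toy_token *expansion;
  size_t count;
};

/* "Expand" MACRO by handing back a pointer to the INDEX-th token of its
   replacement list.  Nothing is copied, so the caller sees the line and
   column recorded when the '#define' was lexed, whatever line the
   invocation is on.  */
static const struct toy_token *
toy_expand (const struct toy_macro *macro, size_t index)
{
  assert (macro != NULL && index < macro->count);
  return &macro->expansion[index];
}
```

   A client that wants the line of the macro invocation therefore has to
obtain it some other way than from the returned token.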

   The stand-alone preprocessor faces a similar problem when determining
the correct line to output the token on: the position attached to a
token is fairly useless if the token came from a macro expansion. All
tokens on a logical line should be output on its first physical line, so
the token's reported location is also wrong if it is part of a physical
line other than the first.

   To solve these issues, cpplib provides a callback that is invoked
whenever it lexes a preprocessing token that starts a new logical line
other than a directive. It passes this token (which may be a 'CPP_EOF'
token indicating the end of the translation unit) to the callback
routine, which can then use the line and column of this token to produce
correct output.

Representation of line numbers
==============================

As mentioned above, cpplib stores with each token the line number that
it was lexed on. In fact, this number is not the number of the line in
the source file, but instead bears more resemblance to the number of the
line in the translation unit.

   The preprocessor maintains a monotonically increasing line count,
which is incremented at every new line character (and also at the end of
any buffer that does not end in a new line). Since a line number of
zero is useful to indicate certain special states and conditions, this
variable starts counting from one.

   This variable therefore uniquely enumerates each line in the
translation unit. With some simple infrastructure, it is
straightforward to map from this to the original source file and line
number pair, saving space whenever line number information needs to be
saved. The code that implements this mapping lies in the files
'line-map.c' and 'line-map.h'.

   Command-line macros and assertions are implemented by pushing a
buffer containing the right hand side of an equivalent '#define' or
'#assert' directive.
Some built-in macros are handled similarly. Since
these are all processed before the first line of the main input file, it
will typically have an assigned line closer to twenty than to one.


File: cppinternals.info, Node: Guard Macros, Next: Files, Prev: Line Numbering, Up: Top

The Multiple-Include Optimization
*********************************

Header files are often of the form

     #ifndef FOO
     #define FOO
     ...
     #endif

to prevent the compiler from processing them more than once. The
preprocessor notices such header files, so that if the header file
appears in a subsequent '#include' directive and 'FOO' is defined, then
it is ignored and it doesn't preprocess or even re-open the file a
second time. This is referred to as the "multiple include
optimization".

   Under what circumstances is such an optimization valid? If the file
were included a second time, it can only be optimized away if that
inclusion would result in no tokens to return, and no relevant
directives to process. Therefore the current implementation imposes
requirements and makes some allowances as follows:

  1. There must be no tokens outside the controlling '#if'-'#endif'
     pair, but whitespace and comments are permitted.

  2. There must be no directives outside the controlling directive pair,
     but the "null directive" (a line containing nothing other than a
     single '#' and possibly whitespace) is permitted.

  3. The opening directive must be of the form

          #ifndef FOO

     or

          #if !defined FOO     [equivalently, #if !defined(FOO)]

  4. In the second form above, the tokens forming the '#if' expression
     must have come directly from the source file--no macro expansion
     must have been involved. This is because macro definitions can
     change, and tracking whether or not a relevant change has been made
     is not worth the implementation cost.

  5. There can be no '#else' or '#elif' directives at the outer
     conditional block level, because they would probably contain
     something of interest to a subsequent pass.

   First, when pushing a new file on the buffer stack,
'_stack_include_file' sets the controlling macro 'mi_cmacro' to 'NULL',
and sets 'mi_valid' to 'true'. This indicates that the preprocessor has
not yet encountered anything that would invalidate the multiple-include
optimization. As described in the next few paragraphs, these two
variables having these values effectively indicates top-of-file.

   When about to return a token that is not part of a directive,
'_cpp_lex_token' sets 'mi_valid' to 'false'. This enforces the
constraint that tokens outside the controlling conditional block
invalidate the optimization.

   The 'do_if', when appropriate, and 'do_ifndef' directive handlers
pass the controlling macro to the function 'push_conditional'. cpplib
maintains a stack of nested conditional blocks, and after processing
every opening conditional this function pushes an 'if_stack' structure
onto the stack. In this structure it records the controlling macro for
the block, provided there is one and we're at top-of-file (as described
above). If an '#elif' or '#else' directive is encountered, the
controlling macro for that block is cleared to 'NULL'. Otherwise, it
survives until the '#endif' closing the block, upon which 'do_endif'
sets 'mi_valid' to true and stores the controlling macro in 'mi_cmacro'.

   '_cpp_handle_directive' clears 'mi_valid' when processing any
directive other than an opening conditional and the null directive.
With this, and requiring top-of-file to record a controlling macro, and
no '#else' or '#elif' for it to survive and be copied to 'mi_cmacro' by
'do_endif', we have enforced the absence of directives outside the main
conditional block for the optimization to be on.
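
   The interplay of 'mi_valid' and 'mi_cmacro' described above can be
modelled as a small state machine. The following is a sketch of that
description only, not cpplib's code: every 'toy_' name is hypothetical,
the events are reduced to a handful of calls, and restoring the guard
only when the outermost block closes is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEPTH 16

/* Hypothetical per-file state for the multiple-include optimization.  */
struct toy_reader
{
  bool mi_valid;
  const char *mi_cmacro;
  const char *if_stack[TOY_MAX_DEPTH]; /* Guard recorded per open block.  */
  size_t depth;
};

/* A new file is pushed: nothing has invalidated the optimization yet.  */
static void
toy_start_file (struct toy_reader *r)
{
  r->mi_valid = true;
  r->mi_cmacro = NULL;
  r->depth = 0;
}

/* A token that is not part of a directive is about to be returned.  */
static void
toy_token (struct toy_reader *r)
{
  r->mi_valid = false;
}

/* Any directive other than an opening conditional or null directive.  */
static void
toy_other_directive (struct toy_reader *r)
{
  r->mi_valid = false;
}

/* Opening conditional; GUARD is NULL unless it has a guard form.  The
   guard is recorded only at top-of-file.  */
static void
toy_open_conditional (struct toy_reader *r, const char *guard)
{
  bool top_of_file = r->mi_valid && r->mi_cmacro == NULL;

  assert (r->depth < TOY_MAX_DEPTH);
  r->if_stack[r->depth++] = top_of_file ? guard : NULL;
}

/* '#else' or '#elif': the block loses its controlling macro.  */
static void
toy_else_or_elif (struct toy_reader *r)
{
  assert (r->depth > 0);
  r->mi_valid = false;
  r->if_stack[r->depth - 1] = NULL;
}

/* '#endif': a surviving guard of the outermost block is restored.  */
static void
toy_endif (struct toy_reader *r)
{
  const char *guard;

  assert (r->depth > 0);
  r->mi_valid = false;
  guard = r->if_stack[--r->depth];
  if (r->depth == 0 && guard != NULL)
    {
      r->mi_valid = true;
      r->mi_cmacro = guard;
    }
}

/* End of file: the guard to remember for this file, or NULL.  */
static const char *
toy_end_file (const struct toy_reader *r)
{
  return r->mi_valid ? r->mi_cmacro : NULL;
}
```

   Driving the model with the events of a well-formed guarded header
yields the guard at end of file, while a token before the '#ifndef', a
token after the '#endif', or an outer '#else' each yield 'NULL'.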

   Note that whilst we are inside the conditional block, 'mi_valid' is
likely to be reset to 'false', but this does not matter since the
closing '#endif' restores it to 'true' if appropriate.

   Finally, since '_cpp_lex_direct' pops the file off the buffer stack
at 'EOF' without returning a token, if the '#endif' directive was not
followed by any tokens, 'mi_valid' is 'true' and '_cpp_pop_file_buffer'
remembers the controlling macro associated with the file. Subsequent
calls to 'stack_include_file' result in no buffer being pushed if the
controlling macro is defined, effecting the optimization.

   A quick word on how we handle the

     #if !defined FOO

case. '_cpp_parse_expr' and 'parse_defined' take steps to see whether
the three stages '!', 'defined-expression' and 'end-of-directive' occur
in order in a '#if' expression. If so, they return the guard macro to
'do_if' in the variable 'mi_ind_cmacro', and otherwise set it to 'NULL'.
'enter_macro_context' sets 'mi_valid' to false, so if a macro was
expanded whilst parsing any part of the expression, then the top-of-file
test in 'push_conditional' fails and the optimization is turned off.


File: cppinternals.info, Node: Files, Next: Concept Index, Prev: Guard Macros, Up: Top

File Handling
*************

Fairly obviously, the file handling code of cpplib resides in the file
'files.c'. It takes care of the details of file searching, opening,
reading and caching, for both the main source file and all the headers
it recursively includes.

   The basic strategy is to minimize the number of system calls. On
many systems, the basic 'open ()' and 'fstat ()' system calls can be
quite expensive. For every '#include'-d file, we need to try all the
directories in the search path until we find a match.
Some projects,
such as glibc, pass twenty or thirty include paths on the command line,
so this can rapidly become time consuming.

   For a header file we have not encountered before we have little
choice but to do this. However, it is often the case that the same
headers are repeatedly included, and in these cases we try to avoid
repeating the filesystem queries whilst searching for the correct file.

   For each file we try to open, we store the constructed path in a
splay tree. This path first undergoes simplification by the function
'_cpp_simplify_pathname'. For example, '/usr/include/bits/../foo.h' is
simplified to '/usr/include/foo.h' before we enter it in the splay tree
and try to 'open ()' the file. CPP will then find subsequent uses of
'foo.h', even as '/usr/include/foo.h', in the splay tree and save system
calls.

   Further, it is likely the file contents have also been cached, saving
a 'read ()' system call. We don't bother caching the contents of header
files that are re-inclusion protected, and whose re-inclusion macro is
defined when we leave the header file for the first time. If the host
supports it, we try to map suitably large files into memory, rather than
reading them in directly.

   The include paths are internally stored on a null-terminated
singly-linked list, starting with the '"header.h"' directory search
chain, which then links into the '<header.h>' directory chain.

   Files included with the '<foo.h>' syntax start the lookup directly in
the second half of this chain. However, files included with the
'"foo.h"' syntax start at the beginning of the chain, but with one extra
directory prepended. This is the directory of the current file; the one
containing the '#include' directive. Prepending this directory on a
per-file basis is handled by the function 'search_from'.
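
   The choice of starting point on the search chain can be sketched as
follows. This is a simplified model under stated assumptions:
'struct toy_dir', 'struct toy_chains', 'toy_search_start' and
'toy_chain_length' are hypothetical names, and the caller is assumed to
own the node for the current file's directory. It is not the code of
'search_from'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node of the null-terminated, singly-linked search list.  */
struct toy_dir
{
  const char *name;
  struct toy_dir *next;
};

/* The quote chain's last node links into the head of the bracket
   chain, so the two form one list.  */
struct toy_chains
{
  struct toy_dir *quote;   /* Start for "foo.h" lookups.  */
  struct toy_dir *bracket; /* Start for <foo.h> lookups.  */
};

/* Return where the lookup should begin.  For the "foo.h" form,
   CURRENT_DIR (the directory of the including file) is linked in front
   of the whole chain first.  */
static struct toy_dir *
toy_search_start (const struct toy_chains *chains, int angle_brackets,
                  struct toy_dir *current_dir)
{
  assert (chains != NULL);
  if (angle_brackets)
    return chains->bracket;

  assert (current_dir != NULL);
  current_dir->next = chains->quote;
  return current_dir;
}

/* Count the directories that a lookup starting at DIR would try.  */
static size_t
toy_chain_length (const struct toy_dir *dir)
{
  size_t n = 0;

  for (; dir != NULL; dir = dir->next)
    n++;
  return n;
}
```

   With one quote directory linked to two bracket directories, an
angle-bracket include visits two directories and a quoted include
visits four, the extra one being the directory of the current file.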

   Note that a header included with a directory component, such as
'#include "mydir/foo.h"' and opened as '/usr/local/include/mydir/foo.h',
will have the complete path minus the basename 'foo.h' as the current
directory.

   Enough information is stored in the splay tree that CPP can
immediately tell whether it can skip the header file because of the
multiple include optimization, whether the file didn't exist or couldn't
be opened for some reason, or whether the header was flagged not to be
re-used, as it is with the obsolete '#import' directive.

   For the benefit of MS-DOS filesystems with an 8.3 filename
limitation, CPP offers the ability to treat various include file names
as aliases for the real header files with shorter names. The map from
one to the other is found in a special file called 'header.gcc', stored
in the command line (or system) include directories to which the mapping
applies. This may be higher up the directory tree than the full path to
the file minus the base name.


File: cppinternals.info, Node: Concept Index, Prev: Files, Up: Top

Concept Index
*************

[index]
* Menu:

* assertions:                            Hash Nodes.       (line   6)
* controlling macros:                    Guard Macros.     (line   6)
* escaped newlines:                      Lexer.            (line   5)
* files:                                 Files.            (line   6)
* guard macros:                          Guard Macros.     (line   6)
* hash table:                            Hash Nodes.       (line   6)
* header files:                          Conventions.      (line   6)
* identifiers:                           Hash Nodes.       (line   6)
* interface:                             Conventions.      (line   6)
* lexer:                                 Lexer.            (line   6)
* line numbers:                          Line Numbering.   (line   5)
* macro expansion:                       Macro Expansion.  (line   6)
* macro representation (internal):       Macro Expansion.  (line  19)
* macros:                                Hash Nodes.       (line   6)
* multiple-include optimization:         Guard Macros.     (line   6)
* named operators:                       Hash Nodes.       (line   6)
* newlines:                              Lexer.            (line   6)
* paste avoidance:                       Token Spacing.    (line   6)
* spacing:                               Token Spacing.    (line   6)
* token run:                             Lexer.            (line 191)
* token spacing:                         Token Spacing.    (line   6)



Tag Table:
Node: Top905
Node: Conventions2743
Node: Lexer3685
Ref: Invalid identifiers11599
Ref: Lexing a line13549
Node: Hash Nodes18318
Node: Macro Expansion21197
Node: Token Spacing30141
Node: Line Numbering35997
Node: Guard Macros40082
Node: Files44873
Node: Concept Index48339

End Tag Table