@c Copyright (C) 2010-2020 Free Software Foundation, Inc.
@c This is part of the GCC manual.
@c For copying conditions, see the file gcc.texi.
@c Contributed by Jan Hubicka <jh@suse.cz> and
@c Diego Novillo <dnovillo@google.com>

@node LTO
@chapter Link Time Optimization
@cindex lto
@cindex whopr
@cindex wpa
@cindex ltrans

Link Time Optimization (LTO) gives GCC the capability of
dumping its internal representation (GIMPLE) to disk,
so that all the different compilation units that make up
a single executable can be optimized as a single module.
This expands the scope of inter-procedural optimizations
to encompass the whole program (or, rather, everything
that is visible at link time).

@menu
* LTO Overview::            Overview of LTO.
* LTO object file layout::  LTO file sections in ELF.
* IPA::                     Using summary information in IPA passes.
* WHOPR::                   Whole program assumptions,
                            linker plugin and symbol visibilities.
* Internal flags::          Internal flags controlling @code{lto1}.
@end menu

@node LTO Overview
@section Design Overview

Link time optimization is implemented as a GCC front end for a
bytecode representation of GIMPLE that is emitted in special sections
of @code{.o} files.  Currently, LTO support is enabled in most
ELF-based systems, as well as darwin, cygwin and mingw systems.

Since GIMPLE bytecode is saved alongside final object code, object
files generated with LTO support are larger than regular object files.
This ``fat'' object format makes it easy to integrate LTO into
existing build systems, as one can, for instance, produce archives of
the files.  Additionally, one might be able to ship one set of fat
objects which could be used both for development and the production of
optimized builds.  A perhaps surprising side effect of this feature
is that any mistake in the toolchain leads to LTO information not
being used (e.g.@: an older @code{libtool} calling @code{ld} directly).
This is both an advantage, as the system is more robust, and a
disadvantage, as the user is not informed that the optimization has
been disabled.

The current implementation only produces ``fat'' objects, effectively
doubling compilation time and increasing file sizes up to 5x the
original size.  This hides the problem that some tools, such as
@code{ar} and @code{nm}, need to understand symbol tables of LTO
sections.  These tools were extended to use the plugin infrastructure,
and with these problems solved, GCC will also support ``slim'' objects
consisting of the intermediate code alone.

At the highest level, LTO splits the compiler in two.  The first half
(the ``writer'') produces a streaming representation of all the
internal data structures needed to optimize and generate code.  This
includes declarations, types, the callgraph and the GIMPLE representation
of function bodies.

When @option{-flto} is given during compilation of a source file, the
pass manager executes all the passes in @code{all_lto_gen_passes}.
Currently, this phase is composed of two IPA passes:

@itemize @bullet
@item @code{pass_ipa_lto_gimple_out}
This pass executes the function @code{lto_output} in
@file{lto-streamer-out.c}, which traverses the call graph encoding
every reachable declaration, type and function.  This generates a
memory representation of all the file sections described below.

@item @code{pass_ipa_lto_finish_out}
This pass executes the function @code{produce_asm_for_decls} in
@file{lto-streamer-out.c}, which takes the memory image built in the
previous pass and encodes it in the corresponding ELF file sections.
@end itemize

The second half of LTO support is the ``reader''.  This is implemented
as the GCC front end @file{lto1} in @file{lto/lto.c}.
When
@file{collect2} detects a link set of @code{.o}/@code{.a} files with
LTO information and @option{-flto} is enabled, it invokes
@file{lto1}, which reads the set of files and aggregates them into a
single translation unit for optimization.  The main entry point for
the reader is @file{lto/lto.c}:@code{lto_main}.

@subsection LTO modes of operation

One of the main goals of the GCC link-time infrastructure was to allow
effective compilation of large programs.  For this reason GCC implements two
link-time compilation modes.

@enumerate
@item @emph{LTO mode}, in which the whole program is read into the
compiler at link-time and optimized in a similar way as if it
were a single source-level compilation unit.

@item @emph{WHOPR or partitioned mode}, designed to utilize multiple
CPUs and/or a distributed compilation environment to quickly link
large applications.  WHOPR stands for WHOle Program optimizeR (not to
be confused with the semantics of @option{-fwhole-program}).  It
partitions the aggregated callgraph from many different @code{.o}
files and distributes the compilation of the sub-graphs to different
CPUs.

Note that distributed compilation is not implemented yet, but since
the parallelism is facilitated via generating a @code{Makefile}, it
would be easy to implement.
@end enumerate

WHOPR splits LTO into three main stages:
@enumerate
@item Local generation (LGEN)
This stage executes in parallel.  Every file in the program is compiled
into the intermediate language and packaged together with the local
call-graph and summary information.  This stage is the same for both
the LTO and WHOPR compilation modes.

@item Whole Program Analysis (WPA)
WPA is performed sequentially.  The global call-graph is generated, and
a global analysis procedure makes transformation decisions.
The global
call-graph is partitioned to facilitate parallel optimization during
phase 3.  The results of the WPA stage are stored into new object files
which contain the partitions of the program expressed in the intermediate
language and the optimization decisions.

@item Local transformations (LTRANS)
This stage executes in parallel.  All the decisions made during phase 2
are implemented locally in each partitioned object file, and the final
object code is generated.  Optimizations which cannot be decided
efficiently during phase 2 may be performed on the local
call-graph partitions.
@end enumerate

WHOPR can be seen as an extension of the usual LTO mode of
compilation.  In LTO, WPA and LTRANS are executed within a single
execution of the compiler, after the whole program has been read into
memory.

When compiling in WHOPR mode, the callgraph is partitioned during
the WPA stage.  The whole program is split into a given number of
partitions of roughly the same size.  The compiler tries to
minimize the number of references which cross partition boundaries.
The main advantage of WHOPR is to allow the parallel execution of
LTRANS stages, which are the most time-consuming part of the
compilation process.  Additionally, it avoids the need to load the
whole program into memory.


@node LTO object file layout
@section LTO file sections

LTO information is stored in several ELF sections inside object files.
Data structures and enum codes for sections are defined in
@file{lto-streamer.h}.

These sections are emitted from @file{lto-streamer-out.c} and mapped
in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
individual functions dealing with the reading/writing of each section
are described below.

@itemize @bullet
@item Command line options (@code{.gnu.lto_.opts})

This section contains the command line options used to generate the
object files.  This is used at link time to determine the optimization
level and other settings when they are not explicitly specified on the
linker command line.

Currently, GCC does not support combining LTO object files compiled
with different sets of command line options into a single binary.
At link time, the options given on the command line and the options
saved in all the files of a link-time set are applied globally.  No
attempt is made at validating the combination of flags (other than the
usual validation done by option processing).  This is implemented in
@file{lto/lto.c}:@code{lto_read_all_file_options}.


@item Symbol table (@code{.gnu.lto_.symtab})

This table replaces the ELF symbol table for functions and variables
represented in the LTO IL.  Symbols used and exported by the optimized
assembly code of ``fat'' objects might not match the ones used and
exported by the intermediate code.  This table is necessary because
the intermediate code is less optimized and thus requires a separate
symbol table.

Additionally, the binary code in a ``fat'' object may lack a call
to a function that was optimized out at compilation time
after the intermediate language was streamed out.  In some special
cases, the same optimization may not happen during link-time
optimization.  This would lead to an undefined symbol if only one
symbol table were used.

The symbol table is emitted in
@file{lto-streamer-out.c}:@code{produce_symtab}.


@item Global declarations and types (@code{.gnu.lto_.decls})

This section contains an intermediate language dump of all
declarations and types required to represent the callgraph, static
variables and top-level debug info.

The contents of this section are emitted in
@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
symbols are emitted in a topological order that preserves the sharing
of pointers when the file is read back in
(@file{lto.c}:@code{read_cgraph_and_symbols}).


@item The callgraph (@code{.gnu.lto_.cgraph})

This section contains the basic data structure used by the GCC
inter-procedural optimization infrastructure.  This section stores an
annotated multi-graph which represents the functions and call sites as
well as the variables, aliases and top-level @code{asm} statements.

This section is emitted in
@file{lto-streamer-out.c}:@code{output_cgraph} and read in
@file{lto-cgraph.c}:@code{input_cgraph}.


@item IPA references (@code{.gnu.lto_.refs})

This section contains references between functions and static
variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
and read by @file{lto-cgraph.c}:@code{input_refs}.


@item Function bodies (@code{.gnu.lto_.function_body.<name>})

This section contains function bodies in the intermediate language
representation.  Every function body is in a separate section to allow
copying of the section independently to different object files or
reading the function on demand.

Functions are emitted in
@file{lto-streamer-out.c}:@code{output_function} and read in
@file{lto-streamer-in.c}:@code{input_function}.


@item Static variable initializers (@code{.gnu.lto_.vars})

This section contains all the symbols in the global variable pool.  It
is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
@file{lto-cgraph.c}:@code{input_cgraph}.

@item Summaries and optimization summaries used by IPA passes
(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
@code{pureconst} or @code{reference})

These sections are used by IPA passes that need to emit summary
information during LTO generation to be read and aggregated at
link time.  Each pass is responsible for implementing two pass manager
hooks: one for writing the summary and another for reading it in.  The
format of these sections is entirely up to each individual pass.  The
only requirement is that the writer and reader hooks agree on the
format.
@end itemize


@node IPA
@section Using summary information in IPA passes

Programs are represented internally as a @emph{callgraph} (a
multi-graph where nodes are functions and edges are call sites)
and a @emph{varpool} (a list of static and external variables in
the program).

The inter-procedural optimization is organized as a sequence of
individual passes, which operate on the callgraph and the
varpool.  To make the implementation of WHOPR possible, every
inter-procedural optimization pass is split into several stages
that are executed at different times during WHOPR compilation:

@itemize @bullet
@item LGEN time
@enumerate
@item @emph{Generate summary} (@code{generate_summary} in
@code{struct ipa_opt_pass_d}).  This stage analyzes every function
body and variable initializer and stores the relevant
information into a pass-specific data structure.

@item @emph{Write summary} (@code{write_summary} in
@code{struct ipa_opt_pass_d}).  This stage writes all the
pass-specific information generated by @code{generate_summary}.
Summaries go into their own @code{LTO_section_*} sections that
have to be declared in @file{lto-streamer.h}:@code{enum
lto_section_type}.
A new section is created by calling
@code{create_output_block} and data can be written using the
@code{lto_output_*} routines.
@end enumerate

@item WPA time
@enumerate
@item @emph{Read summary} (@code{read_summary} in
@code{struct ipa_opt_pass_d}).  This stage reads all the
pass-specific information in exactly the same order that it was
written by @code{write_summary}.

@item @emph{Execute} (@code{execute} in @code{struct
opt_pass}).  This performs inter-procedural propagation.  This
must be done without actual access to the individual function
bodies or variable initializers.  Typically, this results in a
transitive closure operation over the summary information of all
the nodes in the callgraph.

@item @emph{Write optimization summary}
(@code{write_optimization_summary} in @code{struct
ipa_opt_pass_d}).  This writes the result of the inter-procedural
propagation into the object file.  This can use the same data
structures and helper routines used in @code{write_summary}.
@end enumerate

@item LTRANS time
@enumerate
@item @emph{Read optimization summary}
(@code{read_optimization_summary} in @code{struct
ipa_opt_pass_d}).  The counterpart to
@code{write_optimization_summary}.  This reads the inter-procedural
optimization decisions in exactly the same format emitted by
@code{write_optimization_summary}.

@item @emph{Transform} (@code{function_transform} and
@code{variable_transform} in @code{struct ipa_opt_pass_d}).
The actual function bodies and variable initializers are updated
based on the information passed down from the @emph{Execute} stage.
@end enumerate
@end itemize

The implementation of the inter-procedural passes is shared
between LTO, WHOPR and classic non-LTO compilation.

@itemize
@item During the traditional file-by-file mode every pass executes its
own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
stages within the single execution context of the compiler.

@item In LTO compilation mode, every pass uses the @emph{Generate
summary} and @emph{Write summary} stages at compilation time,
while the @emph{Read summary}, @emph{Execute}, and
@emph{Transform} stages are executed at link time.

@item In WHOPR mode all stages are used.
@end itemize

To simplify development, the GCC pass manager differentiates
between normal inter-procedural passes (@pxref{Regular IPA passes}),
small inter-procedural passes (@pxref{Small IPA passes})
and late inter-procedural passes (@pxref{Late IPA passes}).
A small or late IPA pass (@code{SIMPLE_IPA_PASS}) does
everything at once and thus cannot be executed during WPA in
WHOPR mode.  It defines only the @emph{Execute} stage and during
this stage it accesses and modifies the function bodies.  Such
passes are useful for optimization at LGEN or LTRANS time and are
used, for example, to implement early optimization before writing
object files.  The simple inter-procedural passes can also be used
for easier prototyping and development of a new inter-procedural
pass.


@subsection Virtual clones

One of the main challenges of introducing the WHOPR compilation
mode was addressing the interactions between optimization passes.
In LTO compilation mode, the passes are executed in sequence,
each consisting of analysis (or @emph{Generate summary}),
propagation (or @emph{Execute}) and @emph{Transform} stages.
Once the work of one pass is finished, the next pass sees the
updated program representation and can execute.  This makes the
individual passes dependent on each other.

In WHOPR mode all passes first execute their @emph{Generate
summary} stage.
Then summary writing marks the end of the LGEN
stage.  At WPA time,
the summaries are read back into memory and all passes run the
@emph{Execute} stage.  Optimization summaries are streamed and
sent to LTRANS, where all the passes execute the @emph{Transform}
stage.

Most optimization passes split naturally into analysis,
propagation and transformation stages.  But some do not.  The
main problem arises when one pass performs changes and the
following pass gets confused by seeing different callgraphs
between the @emph{Transform} stage and the @emph{Generate summary}
or @emph{Execute} stage.  This means that the passes are required
to communicate their decisions with each other.

To facilitate this communication, the GCC callgraph
infrastructure implements @emph{virtual clones}, a method of
representing the changes performed by the optimization passes in
the callgraph without needing to update function bodies.

A @emph{virtual clone} in the callgraph is a function that has no
associated body, just a description of how to create its body based
on a different function (which itself may be a virtual clone).

The description of function modifications includes adjustments to
the function's signature (which allows, for example, removing or
adding function arguments), substitutions to perform on the
function body, and, for inlined functions, a pointer to the
function that it will be inlined into.

It is also possible to redirect any edge of the callgraph from a
function to its virtual clone.  This implies updating the call
site to adjust for the new function signature.

Most of the transformations performed by inter-procedural
optimizations can be represented via virtual clones.  For
instance, a constant propagation pass can produce a virtual clone
of the function which replaces one of its arguments by a
constant.
The inliner can represent its decisions by producing a
clone of a function whose body will be later integrated into
a given function.

Using @emph{virtual clones}, the program can be easily updated
during the @emph{Execute} stage, solving most of the pass interaction
problems that would otherwise occur during @emph{Transform}.

Virtual clones are later materialized in the LTRANS stage and
turned into real functions.  Passes executed after the virtual
clone was introduced also perform their @emph{Transform} stage
on new functions, so for a pass there is no significant
difference between operating on a real function or a virtual
clone introduced before its @emph{Execute} stage.

Optimization passes then work on virtual clones introduced before
their @emph{Execute} stage as if they were real functions.  The
only difference is that clones are not visible during the
@emph{Generate summary} stage.

To keep function summaries updated, the callgraph interface
allows an optimizer to register a callback that is called every
time a new clone is introduced as well as when the actual
function or variable is generated or when a function or variable
is removed.  These hooks are registered in the @emph{Generate
summary} stage and allow the pass to keep its information intact
until the @emph{Execute} stage.  The same hooks can also be
registered during the @emph{Execute} stage to keep the
optimization summaries updated for the @emph{Transform} stage.

@subsection IPA references

GCC represents IPA references in the callgraph.  For a function
or variable @code{A}, the @emph{IPA reference} is a list of all
locations where the address of @code{A} is taken and, when
@code{A} is a variable, a list of all direct stores and reads
to/from @code{A}.  References represent an oriented multi-graph on
the union of nodes of the callgraph and the varpool.
See
@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
and
@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
for details.

@subsection Jump functions
Suppose that an optimization pass sees a function @code{A} and it
knows the values of (some of) its arguments.  The @emph{jump
function} describes the value of a parameter of a given function
call in function @code{A} based on this knowledge.

Jump functions are used by several optimizations, such as the
inter-procedural constant propagation pass and the
devirtualization pass.  The inliner also uses jump functions to
perform inlining of callbacks.

@node WHOPR
@section Whole program assumptions, linker plugin and symbol visibilities

Link-time optimization gives relatively minor benefits when used
alone.  The problem is that propagation of inter-procedural
information does not work well across functions and variables
that are called or referenced by other compilation units (such as
from a dynamically linked library).  We say that such functions
and variables are @emph{externally visible}.

To make the situation even more difficult, many applications
organize themselves as a set of shared libraries, and the default
ELF visibility rules allow one to override any externally
visible symbol with a different symbol at runtime.  This
basically disables any optimizations across such functions and
variables, because the compiler cannot be sure that the function
body it is seeing is the same function body that will be used at
runtime.  Any function or variable not declared @code{static} in
the sources degrades the quality of inter-procedural
optimization.

To avoid this problem the compiler must assume that it sees the
whole program when doing link-time optimization.  Strictly
speaking, the whole program is rarely visible even at link-time.
Standard system libraries are usually linked dynamically or not
provided with the link-time information.  In GCC, the whole
program option (@option{-fwhole-program}) asserts that every
function and variable defined in the current compilation
unit is static, except for function @code{main} (note: at
link time, the current unit is the union of all objects compiled
with LTO).  Since some functions and variables need to
be referenced externally, for example by another DSO or from an
assembler file, GCC also provides the function and variable
attribute @code{externally_visible} which can be used to disable
the effect of @option{-fwhole-program} on a specific symbol.

The whole program mode assumptions are slightly more complex in
C++, where inline functions in headers are put into @emph{COMDAT}
sections.  COMDAT functions and variables can be defined by
multiple object files and their bodies are unified at link-time
and dynamic link-time.  COMDAT functions are changed to local only
when their address is not taken and thus un-sharing them with a
library is not harmful.  COMDAT variables always remain externally
visible; however, for read-only variables it is assumed that their
initializers cannot be overwritten by a different value.

GCC provides the function and variable attribute
@code{visibility} that can be used to specify the visibility of
externally visible symbols (or alternatively the
@option{-fvisibility} command line option).  ELF defines
the @code{default}, @code{protected}, @code{hidden} and
@code{internal} visibilities.

The most commonly used visibility is @code{hidden}.  It
specifies that the symbol cannot be referenced from outside of
the current shared library.
Unfortunately, this information
cannot be used directly by the link-time optimization in the
compiler since the whole shared library also might contain
non-LTO objects and those are not visible to the compiler.

GCC solves this problem using linker plugins.  A @emph{linker
plugin} is an interface to the linker that allows an external
program to claim the ownership of a given object file.  The linker
then performs the linking procedure by querying the plugin about
the symbol table of the claimed objects and, once the linking
decisions are complete, the plugin is allowed to provide the
final object file before the actual linking is done.  The linker
plugin obtains the symbol resolution information which specifies
which symbols provided by the claimed objects are bound from the
rest of the binary being linked.

GCC is designed to be independent of the rest of the toolchain
and aims to support linkers without plugin support.  For this
reason it does not use the linker plugin by default.  Instead,
the object files are examined by @command{collect2} before being
passed to the linker and objects found to have LTO sections are
passed to @command{lto1} first.  This mode does not work for
library archives.  The decision about which object files from the
archive are needed depends on the actual linking and thus GCC
would have to implement the linker itself.  The resolution
information is missing too and thus GCC needs to make an educated
guess based on @option{-fwhole-program}.  Without the linker
plugin GCC also assumes that symbols are declared @code{hidden}
and not referred to by non-LTO code by default.

@node Internal flags
@section Internal flags controlling @code{lto1}

The following flags are passed into @command{lto1} and are not
meant to be used directly from the command line.

@itemize
@item -fwpa
@opindex fwpa
This option runs the serial part of the link-time optimizer
performing the inter-procedural propagation (WPA mode).  The
compiler reads in summary information from all inputs and
performs an analysis based on summary information only.  It
generates object files for subsequent runs of the link-time
optimizer where individual object files are optimized using both
summary information from the WPA mode and the actual function
bodies.  It then drives the LTRANS phase.

@item -fltrans
@opindex fltrans
This option runs the link-time optimizer in the
local-transformation (LTRANS) mode, which reads in output from a
previous run of the LTO in WPA mode.  In the LTRANS mode, LTO
optimizes an object and produces the final assembly.

@item -fltrans-output-list=@var{file}
@opindex fltrans-output-list
This option specifies a file to which the names of LTRANS output
files are written.  This option is only meaningful in conjunction
with @option{-fwpa}.

@item -fresolution=@var{file}
@opindex fresolution
This option specifies the linker resolution file.  This option is
only meaningful in conjunction with @option{-fwpa} and is normally
passed through by the LTO linker plugin.
@end itemize