1@c Copyright (c) 2010 Free Software Foundation, Inc. 2@c Free Software Foundation, Inc. 3@c This is part of the GCC manual. 4@c For copying conditions, see the file gcc.texi. 5@c Contributed by Jan Hubicka <jh@suse.cz> and 6@c Diego Novillo <dnovillo@google.com> 7 8@node LTO 9@chapter Link Time Optimization 10@cindex lto 11@cindex whopr 12@cindex wpa 13@cindex ltrans 14 15@section Design Overview 16 17Link time optimization is implemented as a GCC front end for a 18bytecode representation of GIMPLE that is emitted in special sections 19of @code{.o} files. Currently, LTO support is enabled in most 20ELF-based systems, as well as darwin, cygwin and mingw systems. 21 22Since GIMPLE bytecode is saved alongside final object code, object 23files generated with LTO support are larger than regular object files. 24This ``fat'' object format makes it easy to integrate LTO into 25existing build systems, as one can, for instance, produce archives of 26the files. Additionally, one might be able to ship one set of fat 27objects which could be used both for development and the production of 28optimized builds. A, perhaps surprising, side effect of this feature 29is that any mistake in the toolchain that leads to LTO information not 30being used (e.g.@: an older @code{libtool} calling @code{ld} directly). 31This is both an advantage, as the system is more robust, and a 32disadvantage, as the user is not informed that the optimization has 33been disabled. 34 35The current implementation only produces ``fat'' objects, effectively 36doubling compilation time and increasing file sizes up to 5x the 37original size. This hides the problem that some tools, such as 38@code{ar} and @code{nm}, need to understand symbol tables of LTO 39sections. These tools were extended to use the plugin infrastructure, 40and with these problems solved, GCC will also support ``slim'' objects 41consisting of the intermediate code alone. 42 43At the highest level, LTO splits the compiler in two. The first half 44(the ``writer'') produces a streaming representation of all the 45internal data structures needed to optimize and generate code. This 46includes declarations, types, the callgraph and the GIMPLE representation 47of function bodies. 48 49When @option{-flto} is given during compilation of a source file, the 50pass manager executes all the passes in @code{all_lto_gen_passes}. 51Currently, this phase is composed of two IPA passes: 52 53@itemize @bullet 54@item @code{pass_ipa_lto_gimple_out} 55This pass executes the function @code{lto_output} in 56@file{lto-streamer-out.c}, which traverses the call graph encoding 57every reachable declaration, type and function. This generates a 58memory representation of all the file sections described below. 59 60@item @code{pass_ipa_lto_finish_out} 61This pass executes the function @code{produce_asm_for_decls} in 62@file{lto-streamer-out.c}, which takes the memory image built in the 63previous pass and encodes it in the corresponding ELF file sections. 64@end itemize 65 66The second half of LTO support is the ``reader''. This is implemented 67as the GCC front end @file{lto1} in @file{lto/lto.c}. When 68@file{collect2} detects a link set of @code{.o}/@code{.a} files with 69LTO information and the @option{-flto} is enabled, it invokes 70@file{lto1} which reads the set of files and aggregates them into a 71single translation unit for optimization. The main entry point for 72the reader is @file{lto/lto.c}:@code{lto_main}. 73 74@subsection LTO modes of operation 75 76One of the main goals of the GCC link-time infrastructure was to allow 77effective compilation of large programs. For this reason GCC implements two 78link-time compilation modes. 79 80@enumerate 81@item @emph{LTO mode}, in which the whole program is read into the 82compiler at link-time and optimized in a similar way as if it 83were a single source-level compilation unit. 84 85@item @emph{WHOPR or partitioned mode}, designed to utilize multiple 86CPUs and/or a distributed compilation environment to quickly link 87large applications. WHOPR stands for WHOle Program optimizeR (not to 88be confused with the semantics of @option{-fwhole-program}). It 89partitions the aggregated callgraph from many different @code{.o} 90files and distributes the compilation of the sub-graphs to different 91CPUs. 92 93Note that distributed compilation is not implemented yet, but since 94the parallelism is facilitated via generating a @code{Makefile}, it 95would be easy to implement. 96@end enumerate 97 98WHOPR splits LTO into three main stages: 99@enumerate 100@item Local generation (LGEN) 101This stage executes in parallel. Every file in the program is compiled 102into the intermediate language and packaged together with the local 103call-graph and summary information. This stage is the same for both 104the LTO and WHOPR compilation mode. 105 106@item Whole Program Analysis (WPA) 107WPA is performed sequentially. The global call-graph is generated, and 108a global analysis procedure makes transformation decisions. The global 109call-graph is partitioned to facilitate parallel optimization during 110phase 3. The results of the WPA stage are stored into new object files 111which contain the partitions of program expressed in the intermediate 112language and the optimization decisions. 113 114@item Local transformations (LTRANS) 115This stage executes in parallel. All the decisions made during phase 2 116are implemented locally in each partitioned object file, and the final 117object code is generated. Optimizations which cannot be decided 118efficiently during the phase 2 may be performed on the local 119call-graph partitions. 120@end enumerate 121 122WHOPR can be seen as an extension of the usual LTO mode of 123compilation. In LTO, WPA and LTRANS are executed within a single 124execution of the compiler, after the whole program has been read into 125memory. 126 127When compiling in WHOPR mode, the callgraph is partitioned during 128the WPA stage. The whole program is split into a given number of 129partitions of roughly the same size. The compiler tries to 130minimize the number of references which cross partition boundaries. 131The main advantage of WHOPR is to allow the parallel execution of 132LTRANS stages, which are the most time-consuming part of the 133compilation process. Additionally, it avoids the need to load the 134whole program into memory. 135 136 137@section LTO file sections 138 139LTO information is stored in several ELF sections inside object files. 140Data structures and enum codes for sections are defined in 141@file{lto-streamer.h}. 142 143These sections are emitted from @file{lto-streamer-out.c} and mapped 144in all at once from @file{lto/lto.c}:@code{lto_file_read}. The 145individual functions dealing with the reading/writing of each section 146are described below. 147 148@itemize @bullet 149@item Command line options (@code{.gnu.lto_.opts}) 150 151This section contains the command line options used to generate the 152object files. This is used at link time to determine the optimization 153level and other settings when they are not explicitly specified at the 154linker command line. 155 156Currently, GCC does not support combining LTO object files compiled 157with different set of the command line options into a single binary. 158At link time, the options given on the command line and the options 159saved on all the files in a link-time set are applied globally. No 160attempt is made at validating the combination of flags (other than the 161usual validation done by option processing). This is implemented in 162@file{lto/lto.c}:@code{lto_read_all_file_options}. 163 164 165@item Symbol table (@code{.gnu.lto_.symtab}) 166 167This table replaces the ELF symbol table for functions and variables 168represented in the LTO IL. Symbols used and exported by the optimized 169assembly code of ``fat'' objects might not match the ones used and 170exported by the intermediate code. This table is necessary because 171the intermediate code is less optimized and thus requires a separate 172symbol table. 173 174Additionally, the binary code in the ``fat'' object will lack a call 175to a function, since the call was optimized out at compilation time 176after the intermediate language was streamed out. In some special 177cases, the same optimization may not happen during link-time 178optimization. This would lead to an undefined symbol if only one 179symbol table was used. 180 181The symbol table is emitted in 182@file{lto-streamer-out.c}:@code{produce_symtab}. 183 184 185@item Global declarations and types (@code{.gnu.lto_.decls}) 186 187This section contains an intermediate language dump of all 188declarations and types required to represent the callgraph, static 189variables and top-level debug info. 190 191The contents of this section are emitted in 192@file{lto-streamer-out.c}:@code{produce_asm_for_decls}. Types and 193symbols are emitted in a topological order that preserves the sharing 194of pointers when the file is read back in 195(@file{lto.c}:@code{read_cgraph_and_symbols}). 196 197 198@item The callgraph (@code{.gnu.lto_.cgraph}) 199 200This section contains the basic data structure used by the GCC 201inter-procedural optimization infrastructure. This section stores an 202annotated multi-graph which represents the functions and call sites as 203well as the variables, aliases and top-level @code{asm} statements. 204 205This section is emitted in 206@file{lto-streamer-out.c}:@code{output_cgraph} and read in 207@file{lto-cgraph.c}:@code{input_cgraph}. 208 209 210@item IPA references (@code{.gnu.lto_.refs}) 211 212This section contains references between function and static 213variables. It is emitted by @file{lto-cgraph.c}:@code{output_refs} 214and read by @file{lto-cgraph.c}:@code{input_refs}. 215 216 217@item Function bodies (@code{.gnu.lto_.function_body.<name>}) 218 219This section contains function bodies in the intermediate language 220representation. Every function body is in a separate section to allow 221copying of the section independently to different object files or 222reading the function on demand. 223 224Functions are emitted in 225@file{lto-streamer-out.c}:@code{output_function} and read in 226@file{lto-streamer-in.c}:@code{input_function}. 227 228 229@item Static variable initializers (@code{.gnu.lto_.vars}) 230 231This section contains all the symbols in the global variable pool. It 232is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in 233@file{lto-cgraph.c}:@code{input_cgraph}. 234 235@item Summaries and optimization summaries used by IPA passes 236(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs}, 237@code{pureconst} or @code{reference}) 238 239These sections are used by IPA passes that need to emit summary 240information during LTO generation to be read and aggregated at 241link time. Each pass is responsible for implementing two pass manager 242hooks: one for writing the summary and another for reading it in. The 243format of these sections is entirely up to each individual pass. The 244only requirement is that the writer and reader hooks agree on the 245format. 246@end itemize 247 248 249@section Using summary information in IPA passes 250 251Programs are represented internally as a @emph{callgraph} (a 252multi-graph where nodes are functions and edges are call sites) 253and a @emph{varpool} (a list of static and external variables in 254the program). 255 256The inter-procedural optimization is organized as a sequence of 257individual passes, which operate on the callgraph and the 258varpool. To make the implementation of WHOPR possible, every 259inter-procedural optimization pass is split into several stages 260that are executed at different times during WHOPR compilation: 261 262@itemize @bullet 263@item LGEN time 264@enumerate 265@item @emph{Generate summary} (@code{generate_summary} in 266@code{struct ipa_opt_pass_d}). This stage analyzes every function 267body and variable initializer is examined and stores relevant 268information into a pass-specific data structure. 269 270@item @emph{Write summary} (@code{write_summary} in 271@code{struct ipa_opt_pass_d}). This stage writes all the 272pass-specific information generated by @code{generate_summary}. 273Summaries go into their own @code{LTO_section_*} sections that 274have to be declared in @file{lto-streamer.h}:@code{enum 275lto_section_type}. A new section is created by calling 276@code{create_output_block} and data can be written using the 277@code{lto_output_*} routines. 278@end enumerate 279 280@item WPA time 281@enumerate 282@item @emph{Read summary} (@code{read_summary} in 283@code{struct ipa_opt_pass_d}). This stage reads all the 284pass-specific information in exactly the same order that it was 285written by @code{write_summary}. 286 287@item @emph{Execute} (@code{execute} in @code{struct 288opt_pass}). This performs inter-procedural propagation. This 289must be done without actual access to the individual function 290bodies or variable initializers. Typically, this results in a 291transitive closure operation over the summary information of all 292the nodes in the callgraph. 293 294@item @emph{Write optimization summary} 295(@code{write_optimization_summary} in @code{struct 296ipa_opt_pass_d}). This writes the result of the inter-procedural 297propagation into the object file. This can use the same data 298structures and helper routines used in @code{write_summary}. 299@end enumerate 300 301@item LTRANS time 302@enumerate 303@item @emph{Read optimization summary} 304(@code{read_optimization_summary} in @code{struct 305ipa_opt_pass_d}). The counterpart to 306@code{write_optimization_summary}. This reads the interprocedural 307optimization decisions in exactly the same format emitted by 308@code{write_optimization_summary}. 309 310@item @emph{Transform} (@code{function_transform} and 311@code{variable_transform} in @code{struct ipa_opt_pass_d}). 312The actual function bodies and variable initializers are updated 313based on the information passed down from the @emph{Execute} stage. 314@end enumerate 315@end itemize 316 317The implementation of the inter-procedural passes are shared 318between LTO, WHOPR and classic non-LTO compilation. 319 320@itemize 321@item During the traditional file-by-file mode every pass executes its 322own @emph{Generate summary}, @emph{Execute}, and @emph{Transform} 323stages within the single execution context of the compiler. 324 325@item In LTO compilation mode, every pass uses @emph{Generate 326summary} and @emph{Write summary} stages at compilation time, 327while the @emph{Read summary}, @emph{Execute}, and 328@emph{Transform} stages are executed at link time. 329 330@item In WHOPR mode all stages are used. 331@end itemize 332 333To simplify development, the GCC pass manager differentiates 334between normal inter-procedural passes and small inter-procedural 335passes. A @emph{small inter-procedural pass} 336(@code{SIMPLE_IPA_PASS}) is a pass that does 337everything at once and thus it can not be executed during WPA in 338WHOPR mode. It defines only the @emph{Execute} stage and during 339this stage it accesses and modifies the function bodies. Such 340passes are useful for optimization at LGEN or LTRANS time and are 341used, for example, to implement early optimization before writing 342object files. The simple inter-procedural passes can also be used 343for easier prototyping and development of a new inter-procedural 344pass. 345 346 347@subsection Virtual clones 348 349One of the main challenges of introducing the WHOPR compilation 350mode was addressing the interactions between optimization passes. 351In LTO compilation mode, the passes are executed in a sequence, 352each of which consists of analysis (or @emph{Generate summary}), 353propagation (or @emph{Execute}) and @emph{Transform} stages. 354Once the work of one pass is finished, the next pass sees the 355updated program representation and can execute. This makes the 356individual passes dependent on each other. 357 358In WHOPR mode all passes first execute their @emph{Generate 359summary} stage. Then summary writing marks the end of the LGEN 360stage. At WPA time, 361the summaries are read back into memory and all passes run the 362@emph{Execute} stage. Optimization summaries are streamed and 363sent to LTRANS, where all the passes execute the @emph{Transform} 364stage. 365 366Most optimization passes split naturally into analysis, 367propagation and transformation stages. But some do not. The 368main problem arises when one pass performs changes and the 369following pass gets confused by seeing different callgraphs 370between the @emph{Transform} stage and the @emph{Generate summary} 371or @emph{Execute} stage. This means that the passes are required 372to communicate their decisions with each other. 373 374To facilitate this communication, the GCC callgraph 375infrastructure implements @emph{virtual clones}, a method of 376representing the changes performed by the optimization passes in 377the callgraph without needing to update function bodies. 378 379A @emph{virtual clone} in the callgraph is a function that has no 380associated body, just a description of how to create its body based 381on a different function (which itself may be a virtual clone). 382 383The description of function modifications includes adjustments to 384the function's signature (which allows, for example, removing or 385adding function arguments), substitutions to perform on the 386function body, and, for inlined functions, a pointer to the 387function that it will be inlined into. 388 389It is also possible to redirect any edge of the callgraph from a 390function to its virtual clone. This implies updating of the call 391site to adjust for the new function signature. 392 393Most of the transformations performed by inter-procedural 394optimizations can be represented via virtual clones. For 395instance, a constant propagation pass can produce a virtual clone 396of the function which replaces one of its arguments by a 397constant. The inliner can represent its decisions by producing a 398clone of a function whose body will be later integrated into 399a given function. 400 401Using @emph{virtual clones}, the program can be easily updated 402during the @emph{Execute} stage, solving most of pass interactions 403problems that would otherwise occur during @emph{Transform}. 404 405Virtual clones are later materialized in the LTRANS stage and 406turned into real functions. Passes executed after the virtual 407clone were introduced also perform their @emph{Transform} stage 408on new functions, so for a pass there is no significant 409difference between operating on a real function or a virtual 410clone introduced before its @emph{Execute} stage. 411 412Optimization passes then work on virtual clones introduced before 413their @emph{Execute} stage as if they were real functions. The 414only difference is that clones are not visible during the 415@emph{Generate Summary} stage. 416 417To keep function summaries updated, the callgraph interface 418allows an optimizer to register a callback that is called every 419time a new clone is introduced as well as when the actual 420function or variable is generated or when a function or variable 421is removed. These hooks are registered in the @emph{Generate 422summary} stage and allow the pass to keep its information intact 423until the @emph{Execute} stage. The same hooks can also be 424registered during the @emph{Execute} stage to keep the 425optimization summaries updated for the @emph{Transform} stage. 426 427@subsection IPA references 428 429GCC represents IPA references in the callgraph. For a function 430or variable @code{A}, the @emph{IPA reference} is a list of all 431locations where the address of @code{A} is taken and, when 432@code{A} is a variable, a list of all direct stores and reads 433to/from @code{A}. References represent an oriented multi-graph on 434the union of nodes of the callgraph and the varpool. See 435@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary} 436and 437@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary} 438for details. 439 440@subsection Jump functions 441Suppose that an optimization pass sees a function @code{A} and it 442knows the values of (some of) its arguments. The @emph{jump 443function} describes the value of a parameter of a given function 444call in function @code{A} based on this knowledge. 445 446Jump functions are used by several optimizations, such as the 447inter-procedural constant propagation pass and the 448devirtualization pass. The inliner also uses jump functions to 449perform inlining of callbacks. 450 451@section Whole program assumptions, linker plugin and symbol visibilities 452 453Link-time optimization gives relatively minor benefits when used 454alone. The problem is that propagation of inter-procedural 455information does not work well across functions and variables 456that are called or referenced by other compilation units (such as 457from a dynamically linked library). We say that such functions 458are variables are @emph{externally visible}. 459 460To make the situation even more difficult, many applications 461organize themselves as a set of shared libraries, and the default 462ELF visibility rules allow one to overwrite any externally 463visible symbol with a different symbol at runtime. This 464basically disables any optimizations across such functions and 465variables, because the compiler cannot be sure that the function 466body it is seeing is the same function body that will be used at 467runtime. Any function or variable not declared @code{static} in 468the sources degrades the quality of inter-procedural 469optimization. 470 471To avoid this problem the compiler must assume that it sees the 472whole program when doing link-time optimization. Strictly 473speaking, the whole program is rarely visible even at link-time. 474Standard system libraries are usually linked dynamically or not 475provided with the link-time information. In GCC, the whole 476program option (@option{-fwhole-program}) asserts that every 477function and variable defined in the current compilation 478unit is static, except for function @code{main} (note: at 479link time, the current unit is the union of all objects compiled 480with LTO). Since some functions and variables need to 481be referenced externally, for example by another DSO or from an 482assembler file, GCC also provides the function and variable 483attribute @code{externally_visible} which can be used to disable 484the effect of @option{-fwhole-program} on a specific symbol. 485 486The whole program mode assumptions are slightly more complex in 487C++, where inline functions in headers are put into @emph{COMDAT} 488sections. COMDAT function and variables can be defined by 489multiple object files and their bodies are unified at link-time 490and dynamic link-time. COMDAT functions are changed to local only 491when their address is not taken and thus un-sharing them with a 492library is not harmful. COMDAT variables always remain externally 493visible, however for readonly variables it is assumed that their 494initializers cannot be overwritten by a different value. 495 496GCC provides the function and variable attribute 497@code{visibility} that can be used to specify the visibility of 498externally visible symbols (or alternatively an 499@option{-fdefault-visibility} command line option). ELF defines 500the @code{default}, @code{protected}, @code{hidden} and 501@code{internal} visibilities. 502 503The most commonly used is visibility is @code{hidden}. It 504specifies that the symbol cannot be referenced from outside of 505the current shared library. Unfortunately, this information 506cannot be used directly by the link-time optimization in the 507compiler since the whole shared library also might contain 508non-LTO objects and those are not visible to the compiler. 509 510GCC solves this problem using linker plugins. A @emph{linker 511plugin} is an interface to the linker that allows an external 512program to claim the ownership of a given object file. The linker 513then performs the linking procedure by querying the plugin about 514the symbol table of the claimed objects and once the linking 515decisions are complete, the plugin is allowed to provide the 516final object file before the actual linking is made. The linker 517plugin obtains the symbol resolution information which specifies 518which symbols provided by the claimed objects are bound from the 519rest of a binary being linked. 520 521Currently, the linker plugin works only in combination 522with the Gold linker, but a GNU ld implementation is under 523development. 524 525GCC is designed to be independent of the rest of the toolchain 526and aims to support linkers without plugin support. For this 527reason it does not use the linker plugin by default. Instead, 528the object files are examined by @command{collect2} before being 529passed to the linker and objects found to have LTO sections are 530passed to @command{lto1} first. This mode does not work for 531library archives. The decision on what object files from the 532archive are needed depends on the actual linking and thus GCC 533would have to implement the linker itself. The resolution 534information is missing too and thus GCC needs to make an educated 535guess based on @option{-fwhole-program}. Without the linker 536plugin GCC also assumes that symbols are declared @code{hidden} 537and not referred by non-LTO code by default. 538 539@section Internal flags controlling @code{lto1} 540 541The following flags are passed into @command{lto1} and are not 542meant to be used directly from the command line. 543 544@itemize 545@item -fwpa 546@opindex fwpa 547This option runs the serial part of the link-time optimizer 548performing the inter-procedural propagation (WPA mode). The 549compiler reads in summary information from all inputs and 550performs an analysis based on summary information only. It 551generates object files for subsequent runs of the link-time 552optimizer where individual object files are optimized using both 553summary information from the WPA mode and the actual function 554bodies. It then drives the LTRANS phase. 555 556@item -fltrans 557@opindex fltrans 558This option runs the link-time optimizer in the 559local-transformation (LTRANS) mode, which reads in output from a 560previous run of the LTO in WPA mode. In the LTRANS mode, LTO 561optimizes an object and produces the final assembly. 562 563@item -fltrans-output-list=@var{file} 564@opindex fltrans-output-list 565This option specifies a file to which the names of LTRANS output 566files are written. This option is only meaningful in conjunction 567with @option{-fwpa}. 568@end itemize 569