xref: /dragonfly/contrib/gcc-4.7/gcc/doc/lto.texi (revision ef2b2b9d)
1@c Copyright (c) 2010 Free Software Foundation, Inc.
2@c Free Software Foundation, Inc.
3@c This is part of the GCC manual.
4@c For copying conditions, see the file gcc.texi.
5@c Contributed by Jan Hubicka <jh@suse.cz> and
6@c Diego Novillo <dnovillo@google.com>
7
8@node LTO
9@chapter Link Time Optimization
10@cindex lto
11@cindex whopr
12@cindex wpa
13@cindex ltrans
14
15@section Design Overview
16
17Link time optimization is implemented as a GCC front end for a
18bytecode representation of GIMPLE that is emitted in special sections
19of @code{.o} files.  Currently, LTO support is enabled in most
20ELF-based systems, as well as darwin, cygwin and mingw systems.
21
22Since GIMPLE bytecode is saved alongside final object code, object
23files generated with LTO support are larger than regular object files.
24This ``fat'' object format makes it easy to integrate LTO into
25existing build systems, as one can, for instance, produce archives of
26the files.  Additionally, one might be able to ship one set of fat
27objects which could be used both for development and the production of
28optimized builds.  A, perhaps surprising, side effect of this feature
29is that any mistake in the toolchain that leads to LTO information not
30being used (e.g.@: an older @code{libtool} calling @code{ld} directly).
31This is both an advantage, as the system is more robust, and a
32disadvantage, as the user is not informed that the optimization has
33been disabled.
34
35The current implementation only produces ``fat'' objects, effectively
36doubling compilation time and increasing file sizes up to 5x the
37original size.  This hides the problem that some tools, such as
38@code{ar} and @code{nm}, need to understand symbol tables of LTO
39sections.  These tools were extended to use the plugin infrastructure,
40and with these problems solved, GCC will also support ``slim'' objects
41consisting of the intermediate code alone.
42
43At the highest level, LTO splits the compiler in two.  The first half
44(the ``writer'') produces a streaming representation of all the
45internal data structures needed to optimize and generate code.  This
46includes declarations, types, the callgraph and the GIMPLE representation
47of function bodies.
48
49When @option{-flto} is given during compilation of a source file, the
50pass manager executes all the passes in @code{all_lto_gen_passes}.
51Currently, this phase is composed of two IPA passes:
52
53@itemize @bullet
54@item @code{pass_ipa_lto_gimple_out}
55This pass executes the function @code{lto_output} in
56@file{lto-streamer-out.c}, which traverses the call graph encoding
57every reachable declaration, type and function.  This generates a
58memory representation of all the file sections described below.
59
60@item @code{pass_ipa_lto_finish_out}
61This pass executes the function @code{produce_asm_for_decls} in
62@file{lto-streamer-out.c}, which takes the memory image built in the
63previous pass and encodes it in the corresponding ELF file sections.
64@end itemize
65
66The second half of LTO support is the ``reader''.  This is implemented
67as the GCC front end @file{lto1} in @file{lto/lto.c}.  When
68@file{collect2} detects a link set of @code{.o}/@code{.a} files with
69LTO information and the @option{-flto} is enabled, it invokes
70@file{lto1} which reads the set of files and aggregates them into a
71single translation unit for optimization.  The main entry point for
72the reader is @file{lto/lto.c}:@code{lto_main}.
73
74@subsection LTO modes of operation
75
76One of the main goals of the GCC link-time infrastructure was to allow
77effective compilation of large programs.  For this reason GCC implements two
78link-time compilation modes.
79
80@enumerate
81@item	@emph{LTO mode}, in which the whole program is read into the
82compiler at link-time and optimized in a similar way as if it
83were a single source-level compilation unit.
84
85@item	@emph{WHOPR or partitioned mode}, designed to utilize multiple
86CPUs and/or a distributed compilation environment to quickly link
87large applications.  WHOPR stands for WHOle Program optimizeR (not to
88be confused with the semantics of @option{-fwhole-program}).  It
89partitions the aggregated callgraph from many different @code{.o}
90files and distributes the compilation of the sub-graphs to different
91CPUs.
92
93Note that distributed compilation is not implemented yet, but since
94the parallelism is facilitated via generating a @code{Makefile}, it
95would be easy to implement.
96@end enumerate
97
98WHOPR splits LTO into three main stages:
99@enumerate
100@item Local generation (LGEN)
101This stage executes in parallel.  Every file in the program is compiled
102into the intermediate language and packaged together with the local
103call-graph and summary information.  This stage is the same for both
104the LTO and WHOPR compilation mode.
105
106@item Whole Program Analysis (WPA)
107WPA is performed sequentially.  The global call-graph is generated, and
108a global analysis procedure makes transformation decisions.  The global
109call-graph is partitioned to facilitate parallel optimization during
110phase 3.  The results of the WPA stage are stored into new object files
111which contain the partitions of program expressed in the intermediate
112language and the optimization decisions.
113
114@item Local transformations (LTRANS)
115This stage executes in parallel.  All the decisions made during phase 2
116are implemented locally in each partitioned object file, and the final
117object code is generated.  Optimizations which cannot be decided
118efficiently during the phase 2 may be performed on the local
119call-graph partitions.
120@end enumerate
121
122WHOPR can be seen as an extension of the usual LTO mode of
123compilation.  In LTO, WPA and LTRANS are executed within a single
124execution of the compiler, after the whole program has been read into
125memory.
126
127When compiling in WHOPR mode, the callgraph is partitioned during
128the WPA stage.  The whole program is split into a given number of
129partitions of roughly the same size.  The compiler tries to
130minimize the number of references which cross partition boundaries.
131The main advantage of WHOPR is to allow the parallel execution of
132LTRANS stages, which are the most time-consuming part of the
133compilation process.  Additionally, it avoids the need to load the
134whole program into memory.
135
136
137@section LTO file sections
138
139LTO information is stored in several ELF sections inside object files.
140Data structures and enum codes for sections are defined in
141@file{lto-streamer.h}.
142
143These sections are emitted from @file{lto-streamer-out.c} and mapped
144in all at once from @file{lto/lto.c}:@code{lto_file_read}.  The
145individual functions dealing with the reading/writing of each section
146are described below.
147
148@itemize @bullet
149@item Command line options (@code{.gnu.lto_.opts})
150
151This section contains the command line options used to generate the
152object files.  This is used at link time to determine the optimization
153level and other settings when they are not explicitly specified at the
154linker command line.
155
156Currently, GCC does not support combining LTO object files compiled
157with different set of the command line options into a single binary.
158At link time, the options given on the command line and the options
159saved on all the files in a link-time set are applied globally.  No
160attempt is made at validating the combination of flags (other than the
161usual validation done by option processing).  This is implemented in
162@file{lto/lto.c}:@code{lto_read_all_file_options}.
163
164
165@item Symbol table (@code{.gnu.lto_.symtab})
166
167This table replaces the ELF symbol table for functions and variables
168represented in the LTO IL.  Symbols used and exported by the optimized
169assembly code of ``fat'' objects might not match the ones used and
170exported by the intermediate code.  This table is necessary because
171the intermediate code is less optimized and thus requires a separate
172symbol table.
173
174Additionally, the binary code in the ``fat'' object will lack a call
175to a function, since the call was optimized out at compilation time
176after the intermediate language was streamed out.  In some special
177cases, the same optimization may not happen during link-time
178optimization.  This would lead to an undefined symbol if only one
179symbol table was used.
180
181The symbol table is emitted in
182@file{lto-streamer-out.c}:@code{produce_symtab}.
183
184
185@item Global declarations and types (@code{.gnu.lto_.decls})
186
187This section contains an intermediate language dump of all
188declarations and types required to represent the callgraph, static
189variables and top-level debug info.
190
191The contents of this section are emitted in
192@file{lto-streamer-out.c}:@code{produce_asm_for_decls}.  Types and
193symbols are emitted in a topological order that preserves the sharing
194of pointers when the file is read back in
195(@file{lto.c}:@code{read_cgraph_and_symbols}).
196
197
198@item The callgraph (@code{.gnu.lto_.cgraph})
199
200This section contains the basic data structure used by the GCC
201inter-procedural optimization infrastructure.  This section stores an
202annotated multi-graph which represents the functions and call sites as
203well as the variables, aliases and top-level @code{asm} statements.
204
205This section is emitted in
206@file{lto-streamer-out.c}:@code{output_cgraph} and read in
207@file{lto-cgraph.c}:@code{input_cgraph}.
208
209
210@item IPA references (@code{.gnu.lto_.refs})
211
212This section contains references between function and static
213variables.  It is emitted by @file{lto-cgraph.c}:@code{output_refs}
214and read by @file{lto-cgraph.c}:@code{input_refs}.
215
216
217@item Function bodies (@code{.gnu.lto_.function_body.<name>})
218
219This section contains function bodies in the intermediate language
220representation.  Every function body is in a separate section to allow
221copying of the section independently to different object files or
222reading the function on demand.
223
224Functions are emitted in
225@file{lto-streamer-out.c}:@code{output_function} and read in
226@file{lto-streamer-in.c}:@code{input_function}.
227
228
229@item Static variable initializers (@code{.gnu.lto_.vars})
230
231This section contains all the symbols in the global variable pool.  It
232is emitted by @file{lto-cgraph.c}:@code{output_varpool} and read in
233@file{lto-cgraph.c}:@code{input_cgraph}.
234
235@item Summaries and optimization summaries used by IPA passes
236(@code{.gnu.lto_.<xxx>}, where @code{<xxx>} is one of @code{jmpfuncs},
237@code{pureconst} or @code{reference})
238
239These sections are used by IPA passes that need to emit summary
240information during LTO generation to be read and aggregated at
241link time.  Each pass is responsible for implementing two pass manager
242hooks: one for writing the summary and another for reading it in.  The
243format of these sections is entirely up to each individual pass.  The
244only requirement is that the writer and reader hooks agree on the
245format.
246@end itemize
247
248
249@section Using summary information in IPA passes
250
251Programs are represented internally as a @emph{callgraph} (a
252multi-graph where nodes are functions and edges are call sites)
253and a @emph{varpool} (a list of static and external variables in
254the program).
255
256The inter-procedural optimization is organized as a sequence of
257individual passes, which operate on the callgraph and the
258varpool.  To make the implementation of WHOPR possible, every
259inter-procedural optimization pass is split into several stages
260that are executed at different times during WHOPR compilation:
261
262@itemize @bullet
263@item LGEN time
264@enumerate
265@item @emph{Generate summary} (@code{generate_summary} in
266@code{struct ipa_opt_pass_d}).  This stage analyzes every function
267body and variable initializer is examined and stores relevant
268information into a pass-specific data structure.
269
270@item @emph{Write summary} (@code{write_summary} in
271@code{struct ipa_opt_pass_d}).  This stage writes all the
272pass-specific information generated by @code{generate_summary}.
273Summaries go into their own @code{LTO_section_*} sections that
274have to be declared in @file{lto-streamer.h}:@code{enum
275lto_section_type}.  A new section is created by calling
276@code{create_output_block} and data can be written using the
277@code{lto_output_*} routines.
278@end enumerate
279
280@item WPA time
281@enumerate
282@item @emph{Read summary} (@code{read_summary} in
283@code{struct ipa_opt_pass_d}).  This stage reads all the
284pass-specific information in exactly the same order that it was
285written by @code{write_summary}.
286
287@item @emph{Execute} (@code{execute} in @code{struct
288opt_pass}).  This performs inter-procedural propagation.  This
289must be done without actual access to the individual function
290bodies or variable initializers.  Typically, this results in a
291transitive closure operation over the summary information of all
292the nodes in the callgraph.
293
294@item @emph{Write optimization summary}
295(@code{write_optimization_summary} in @code{struct
296ipa_opt_pass_d}).  This writes the result of the inter-procedural
297propagation into the object file.  This can use the same data
298structures and helper routines used in @code{write_summary}.
299@end enumerate
300
301@item LTRANS time
302@enumerate
303@item @emph{Read optimization summary}
304(@code{read_optimization_summary} in @code{struct
305ipa_opt_pass_d}).  The counterpart to
306@code{write_optimization_summary}.  This reads the interprocedural
307optimization decisions in exactly the same format emitted by
308@code{write_optimization_summary}.
309
310@item @emph{Transform} (@code{function_transform} and
311@code{variable_transform} in @code{struct ipa_opt_pass_d}).
312The actual function bodies and variable initializers are updated
313based on the information passed down from the @emph{Execute} stage.
314@end enumerate
315@end itemize
316
317The implementation of the inter-procedural passes are shared
318between LTO, WHOPR and classic non-LTO compilation.
319
320@itemize
321@item During the traditional file-by-file mode every pass executes its
322own @emph{Generate summary}, @emph{Execute}, and @emph{Transform}
323stages within the single execution context of the compiler.
324
325@item In LTO compilation mode, every pass uses @emph{Generate
326summary} and @emph{Write summary} stages at compilation time,
327while the @emph{Read summary}, @emph{Execute}, and
328@emph{Transform} stages are executed at link time.
329
330@item In WHOPR mode all stages are used.
331@end itemize
332
333To simplify development, the GCC pass manager differentiates
334between normal inter-procedural passes and small inter-procedural
335passes.  A @emph{small inter-procedural pass}
336(@code{SIMPLE_IPA_PASS}) is a pass that does
337everything at once and thus it can not be executed during WPA in
338WHOPR mode.  It defines only the @emph{Execute} stage and during
339this stage it accesses and modifies the function bodies.  Such
340passes are useful for optimization at LGEN or LTRANS time and are
341used, for example, to implement early optimization before writing
342object files.  The simple inter-procedural passes can also be used
343for easier prototyping and development of a new inter-procedural
344pass.
345
346
347@subsection Virtual clones
348
349One of the main challenges of introducing the WHOPR compilation
350mode was addressing the interactions between optimization passes.
351In LTO compilation mode, the passes are executed in a sequence,
352each of which consists of analysis (or @emph{Generate summary}),
353propagation (or @emph{Execute}) and @emph{Transform} stages.
354Once the work of one pass is finished, the next pass sees the
355updated program representation and can execute.  This makes the
356individual passes dependent on each other.
357
358In WHOPR mode all passes first execute their @emph{Generate
359summary} stage.  Then summary writing marks the end of the LGEN
360stage.  At WPA time,
361the summaries are read back into memory and all passes run the
362@emph{Execute} stage.  Optimization summaries are streamed and
363sent to LTRANS, where all the passes execute the @emph{Transform}
364stage.
365
366Most optimization passes split naturally into analysis,
367propagation and transformation stages.  But some do not.  The
368main problem arises when one pass performs changes and the
369following pass gets confused by seeing different callgraphs
370between the @emph{Transform} stage and the @emph{Generate summary}
371or @emph{Execute} stage.  This means that the passes are required
372to communicate their decisions with each other.
373
374To facilitate this communication, the GCC callgraph
375infrastructure implements @emph{virtual clones}, a method of
376representing the changes performed by the optimization passes in
377the callgraph without needing to update function bodies.
378
379A @emph{virtual clone} in the callgraph is a function that has no
380associated body, just a description of how to create its body based
381on a different function (which itself may be a virtual clone).
382
383The description of function modifications includes adjustments to
384the function's signature (which allows, for example, removing or
385adding function arguments), substitutions to perform on the
386function body, and, for inlined functions, a pointer to the
387function that it will be inlined into.
388
389It is also possible to redirect any edge of the callgraph from a
390function to its virtual clone.  This implies updating of the call
391site to adjust for the new function signature.
392
393Most of the transformations performed by inter-procedural
394optimizations can be represented via virtual clones.  For
395instance, a constant propagation pass can produce a virtual clone
396of the function which replaces one of its arguments by a
397constant.  The inliner can represent its decisions by producing a
398clone of a function whose body will be later integrated into
399a given function.
400
401Using @emph{virtual clones}, the program can be easily updated
402during the @emph{Execute} stage, solving most of pass interactions
403problems that would otherwise occur during @emph{Transform}.
404
405Virtual clones are later materialized in the LTRANS stage and
406turned into real functions.  Passes executed after the virtual
407clone were introduced also perform their @emph{Transform} stage
408on new functions, so for a pass there is no significant
409difference between operating on a real function or a virtual
410clone introduced before its @emph{Execute} stage.
411
412Optimization passes then work on virtual clones introduced before
413their @emph{Execute} stage as if they were real functions.  The
414only difference is that clones are not visible during the
415@emph{Generate Summary} stage.
416
417To keep function summaries updated, the callgraph interface
418allows an optimizer to register a callback that is called every
419time a new clone is introduced as well as when the actual
420function or variable is generated or when a function or variable
421is removed.  These hooks are registered in the @emph{Generate
422summary} stage and allow the pass to keep its information intact
423until the @emph{Execute} stage.  The same hooks can also be
424registered during the @emph{Execute} stage to keep the
425optimization summaries updated for the @emph{Transform} stage.
426
427@subsection IPA references
428
429GCC represents IPA references in the callgraph.  For a function
430or variable @code{A}, the @emph{IPA reference} is a list of all
431locations where the address of @code{A} is taken and, when
432@code{A} is a variable, a list of all direct stores and reads
433to/from @code{A}.  References represent an oriented multi-graph on
434the union of nodes of the callgraph and the varpool.  See
435@file{ipa-reference.c}:@code{ipa_reference_write_optimization_summary}
436and
437@file{ipa-reference.c}:@code{ipa_reference_read_optimization_summary}
438for details.
439
440@subsection Jump functions
441Suppose that an optimization pass sees a function @code{A} and it
442knows the values of (some of) its arguments.  The @emph{jump
443function} describes the value of a parameter of a given function
444call in function @code{A} based on this knowledge.
445
446Jump functions are used by several optimizations, such as the
447inter-procedural constant propagation pass and the
448devirtualization pass.  The inliner also uses jump functions to
449perform inlining of callbacks.
450
451@section Whole program assumptions, linker plugin and symbol visibilities
452
453Link-time optimization gives relatively minor benefits when used
454alone.  The problem is that propagation of inter-procedural
455information does not work well across functions and variables
456that are called or referenced by other compilation units (such as
457from a dynamically linked library).  We say that such functions
458are variables are @emph{externally visible}.
459
460To make the situation even more difficult, many applications
461organize themselves as a set of shared libraries, and the default
462ELF visibility rules allow one to overwrite any externally
463visible symbol with a different symbol at runtime.  This
464basically disables any optimizations across such functions and
465variables, because the compiler cannot be sure that the function
466body it is seeing is the same function body that will be used at
467runtime.  Any function or variable not declared @code{static} in
468the sources degrades the quality of inter-procedural
469optimization.
470
471To avoid this problem the compiler must assume that it sees the
472whole program when doing link-time optimization.  Strictly
473speaking, the whole program is rarely visible even at link-time.
474Standard system libraries are usually linked dynamically or not
475provided with the link-time information.  In GCC, the whole
476program option (@option{-fwhole-program}) asserts that every
477function and variable defined in the current compilation
478unit is static, except for function @code{main} (note: at
479link time, the current unit is the union of all objects compiled
480with LTO).  Since some functions and variables need to
481be referenced externally, for example by another DSO or from an
482assembler file, GCC also provides the function and variable
483attribute @code{externally_visible} which can be used to disable
484the effect of @option{-fwhole-program} on a specific symbol.
485
486The whole program mode assumptions are slightly more complex in
487C++, where inline functions in headers are put into @emph{COMDAT}
488sections.  COMDAT function and variables can be defined by
489multiple object files and their bodies are unified at link-time
490and dynamic link-time.  COMDAT functions are changed to local only
491when their address is not taken and thus un-sharing them with a
492library is not harmful.  COMDAT variables always remain externally
493visible, however for readonly variables it is assumed that their
494initializers cannot be overwritten by a different value.
495
496GCC provides the function and variable attribute
497@code{visibility} that can be used to specify the visibility of
498externally visible symbols (or alternatively an
499@option{-fdefault-visibility} command line option).  ELF defines
500the @code{default}, @code{protected}, @code{hidden} and
501@code{internal} visibilities.
502
503The most commonly used is visibility is @code{hidden}.  It
504specifies that the symbol cannot be referenced from outside of
505the current shared library.  Unfortunately, this information
506cannot be used directly by the link-time optimization in the
507compiler since the whole shared library also might contain
508non-LTO objects and those are not visible to the compiler.
509
510GCC solves this problem using linker plugins.  A @emph{linker
511plugin} is an interface to the linker that allows an external
512program to claim the ownership of a given object file.  The linker
513then performs the linking procedure by querying the plugin about
514the symbol table of the claimed objects and once the linking
515decisions are complete, the plugin is allowed to provide the
516final object file before the actual linking is made.  The linker
517plugin obtains the symbol resolution information which specifies
518which symbols provided by the claimed objects are bound from the
519rest of a binary being linked.
520
521Currently, the linker plugin  works only in combination
522with the Gold linker, but a GNU ld implementation is under
523development.
524
525GCC is designed to be independent of the rest of the toolchain
526and aims to support linkers without plugin support.  For this
527reason it does not use the linker plugin by default.  Instead,
528the object files are examined by @command{collect2} before being
529passed to the linker and objects found to have LTO sections are
530passed to @command{lto1} first.  This mode does not work for
531library archives.  The decision on what object files from the
532archive are needed depends on the actual linking and thus GCC
533would have to implement the linker itself.  The resolution
534information is missing too and thus GCC needs to make an educated
535guess based on @option{-fwhole-program}.  Without the linker
536plugin GCC also assumes that symbols are declared @code{hidden}
537and not referred by non-LTO code by default.
538
539@section Internal flags controlling @code{lto1}
540
541The following flags are passed into @command{lto1} and are not
542meant to be used directly from the command line.
543
544@itemize
545@item -fwpa
546@opindex fwpa
547This option runs the serial part of the link-time optimizer
548performing the inter-procedural propagation (WPA mode).  The
549compiler reads in summary information from all inputs and
550performs an analysis based on summary information only.  It
551generates object files for subsequent runs of the link-time
552optimizer where individual object files are optimized using both
553summary information from the WPA mode and the actual function
554bodies.  It then drives the LTRANS phase.
555
556@item -fltrans
557@opindex fltrans
558This option runs the link-time optimizer in the
559local-transformation (LTRANS) mode, which reads in output from a
560previous run of the LTO in WPA mode.  In the LTRANS mode, LTO
561optimizes an object and produces the final assembly.
562
563@item -fltrans-output-list=@var{file}
564@opindex fltrans-output-list
565This option specifies a file to which the names of LTRANS output
566files are written.  This option is only meaningful in conjunction
567with @option{-fwpa}.
568@end itemize
569