1<chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
2	 xml:id="manual.ext.profile_mode" xreflabel="Profile Mode">
3<?dbhtml filename="profile_mode.html"?>
4
5<info><title>Profile Mode</title>
6  <keywordset>
7    <keyword>C++</keyword>
8    <keyword>library</keyword>
9    <keyword>profile</keyword>
10  </keywordset>
11</info>
12
13
14
15
16<section xml:id="manual.ext.profile_mode.intro" xreflabel="Intro"><info><title>Intro</title></info>
17
18  <para>
19  <emphasis>Goal: </emphasis>Give performance improvement advice based on
20  recognition of suboptimal usage patterns of the standard library.
21  </para>
22
23  <para>
24  <emphasis>Method: </emphasis>Wrap the standard library code.  Insert
25  calls to an instrumentation library to record the internal state of
26  various components at interesting entry/exit points to/from the standard
27  library.  Process trace, recognize suboptimal patterns, give advice.
28  For details, see the
29  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4907670/">Perflint
30  paper presented at CGO 2009</link>.
31  </para>
32  <para>
33  <emphasis>Strengths: </emphasis>
34<itemizedlist>
35  <listitem><para>
36  Unintrusive solution.  The application code does not require any
37  modification.
38  </para></listitem>
  <listitem><para> The advice is call context sensitive, so it can precisely
  identify interesting dynamic performance behavior.
41  </para></listitem>
42  <listitem><para>
  The overhead model is pay-per-use: when you turn off a diagnostic class
  at compile time, its overhead disappears.
45  </para></listitem>
46</itemizedlist>
47  </para>
48  <para>
49  <emphasis>Drawbacks: </emphasis>
50<itemizedlist>
51  <listitem><para>
52  You must recompile the application code with custom options.
53  </para></listitem>
54  <listitem><para>You must run the application on representative input.
55  The advice is input dependent.
56  </para></listitem>
57  <listitem><para>
  The execution time will increase, in some cases by a large factor.
59  </para></listitem>
60</itemizedlist>
61  </para>
62
63
64<section xml:id="manual.ext.profile_mode.using" xreflabel="Using"><info><title>Using the Profile Mode</title></info>
65
66
67  <para>
68  This is the anticipated common workflow for program <code>foo.cc</code>:
69<programlisting>
70$ cat foo.cc
71#include &lt;vector&gt;
72int main() {
  std::vector&lt;int&gt; v;
74  for (int k = 0; k &lt; 1024; ++k) v.insert(v.begin(), k);
75}
76
77$ g++ -D_GLIBCXX_PROFILE foo.cc
78$ ./a.out
79$ cat libstdcxx-profile.txt
80vector-to-list: improvement = 5: call stack = 0x804842c ...
81    : advice = change std::vector to std::list
82vector-size: improvement = 3: call stack = 0x804842c ...
83    : advice = change initial container size from 0 to 1024
84</programlisting>
85  </para>
86
87  <para>
88  Anatomy of a warning:
89  <itemizedlist>
90  <listitem>
91  <para>
92  Warning id.  This is a short descriptive string for the class
93  that this warning belongs to.  E.g., "vector-to-list".
94  </para>
95  </listitem>
96  <listitem>
97  <para>
98  Estimated improvement.  This is an approximation of the benefit expected
99  from implementing the change suggested by the warning.  It is given on
100  a log10 scale.  Negative values mean that the alternative would actually
101  do worse than the current choice.
  In the example above, 5 comes from the fact that the overhead of
  inserting at the beginning of a vector rather than a list is around
  1024 * 1024 / 2 element copies, which is on the order of
  10<superscript>5</superscript>.  The improvement from setting the initial
  size to 1024 is on the order of 10<superscript>3</superscript>, since the
  overhead of dynamic resizing is linear in this case.
107  </para>
108  </listitem>
109  <listitem>
110  <para>
111  Call stack.  Currently, the addresses are printed without
112  symbol name or code location attribution.
113  Users are expected to postprocess the output using, for instance, addr2line.
114  </para>
115  </listitem>
116  <listitem>
117  <para>
118  The warning message.  For some warnings, this is static text, e.g.,
119  "change vector to list".  For other warnings, such as the one above,
120  the message contains numeric advice, e.g., the suggested initial size
121  of the vector.
122  </para>
123  </listitem>
124  </itemizedlist>
125  </para>
126
127  <para>Three files are generated.  <code>libstdcxx-profile.txt</code>
128   contains human readable advice.  <code>libstdcxx-profile.raw</code>
129   contains implementation specific data about each diagnostic.
   Its format is not documented, but it is sufficient to generate
   all the advice given in <code>libstdcxx-profile.txt</code>.  The advantage
132   of keeping this raw format is that traces from multiple executions can
133   be aggregated simply by concatenating the raw traces.  We intend to
134   offer an external utility program that can issue advice from a trace.
135   <code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
136   parameters used.  To alter parameters, edit this file and rename it to
137   <code>libstdcxx-profile.conf</code>.
138  </para>
139
  <para>Advice is given regardless of whether the transformation is valid.
141  For instance, we advise changing a map to an unordered_map even if the
142  application semantics require that data be ordered.
143  We believe such warnings can help users understand the performance
144  behavior of their application better, which can lead to changes
145  at a higher abstraction level.
146  </para>
147
148</section>
149
150<section xml:id="manual.ext.profile_mode.tuning" xreflabel="Tuning"><info><title>Tuning the Profile Mode</title></info>
151
152
  <para>Compile time switches and environment variables (see also file
   <code>profiler.h</code>).  Unless specified otherwise, they can be set at
   compile time using <code>-D&lt;name&gt;</code> or by setting variable
   <code>&lt;name&gt;</code> in the environment where the program is run,
   before starting execution.
157  <itemizedlist>
158  <listitem><para>
159   <code>_GLIBCXX_PROFILE_NO_&lt;diagnostic&gt;</code>:
160   disable specific diagnostics.
161   See section Diagnostics for possible values.
162   (Environment variables not supported.)
163   </para></listitem>
164  <listitem><para>
165   <code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
166   path for the output files.
167   </para></listitem>
  <listitem><para>
   <code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to the maximum
   number of warnings desired.  The default value is 10.</para></listitem>
170  <listitem><para>
171   <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0,
172   the advice will
173   be collected and reported for the program as a whole, and not for each
174   call context.
175   This could also be used in continuous regression tests, where you
176   just need to know whether there is a regression or not.
177   The default value is 32.
178   </para></listitem>
179  <listitem><para>
180   <code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
181   set a limit on how much memory to use for the accounting tables for each
182   diagnostic type.  When this limit is reached, new events are ignored
   until the memory usage drops below the limit.  Generally, this means
184   that newly created containers will not be instrumented until some
185   live containers are deleted.  The default is 128 MB.
186   </para></listitem>
187  <listitem><para>
188   <code>_GLIBCXX_PROFILE_NO_THREADS</code>:
189   Make the library not use threads.  If thread local storage (TLS) is not
190   available, you will get a preprocessor error asking you to set
   <code>-D_GLIBCXX_PROFILE_NO_THREADS</code> if your program is single-threaded.
192   Multithreaded execution without TLS is not supported.
193   (Environment variable not supported.)
194   </para></listitem>
195  <listitem><para>
196   <code>_GLIBCXX_HAVE_EXECINFO_H</code>:
197   This name should be defined automatically at library configuration time.
198   If your library was configured without <code>execinfo.h</code>, but
199   you have it in your include path, you can define it explicitly.  Without
200   it, advice is collected for the program as a whole, and not for each
201   call context.
202   (Environment variable not supported.)
203   </para></listitem>
204  </itemizedlist>
205  </para>
206
207</section>
208
209</section>
210
211
212<section xml:id="manual.ext.profile_mode.design" xreflabel="Design"><info><title>Design</title></info>
213<?dbhtml filename="profile_mode_design.html"?>
214
215
216<para>
217</para>
218<table frame="all" xml:id="table.profile_code_loc">
219<title>Profile Code Location</title>
220
221<tgroup cols="2" align="left" colsep="1" rowsep="1">
222<colspec colname="c1"/>
223<colspec colname="c2"/>
224
225<thead>
226  <row>
227    <entry>Code Location</entry>
228    <entry>Use</entry>
229  </row>
230</thead>
231<tbody>
232  <row>
233    <entry><code>libstdc++-v3/include/std/*</code></entry>
234    <entry>Preprocessor code to redirect to profile extension headers.</entry>
235  </row>
236  <row>
237    <entry><code>libstdc++-v3/include/profile/*</code></entry>
238    <entry>Profile extension public headers (map, vector, ...).</entry>
239  </row>
240  <row>
241    <entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
242    <entry>Profile extension internals.  Implementation files are
243     only included from <code>impl/profiler.h</code>, which is the only
244     file included from the public headers.</entry>
245  </row>
246</tbody>
247</tgroup>
248</table>
249
250<para>
251</para>
252
253<section xml:id="manual.ext.profile_mode.design.wrapper" xreflabel="Wrapper"><info><title>Wrapper Model</title></info>
254
255  <para>
256  In order to get our instrumented library version included instead of the
257  release one,
258  we use the same wrapper model as the debug mode.
259  We subclass entities from the release version.  Wherever
260  <code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
261  <code>std::__norm</code>, whereas the profile namespace is
262  <code>std::__profile</code>.  Using plain <code>std</code> translates
263  into <code>std::__profile</code>.
264  </para>
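  <para>
  The following is a simplified sketch of this wrapping pattern, not the
  actual library code: a stand-in release-mode container (playing the role of
  <code>std::__norm::vector</code>) is wrapped by a profile-mode subclass that
  adds an instrumentation call.  The hook function name is hypothetical; the
  real hooks are macros defined in <code>profiler.h</code>.
<programlisting>
// Hypothetical hook, standing in for the real profiler.h macros.
inline void __profcxx_example_push_back(const void*) { }

namespace __norm     // stand-in for the release-mode std::__norm
{
  template&lt;typename _Tp&gt;
    class vector
    {
    public:
      void push_back(const _Tp&amp;) { /* release-mode implementation */ }
    };
}

namespace __profile  // stand-in for std::__profile
{
  template&lt;typename _Tp&gt;
    class vector
    : public __norm::vector&lt;_Tp&gt;   // subclass the release entity
    {
      typedef __norm::vector&lt;_Tp&gt; _Base;

    public:
      void
      push_back(const _Tp&amp; __x)
      {
        __profcxx_example_push_back(this);  // record the event
        _Base::push_back(__x);              // defer to the base
      }
    };
}
</programlisting>
  </para>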
265  <para>
266  Whenever possible, we try to wrap at the public interface level, e.g.,
267  in <code>unordered_set</code> rather than in <code>hashtable</code>,
268  in order not to depend on implementation.
269  </para>
270  <para>
271  Mixing object files built with and without the profile mode must
  not affect the program execution.  However, there are no guarantees about
  the accuracy of diagnostics when even a single object file is not built with
274  <code>-D_GLIBCXX_PROFILE</code>.
275  Currently, mixing the profile mode with debug and parallel extensions is
276  not allowed.  Mixing them at compile time will result in preprocessor errors.
277  Mixing them at link time is undefined.
278  </para>
279</section>
280
281
282<section xml:id="manual.ext.profile_mode.design.instrumentation" xreflabel="Instrumentation"><info><title>Instrumentation</title></info>
283
284  <para>
285  Instead of instrumenting every public entry and exit point,
286  we chose to add instrumentation on demand, as needed
287  by individual diagnostics.
288  The main reason is that some diagnostics require us to extract bits of
289  internal state that are particular only to that diagnostic.
290  We plan to formalize this later, after we learn more about the requirements
291  of several diagnostics.
292  </para>
293  <para>
294  All the instrumentation points can be switched on and off using
295  <code>-D[_NO]_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code> options.
296  With all the instrumentation calls off, there should be negligible
297  overhead over the release version.  This property is needed to support
298  diagnostics based on timing of internal operations.  For such diagnostics,
299  we anticipate turning most of the instrumentation off in order to prevent
300  profiling overhead from polluting time measurements, and thus diagnostics.
301  </para>
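  <para>
  A minimal sketch of this pattern, with hypothetical macro and function
  names: each hook is a macro guarded by its diagnostic's compile time switch,
  so a disabled diagnostic expands to no code at all.
<programlisting>
// Illustration only; see profiler.h for the real macros and switches.
#ifdef _GLIBCXX_PROFILE_EXAMPLE_DIAGNOSTIC
# define __profcxx_example_hook(__obj) \
    __gnu_profile::__trace_example(__obj)
#else
# define __profcxx_example_hook(__obj)
#endif
</programlisting>
  </para>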
302  <para>
303  All the instrumentation on/off compile time switches live in
  <code>include/profile/impl/profiler.h</code>.
305  </para>
306</section>
307
308
309<section xml:id="manual.ext.profile_mode.design.rtlib" xreflabel="Run Time Behavior"><info><title>Run Time Behavior</title></info>
310
311  <para>
312  For practical reasons, the instrumentation library processes the trace
313  partially
314  rather than dumping it to disk in raw form.  Each event is processed when
  it occurs.  It is usually assigned a cost and aggregated into
316  the database of a specific diagnostic class.  The cost model
317  is based largely on the standard performance guarantees, but in some
318  cases we use knowledge about GCC's standard library implementation.
319  </para>
320  <para>
321  Information is indexed by (1) call stack and (2) instance id or address
322  to be able to understand and summarize precise creation-use-destruction
323  dynamic chains.  Although the analysis is sensitive to dynamic instances,
324  the reports are only sensitive to call context.  Whenever a dynamic instance
325  is destroyed, we accumulate its effect to the corresponding entry for the
326  call stack of its constructor location.
327  </para>
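  <para>
  Conceptually, the per-diagnostic bookkeeping can be pictured as the
  two-level indexing below.  The type names are illustrative only, not the
  actual implementation.
<programlisting>
#include &lt;map&gt;
#include &lt;vector&gt;

typedef std::vector&lt;void*&gt; __call_stack;     // return addresses

struct __entry { double __cost; };             // accumulated cost

struct __diagnostic_tables
{
  // live instances, keyed by instance id or object address
  std::map&lt;const void*, __entry&gt;  __objects;
  // reported results, keyed by the constructor's call stack
  std::map&lt;__call_stack, __entry&gt; __by_stack;
};
</programlisting>
  </para>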
328
329  <para>
  For details, see the
331   <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4907670/">paper presented at
332   CGO 2009</link>.
333  </para>
334</section>
335
336
337<section xml:id="manual.ext.profile_mode.design.analysis" xreflabel="Analysis and Diagnostics"><info><title>Analysis and Diagnostics</title></info>
338
339  <para>
340  Final analysis takes place offline, and it is based entirely on the
341  generated trace and debugging info in the application binary.
342  See section Diagnostics for a list of analysis types that we plan to support.
343  </para>
344  <para>
345  The input to the analysis is a table indexed by profile type and call stack.
346  The data type for each entry depends on the profile type.
347  </para>
348</section>
349
350
351<section xml:id="manual.ext.profile_mode.design.cost-model" xreflabel="Cost Model"><info><title>Cost Model</title></info>
352
353  <para>
  While cost models are likely to become more complex as we move to more
  sophisticated analyses, we will start by following a simple set of rules.
357  </para>
358<itemizedlist>
359  <listitem><para><emphasis>Relative benefit estimation:</emphasis>
360  The idea is to estimate or measure the cost of all operations
361  in the original scenario versus the scenario we advise to switch to.
362  For instance, when advising to change a vector to a list, an occurrence
363  of the <code>insert</code> method will generally count as a benefit.
364  Its magnitude depends on (1) the number of elements that get shifted
  and (2) whether it triggers a reallocation (see the sketch after this list).
366  </para></listitem>
367  <listitem><para><emphasis>Synthetic measurements:</emphasis>
368  We will measure the relative difference between similar operations on
369  different containers.  We plan to write a battery of small tests that
370  compare the times of the executions of similar methods on different
371  containers.  The idea is to run these tests on the target machine.
372  If this training phase is very quick, we may decide to perform it at
373  library initialization time.  The results can be cached on disk and reused
374  across runs.
375  </para></listitem>
376  <listitem><para><emphasis>Timers:</emphasis>
377  We plan to use timers for operations of larger granularity, such as sort.
378  For instance, we can switch between different sort methods on the fly
379  and report the one that performs best for each call context.
380  </para></listitem>
381  <listitem><para><emphasis>Show stoppers:</emphasis>
382  We may decide that the presence of an operation nullifies the advice.
383  For instance, when considering switching from <code>set</code> to
384  <code>unordered_set</code>, if we detect use of operator <code>++</code>,
  we will simply not issue the advice, since this could signal that the use
  case requires a sorted container.</para></listitem>
387</itemizedlist>
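  <para>
  As a sketch of the relative benefit rule above, a hypothetical accounting
  for the vector-to-list case could charge each <code>insert</code> with the
  work a list would avoid.  The formula is illustrative, not the one used by
  the library.
<programlisting>
#include &lt;cstddef&gt;

struct __vector_to_list_benefit
{
  double __total;

  __vector_to_list_benefit() : __total(0) { }

  // An insert at position __pos of a vector holding __size elements
  // shifts (__size - __pos) elements; a reallocation copies __size more.
  void
  __record_insert(std::size_t __size, std::size_t __pos, bool __realloc)
  { __total += (__size - __pos) + (__realloc ? __size : 0); }
};
</programlisting>
  </para>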
388
389</section>
390
391
392<section xml:id="manual.ext.profile_mode.design.reports" xreflabel="Reports"><info><title>Reports</title></info>
393
394  <para>
395There are two types of reports.  First, if we recognize a pattern for which
396we have a substitute that is likely to give better performance, we print
397the advice and estimated performance gain.  The advice is usually associated
398to a code position and possibly a call stack.
399  </para>
400  <para>
401Second, we report performance characteristics for which we do not have
a clear solution for improvement.  For instance, we can point the user to
the top 10 <code>multimap</code> locations
404which have the worst data locality in actual traversals.
405Although this does not offer a solution,
406it helps the user focus on the key problems and ignore the uninteresting ones.
407  </para>
408</section>
409
410
411<section xml:id="manual.ext.profile_mode.design.testing" xreflabel="Testing"><info><title>Testing</title></info>
412
413  <para>
414  First, we want to make sure we preserve the behavior of the release mode.
  You can just type <code>make check-profile</code>, which
416  builds and runs the whole test suite in profile mode.
417  </para>
418  <para>
419  Second, we want to test the correctness of each diagnostic.
420  We created a <code>profile</code> directory in the test suite.
421  Each diagnostic must come with at least two tests, one for false positives
422  and one for false negatives.
423  </para>
424</section>
425
426</section>
427
428<section xml:id="manual.ext.profile_mode.api" xreflabel="API"><info><title>Extensions for Custom Containers</title></info>
429<?dbhtml filename="profile_mode_api.html"?>
430
431
432  <para>
433  Many large projects use their own data structures instead of the ones in the
434  standard library.  If these data structures are similar in functionality
435  to the standard library, they can be instrumented with the same hooks
436  that are used to instrument the standard library.
437  The instrumentation API is exposed in file
438  <code>profiler.h</code> (look for "Instrumentation hooks").
439  </para>
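  <para>
  For illustration, a project-specific sequence could be instrumented along
  the lines below.  The hook names shown are hypothetical placeholders,
  declared here only to make the sketch self-contained; check
  <code>profiler.h</code> for the macros that actually exist.
<programlisting>
// Hypothetical hook declarations, standing in for the macros exposed by
// profiler.h; the real names and signatures are in that header.
inline void __profcxx_my_array_construct(const void*, unsigned long) { }
inline void __profcxx_my_array_destruct(const void*, unsigned long) { }

template&lt;typename _Tp&gt;
class my_dynamic_array
{
public:
  my_dynamic_array() : _M_data(0), _M_size(0)
  { __profcxx_my_array_construct(this, 0); }      // record construction

  ~my_dynamic_array()
  {
    __profcxx_my_array_destruct(this, _M_size);   // record final size
    delete[] _M_data;
  }

private:
  _Tp*          _M_data;
  unsigned long _M_size;
};
</programlisting>
  </para>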
440
441</section>
442
443
444<section xml:id="manual.ext.profile_mode.cost_model" xreflabel="Cost Model"><info><title>Empirical Cost Model</title></info>
445<?dbhtml filename="profile_mode_cost_model.html"?>
446
447
448  <para>
449  Currently, the cost model uses formulas with predefined relative weights
450  for alternative containers or container implementations.  For instance,
451  iterating through a vector is X times faster than iterating through a list.
452  </para>
453  <para>
454  (Under development.)
455  We are working on customizing this to a particular machine by providing
456  an automated way to compute the actual relative weights for operations
457  on the given machine.
458  </para>
459  <para>
460  (Under development.)
461  We plan to provide a performance parameter database format that can be
462  filled in either by hand or by an automated training mechanism.
  The analysis module will then use this database instead of the built-in
  generic parameters.
465  </para>
466
467</section>
468
469
470<section xml:id="manual.ext.profile_mode.implementation" xreflabel="Implementation"><info><title>Implementation Issues</title></info>
471<?dbhtml filename="profile_mode_impl.html"?>
472
473
474
475<section xml:id="manual.ext.profile_mode.implementation.stack" xreflabel="Stack Traces"><info><title>Stack Traces</title></info>
476
477  <para>
478  Accurate stack traces are needed during profiling since we group events by
479  call context and dynamic instance.  Without accurate traces, diagnostics
480  may be hard to interpret.  For instance, when giving advice to the user
481  it is imperative to reference application code, not library code.
482  </para>
483  <para>
484  Currently we are using the libc <code>backtrace</code> routine to get
485  stack traces.
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code> can be set
487  to 0 if you are willing to give up call context information, or to a small
488  positive value to reduce run time overhead.
489  </para>
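  <para>
  A minimal sketch of capturing a call stack with the libc
  <code>backtrace</code> routine is shown below; the helper function is ours,
  and the depth in the usage comment simply mirrors the default value of
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>.
<programlisting>
#include &lt;execinfo.h&gt;   // glibc backtrace()

// Fill __frames with up to __max_depth return addresses and return the
// number actually captured.
inline int
__capture_stack(void** __frames, int __max_depth)
{ return backtrace(__frames, __max_depth); }

// Typical use:
//   void* __frames[32];
//   int __depth = __capture_stack(__frames, 32);
</programlisting>
  </para>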
490</section>
491
492
493<section xml:id="manual.ext.profile_mode.implementation.symbols" xreflabel="Symbolization"><info><title>Symbolization of Instruction Addresses</title></info>
494
495  <para>
496  The profiling and analysis phases use only instruction addresses.
497  An external utility such as addr2line is needed to postprocess the result.
498  We do not plan to add symbolization support in the profile extension.
499  This would require access to symbol tables, debug information tables,
500  external programs or libraries and other system dependent information.
501  </para>
502</section>
503
504
505<section xml:id="manual.ext.profile_mode.implementation.concurrency" xreflabel="Concurrency"><info><title>Concurrency</title></info>
506
507  <para>
508  Our current model is simplistic, but precise.
509  We cannot afford to approximate because some of our diagnostics require
510  precise matching of operations to container instance and call context.
511  During profiling, we keep a single information table per diagnostic.
512  There is a single lock per information table.
513  </para>
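  <para>
  The granularity can be pictured as one lock wrapped around each
  per-diagnostic table, along the lines of the illustrative types below.
<programlisting>
#include &lt;pthread.h&gt;

// One information table per diagnostic, guarded by a single lock
// (initialize the lock with pthread_mutex_init or
// PTHREAD_MUTEX_INITIALIZER before first use).
template&lt;typename _Table&gt;
struct __locked_table
{
  pthread_mutex_t _M_lock;
  _Table          _M_table;
};
</programlisting>
  </para>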
514</section>
515
516
517<section xml:id="manual.ext.profile_mode.implementation.stdlib-in-proflib" xreflabel="Using the Standard Library in the Runtime Library"><info><title>Using the Standard Library in the Instrumentation Implementation</title></info>
518
519  <para>
520  As much as we would like to avoid uses of libstdc++ within our
  instrumentation library, containers such as <code>unordered_map</code> are very
522  appealing.  We plan to use them as long as they are named properly
523  to avoid ambiguity.
524  </para>
525</section>
526
527
528<section xml:id="manual.ext.profile_mode.implementation.malloc-hooks" xreflabel="Malloc Hooks"><info><title>Malloc Hooks</title></info>
529
530  <para>
531  User applications/libraries can provide malloc hooks.
  When the implementation of the malloc hooks uses libstdc++, there can
533  be an infinite cycle between the profile mode instrumentation and the
534  malloc hook code.
535  </para>
536  <para>
537  We protect against reentrance to the profile mode instrumentation code,
538  which should avoid this problem in most cases.
539  The protection mechanism is thread safe and exception safe.
540  This mechanism does not prevent reentrance to the malloc hook itself,
541  which could still result in deadlock, if, for instance, the malloc hook
542  uses non-recursive locks.
543  XXX: A definitive solution to this problem would be for the profile extension
544  to use a custom allocator internally, and perhaps not to use libstdc++.
545  </para>
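  <para>
  The guard can be pictured as the thread-local flag below; this is an
  illustrative sketch, not the actual implementation.
<programlisting>
// Nested entries into the instrumentation (for example via a malloc hook
// that itself uses libstdc++) observe _S_inside == true and become no-ops.
// RAII makes the guard exception safe.
struct __reentrance_guard
{
  static __thread bool _S_inside;   // GCC thread-local storage

  bool _M_first;

  __reentrance_guard() : _M_first(!_S_inside)
  { if (_M_first) _S_inside = true; }

  ~__reentrance_guard()
  { if (_M_first) _S_inside = false; }

  bool __entered() const { return _M_first; }
};

__thread bool __reentrance_guard::_S_inside = false;
</programlisting>
  </para>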
546</section>
547
548
549<section xml:id="manual.ext.profile_mode.implementation.construction-destruction" xreflabel="Construction and Destruction of Global Objects"><info><title>Construction and Destruction of Global Objects</title></info>
550
551  <para>
552  The profiling library state is initialized at the first call to a profiling
553  method.  This allows us to record the construction of all global objects.
554  However, we cannot do the same at destruction time.  The trace is written
555  by a function registered by <code>atexit</code>, thus invoked by
556  <code>exit</code>.
557  </para>
558</section>
559
560</section>
561
562
563<section xml:id="manual.ext.profile_mode.developer" xreflabel="Developer Information"><info><title>Developer Information</title></info>
564<?dbhtml filename="profile_mode_devel.html"?>
565
566
567<section xml:id="manual.ext.profile_mode.developer.bigpic" xreflabel="Big Picture"><info><title>Big Picture</title></info>
568
569
570  <para>The profile mode headers are included with
571   <code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
572   <code>include/std/*</code>.
573  </para>
574
575  <para>Instrumented implementations are provided in
576   <code>include/profile/*</code>.  All instrumentation hooks are macros
   defined in <code>include/profile/impl/profiler.h</code>.
578  </para>
579
580  <para>All the implementation of the instrumentation hooks is in
581   <code>include/profile/impl/*</code>.  Although all the code gets included,
   and is thus publicly visible, only a small number of functions are called from
583   outside this directory.  All calls to hook implementations must be
584   done through macros defined in <code>profiler.h</code>.  The macro
585   must ensure (1) that the call is guarded against reentrance and
586   (2) that the call can be turned off at compile time using a
587   <code>-D_GLIBCXX_PROFILE_...</code> compiler option.
588  </para>
589
590</section>
591
592<section xml:id="manual.ext.profile_mode.developer.howto" xreflabel="How To Add A Diagnostic"><info><title>How To Add A Diagnostic</title></info>
593
594
595  <para>Let's say the diagnostic name is "magic".
596  </para>
597
598  <para>If you need to instrument a header not already under
599   <code>include/profile/*</code>, first edit the corresponding header
600   under <code>include/std/</code> and add a preprocessor directive such
601   as the one in <code>include/std/vector</code>:
602<programlisting>
603#ifdef _GLIBCXX_PROFILE
604# include &lt;profile/vector&gt;
605#endif
606</programlisting>
607  </para>
608
609  <para>If the file you need to instrument is not yet under
610   <code>include/profile/</code>, make a copy of the one in
611   <code>include/debug</code>, or the main implementation.
612   You'll need to include the main implementation and inherit the classes
613   you want to instrument.  Then define the methods you want to instrument,
614   define the instrumentation hooks and add calls to them.
615   Look at <code>include/profile/vector</code> for an example.
616  </para>
617
618  <para>Add macros for the instrumentation hooks in
619   <code>include/profile/impl/profiler.h</code>.
620   Hook names must start with <code>__profcxx_</code>.
   Make sure they expand
   to no code with <code>-D_NO_GLIBCXX_PROFILE_MAGIC</code>.
   Make sure all calls to any method in namespace <code>__gnu_profile</code>
   are protected against reentrance using macro
625   <code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
626   All names of methods in namespace <code>__gnu_profile</code> called from
627   <code>profiler.h</code> must start with <code>__trace_magic_</code>.
628  </para>
629
630  <para>Add the implementation of the diagnostic.
631   <itemizedlist>
632     <listitem><para>
633      Create new file <code>include/profile/impl/profiler_magic.h</code>.
634     </para></listitem>
635     <listitem><para>
636      Define class <code>__magic_info: public __object_info_base</code>.
637      This is the representation of a line in the object table.
638      The <code>__merge</code> method is used to aggregate information
639      across all dynamic instances created at the same call context.
      The <code>__magnitude</code> method must return the estimated benefit
      as a number of small operations, e.g., the number of words copied.
      The <code>__write</code> method is used to produce the raw trace.
      The <code>__advice</code> method is used to produce the advice string.
      (A skeletal sketch of these classes appears after this list.)
644     </para></listitem>
645     <listitem><para>
646      Define class <code>__magic_stack_info: public __magic_info</code>.
647      This defines the content of a line in the stack table.
648     </para></listitem>
649     <listitem><para>
650      Define class <code>__trace_magic: public __trace_base&lt;__magic_info,
651      __magic_stack_info&gt;</code>.
652      It defines the content of the trace associated with this diagnostic.
653     </para></listitem>
654    </itemizedlist>
655  </para>
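  <para>
  A skeletal sketch of these classes is shown below.  The member signatures
  are indicative only; follow an existing diagnostic such as the
  vector-to-list one (<code>__trace_vector_to_list</code>) for the exact
  interfaces.
<programlisting>
// Skeleton for include/profile/impl/profiler_magic.h (signatures are
// illustrative; __object_info_base and __trace_base come from the
// existing profiler implementation headers).
#include &lt;cstdio&gt;
#include &lt;string&gt;

namespace __gnu_profile
{
  class __magic_info : public __object_info_base
  {
  public:
    void  __merge(const __magic_info&amp; __o); // aggregate per call context
    float __magnitude() const;               // benefit, in small operations
    void  __write(FILE* __f) const;          // raw trace record
    std::string __advice() const;            // human readable advice text
  };

  class __magic_stack_info : public __magic_info { };

  class __trace_magic
  : public __trace_base&lt;__magic_info, __magic_stack_info&gt; { };
}
</programlisting>
  </para>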
656
657  <para>Add initialization and reporting calls in
658   <code>include/profile/impl/profiler_trace.h</code>.  Use
659   <code>__trace_vector_to_list</code> as an example.
660  </para>
661
662  <para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
663  </para>
664</section>
665</section>
666
667<section xml:id="manual.ext.profile_mode.diagnostics"><info><title>Diagnostics</title></info>
668<?dbhtml filename="profile_mode_diagnostics.html"?>
669
670
671  <para>
672  The table below presents all the diagnostics we intend to implement.
673  Each diagnostic has a corresponding compile time switch
674  <code>-D_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
675  Groups of related diagnostics can be turned on with a single switch.
676  For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
677  <code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
678  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
679  </para>
680
681  <para>
682  The benefit, cost, expected frequency and accuracy of each diagnostic
  were each given a grade from 1 to 10, where 10 is highest.
684  A high benefit means that, if the diagnostic is accurate, the expected
685  performance improvement is high.
686  A high cost means that turning this diagnostic on leads to high slowdown.
687  A high frequency means that we expect this to occur relatively often.
688  A high accuracy means that the diagnostic is unlikely to be wrong.
689  These grades are not perfect.  They are just meant to guide users with
690  specific needs or time budgets.
691  </para>
692
693<table frame="all" xml:id="table.profile_diagnostics">
694<title>Profile Diagnostics</title>
695
696<tgroup cols="7" align="left" colsep="1" rowsep="1">
697<colspec colname="c1"/>
698<colspec colname="c2"/>
699<colspec colname="c3"/>
700<colspec colname="c4"/>
701<colspec colname="c5"/>
702<colspec colname="c6"/>
703<colspec colname="c7"/>
704
705<thead>
706  <row>
707    <entry>Group</entry>
708    <entry>Flag</entry>
709    <entry>Benefit</entry>
710    <entry>Cost</entry>
    <entry>Freq.</entry>
    <entry>Accuracy</entry>
    <entry>Implemented</entry>
713  </row>
714</thead>
715<tbody>
716  <row>
717    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.containers">
718    CONTAINERS</link></entry>
719    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_small">
720    HASHTABLE_TOO_SMALL</link></entry>
721    <entry>10</entry>
722    <entry>1</entry>
723    <entry/>
724    <entry>10</entry>
725    <entry>yes</entry>
726  </row>
727  <row>
728    <entry/>
729    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_large">
730    HASHTABLE_TOO_LARGE</link></entry>
731    <entry>5</entry>
732    <entry>1</entry>
733    <entry/>
734    <entry>10</entry>
735    <entry>yes</entry>
736  </row>
737  <row>
738    <entry/>
739    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.inefficient_hash">
740    INEFFICIENT_HASH</link></entry>
741    <entry>7</entry>
742    <entry>3</entry>
743    <entry/>
744    <entry>10</entry>
745    <entry>yes</entry>
746  </row>
747  <row>
748    <entry/>
749    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_too_small">
750    VECTOR_TOO_SMALL</link></entry>
751    <entry>8</entry>
752    <entry>1</entry>
753    <entry/>
754    <entry>10</entry>
755    <entry>yes</entry>
756  </row>
757  <row>
758    <entry/>
759    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_too_large">
760    VECTOR_TOO_LARGE</link></entry>
761    <entry>5</entry>
762    <entry>1</entry>
763    <entry/>
764    <entry>10</entry>
765    <entry>yes</entry>
766  </row>
767  <row>
768    <entry/>
769    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_hashtable">
770    VECTOR_TO_HASHTABLE</link></entry>
771    <entry>7</entry>
772    <entry>7</entry>
773    <entry/>
774    <entry>10</entry>
775    <entry>no</entry>
776  </row>
777  <row>
778    <entry/>
779    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_to_vector">
780    HASHTABLE_TO_VECTOR</link></entry>
781    <entry>7</entry>
782    <entry>7</entry>
783    <entry/>
784    <entry>10</entry>
785    <entry>no</entry>
786  </row>
787  <row>
788    <entry/>
789    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_list">
790    VECTOR_TO_LIST</link></entry>
791    <entry>8</entry>
792    <entry>5</entry>
793    <entry/>
794    <entry>10</entry>
795    <entry>yes</entry>
796  </row>
797  <row>
798    <entry/>
799    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.list_to_vector">
800    LIST_TO_VECTOR</link></entry>
801    <entry>10</entry>
802    <entry>5</entry>
803    <entry/>
804    <entry>10</entry>
805    <entry>no</entry>
806  </row>
807  <row>
808    <entry/>
809    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
810    ORDERED_TO_UNORDERED</link></entry>
811    <entry>10</entry>
812    <entry>5</entry>
813    <entry/>
814    <entry>10</entry>
815    <entry>only map/unordered_map</entry>
816  </row>
817  <row>
818    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms">
819    ALGORITHMS</link></entry>
820    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms.sort">
821    SORT</link></entry>
822    <entry>7</entry>
823    <entry>8</entry>
824    <entry/>
825    <entry>7</entry>
826    <entry>no</entry>
827  </row>
828  <row>
829    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality">
830    LOCALITY</link></entry>
831    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.sw_prefetch">
832    SOFTWARE_PREFETCH</link></entry>
833    <entry>8</entry>
834    <entry>8</entry>
835    <entry/>
836    <entry>5</entry>
837    <entry>no</entry>
838  </row>
839  <row>
840    <entry/>
841    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.linked">
842    RBTREE_LOCALITY</link></entry>
843    <entry>4</entry>
844    <entry>8</entry>
845    <entry/>
846    <entry>5</entry>
847    <entry>no</entry>
848  </row>
849  <row>
850    <entry/>
851    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.mthread.false_share">
852    FALSE_SHARING</link></entry>
853    <entry>8</entry>
854    <entry>10</entry>
855    <entry/>
856    <entry>10</entry>
857    <entry>no</entry>
858  </row>
859</tbody>
860</tgroup>
861</table>
862
863<section xml:id="manual.ext.profile_mode.analysis.template" xreflabel="Template"><info><title>Diagnostic Template</title></info>
864
865<itemizedlist>
866  <listitem><para><emphasis>Switch:</emphasis>
867  <code>_GLIBCXX_PROFILE_&lt;diagnostic&gt;</code>.
868  </para></listitem>
869  <listitem><para><emphasis>Goal:</emphasis>  What problem will it diagnose?
870  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  What is the fundamental reason why this is a problem?</para></listitem>
873  <listitem><para><emphasis>Sample runtime reduction:</emphasis>
874  Percentage reduction in execution time.  When reduction is more than
875  a constant factor, describe the reduction rate formula.
876  </para></listitem>
877  <listitem><para><emphasis>Recommendation:</emphasis>
  What would the advice look like?</para></listitem>
879  <listitem><para><emphasis>To instrument:</emphasis>
  What libstdc++ components need to be instrumented?</para></listitem>
881  <listitem><para><emphasis>Analysis:</emphasis>
882  How do we decide when to issue the advice?</para></listitem>
883  <listitem><para><emphasis>Cost model:</emphasis>
884  How do we measure benefits?  Math goes here.</para></listitem>
885  <listitem><para><emphasis>Example:</emphasis>
886<programlisting>
887program code
888...
889advice sample
890</programlisting>
891</para></listitem>
892</itemizedlist>
893</section>
894
895
896<section xml:id="manual.ext.profile_mode.analysis.containers" xreflabel="Containers"><info><title>Containers</title></info>
897
898
899<para>
900<emphasis>Switch:</emphasis>
901  <code>_GLIBCXX_PROFILE_CONTAINERS</code>.
902</para>
903
904<section xml:id="manual.ext.profile_mode.analysis.hashtable_too_small" xreflabel="Hashtable Too Small"><info><title>Hashtable Too Small</title></info>
905
906<itemizedlist>
907  <listitem><para><emphasis>Switch:</emphasis>
908  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>.
909  </para></listitem>
910  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many
911  rehash operations, small construction size and large destruction size.
912  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Rehashing is very
  expensive: the table contents are read, chains within buckets are followed,
  the hash function is re-evaluated and elements are placed at new locations,
  in a different order.</para></listitem>
916  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%.
917  Code similar to example below.
918  </para></listitem>
919  <listitem><para><emphasis>Recommendation:</emphasis>
920  Set initial size to N at construction site S.
921  </para></listitem>
922  <listitem><para><emphasis>To instrument:</emphasis>
923  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
924  </para></listitem>
925  <listitem><para><emphasis>Analysis:</emphasis>
926  For each dynamic instance of <code>unordered_[multi]set|map</code>,
927  record initial size and call context of the constructor.
928  Record size increase, if any, after each relevant operation such as insert.
929  Record the estimated rehash cost.</para></listitem>
930  <listitem><para><emphasis>Cost model:</emphasis>
931  Number of individual rehash operations * cost per rehash.</para></listitem>
932  <listitem><para><emphasis>Example:</emphasis>
933<programlisting>
9341 unordered_set&lt;int&gt; us;
9352 for (int k = 0; k &lt; 1000000; ++k) {
9363   us.insert(k);
9374 }
938
939foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
940</programlisting>
941</para></listitem>
942</itemizedlist>
943</section>
944
945
946<section xml:id="manual.ext.profile_mode.analysis.hashtable_too_large" xreflabel="Hashtable Too Large"><info><title>Hashtable Too Large</title></info>
947
948<itemizedlist>
949  <listitem><para><emphasis>Switch:</emphasis>
950  <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>.
951  </para></listitem>
952  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are
953  never filled up because fewer elements than reserved are ever
954  inserted.
955  </para></listitem>
956  <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
957  is good in itself and may also improve memory reference performance through
958  fewer cache and TLB misses.</para></listitem>
959  <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
960  </para></listitem>
961  <listitem><para><emphasis>Recommendation:</emphasis>
962  Set initial size to N at construction site S.
963  </para></listitem>
964  <listitem><para><emphasis>To instrument:</emphasis>
965  <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
966  </para></listitem>
967  <listitem><para><emphasis>Analysis:</emphasis>
968  For each dynamic instance of <code>unordered_[multi]set|map</code>,
969  record initial size and call context of the constructor, and correlate it
970  with its size at destruction time.
971  </para></listitem>
972  <listitem><para><emphasis>Cost model:</emphasis>
973  Number of iteration operations + memory saved.</para></listitem>
974  <listitem><para><emphasis>Example:</emphasis>
975<programlisting>
9761 vector&lt;unordered_set&lt;int&gt;&gt; v(100000, unordered_set&lt;int&gt;(100)) ;
9772 for (int k = 0; k &lt; 100000; ++k) {
9783   for (int j = 0; j &lt; 10; ++j) {
9794     v[k].insert(k + j);
9805  }
9816 }
982
983foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
984bytes of memory and M iteration steps.
985</programlisting>
986</para></listitem>
987</itemizedlist>
988</section>
989
990<section xml:id="manual.ext.profile_mode.analysis.inefficient_hash" xreflabel="Inefficient Hash"><info><title>Inefficient Hash</title></info>
991
992<itemizedlist>
993  <listitem><para><emphasis>Switch:</emphasis>
994  <code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>.
995  </para></listitem>
996  <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with polarized
997  distribution.
998  </para></listitem>
999  <listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform
1000  distribution may lead to long chains, thus possibly increasing complexity
1001  by a factor up to the number of elements.
1002  </para></listitem>
1003  <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
1004   to container size.
1005  </para></listitem>
1006  <listitem><para><emphasis>Recommendation:</emphasis> Change hash function
1007  for container built at site S.  Distribution score = N.  Access score = S.
1008  Longest chain = C, in bucket B.
1009  </para></listitem>
1010  <listitem><para><emphasis>To instrument:</emphasis>
1011  <code>unordered_set, unordered_map</code> constructor, destructor, [],
1012  insert, iterator.
1013  </para></listitem>
1014  <listitem><para><emphasis>Analysis:</emphasis>
1015  Count the exact number of link traversals.
1016  </para></listitem>
1017  <listitem><para><emphasis>Cost model:</emphasis>
1018  Total number of links traversed.</para></listitem>
1019  <listitem><para><emphasis>Example:</emphasis>
1020<programlisting>
1021class dumb_hash {
1022 public:
1023  size_t operator() (int i) const { return 0; }
1024};
1025...
1026  unordered_set&lt;int, dumb_hash&gt; hs;
1027  ...
1028  for (int i = 0; i &lt; COUNT; ++i) {
1029    hs.find(i);
1030  }
1031</programlisting>
1032</para></listitem>
1033</itemizedlist>
1034</section>
1035
1036<section xml:id="manual.ext.profile_mode.analysis.vector_too_small" xreflabel="Vector Too Small"><info><title>Vector Too Small</title></info>
1037
1038<itemizedlist>
1039  <listitem><para><emphasis>Switch:</emphasis>
1040  <code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>.
1041  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect vectors with many
  resize operations, small construction size and large destruction size.
1044  </para></listitem>
1045  <listitem><para><emphasis>Fundamentals:</emphasis>Resizing can be expensive.
1046  Copying large amounts of data takes time.  Resizing many small vectors may
1047  have allocation overhead and affect locality.</para></listitem>
1048  <listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
1049  </para></listitem>
1050  <listitem><para><emphasis>Recommendation:</emphasis>
1051  Set initial size to N at construction site S.</para></listitem>
1052  <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>.
1053  </para></listitem>
1054  <listitem><para><emphasis>Analysis:</emphasis>
1055  For each dynamic instance of <code>vector</code>,
1056  record initial size and call context of the constructor.
1057  Record size increase, if any, after each relevant operation such as
1058  <code>push_back</code>.  Record the estimated resize cost.
1059  </para></listitem>
1060  <listitem><para><emphasis>Cost model:</emphasis>
1061  Total number of words copied * time to copy a word.</para></listitem>
1062  <listitem><para><emphasis>Example:</emphasis>
1063<programlisting>
10641 vector&lt;int&gt; v;
10652 for (int k = 0; k &lt; 1000000; ++k) {
10663   v.push_back(k);
10674 }
1068
1069foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
1070copying 4000000 bytes and 20 memory allocations and deallocations.
1071</programlisting>
1072</para></listitem>
1073</itemizedlist>
1074</section>
1075
1076<section xml:id="manual.ext.profile_mode.analysis.vector_too_large" xreflabel="Vector Too Large"><info><title>Vector Too Large</title></info>
1077
1078<itemizedlist>
1079  <listitem><para><emphasis>Switch:</emphasis>
1080  <code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code>
1081  </para></listitem>
1082  <listitem><para><emphasis>Goal:</emphasis>Detect vectors which are
1083  never filled up because fewer elements than reserved are ever
1084  inserted.
1085  </para></listitem>
1086  <listitem><para><emphasis>Fundamentals:</emphasis>Save memory, which
1087  is good in itself and may also improve memory reference performance through
1088  fewer cache and TLB misses.</para></listitem>
1089  <listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
1090  </para></listitem>
1091  <listitem><para><emphasis>Recommendation:</emphasis>
1092  Set initial size to N at construction site S.</para></listitem>
1093  <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>.
1094  </para></listitem>
1095  <listitem><para><emphasis>Analysis:</emphasis>
1096  For each dynamic instance of <code>vector</code>,
1097  record initial size and call context of the constructor, and correlate it
1098  with its size at destruction time.</para></listitem>
1099  <listitem><para><emphasis>Cost model:</emphasis>
1100  Total amount of memory saved.</para></listitem>
1101  <listitem><para><emphasis>Example:</emphasis>
1102<programlisting>
11031 vector&lt;vector&lt;int&gt;&gt; v(100000, vector&lt;int&gt;(100)) ;
11042 for (int k = 0; k &lt; 100000; ++k) {
11053   for (int j = 0; j &lt; 10; ++j) {
4     v[k].push_back(k + j);
11075  }
11086 }
1109
1110foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
1111bytes of memory and may reduce the number of cache and TLB misses.
1112</programlisting>
1113</para></listitem>
1114</itemizedlist>
1115</section>
1116
1117<section xml:id="manual.ext.profile_mode.analysis.vector_to_hashtable" xreflabel="Vector to Hashtable"><info><title>Vector to Hashtable</title></info>
1118
1119<itemizedlist>
1120  <listitem><para><emphasis>Switch:</emphasis>
1121  <code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>.
1122  </para></listitem>
1123  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1124  <code>vector</code> that can be substituted with <code>unordered_set</code>
1125  to reduce execution time.
1126  </para></listitem>
1127  <listitem><para><emphasis>Fundamentals:</emphasis>
1128  Linear search in a vector is very expensive, whereas searching in a hashtable
1129  is very quick.</para></listitem>
1130  <listitem><para><emphasis>Sample runtime reduction:</emphasis>factor up
1131   to container size.
1132  </para></listitem>
1133  <listitem><para><emphasis>Recommendation:</emphasis>Replace
1134  <code>vector</code> with <code>unordered_set</code> at site S.
1135  </para></listitem>
1136  <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>
1137  operations and access methods.</para></listitem>
1138  <listitem><para><emphasis>Analysis:</emphasis>
1139  For each dynamic instance of <code>vector</code>,
1140  record call context of the constructor.  Issue the advice only if the
1141  only methods called on this <code>vector</code> are <code>push_back</code>,
1142  <code>insert</code> and <code>find</code>.
1143  </para></listitem>
1144  <listitem><para><emphasis>Cost model:</emphasis>
1145  Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -
  cost(unordered_set::insert) - cost(unordered_set::find).
1147  </para></listitem>
1148  <listitem><para><emphasis>Example:</emphasis>
1149<programlisting>
11501  vector&lt;int&gt; v;
1151...
11522  for (int i = 0; i &lt; 1000; ++i) {
11533    find(v.begin(), v.end(), i);
11544  }
1155
1156foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
1157comparisons.
1158</programlisting>
1159</para></listitem>
1160</itemizedlist>
1161</section>
1162
1163<section xml:id="manual.ext.profile_mode.analysis.hashtable_to_vector" xreflabel="Hashtable to Vector"><info><title>Hashtable to Vector</title></info>
1164
1165<itemizedlist>
1166  <listitem><para><emphasis>Switch:</emphasis>
1167  <code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>.
1168  </para></listitem>
1169  <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1170  <code>unordered_set</code> that can be substituted with <code>vector</code>
1171  to reduce execution time.
1172  </para></listitem>
1173  <listitem><para><emphasis>Fundamentals:</emphasis>
1174  Hashtable iterator is slower than vector iterator.</para></listitem>
1175  <listitem><para><emphasis>Sample runtime reduction:</emphasis>95%.
1176  </para></listitem>
1177  <listitem><para><emphasis>Recommendation:</emphasis>Replace
1178  <code>unordered_set</code> with <code>vector</code> at site S.
1179  </para></listitem>
1180  <listitem><para><emphasis>To instrument:</emphasis><code>unordered_set</code>
1181  operations and access methods.</para></listitem>
1182  <listitem><para><emphasis>Analysis:</emphasis>
1183  For each dynamic instance of <code>unordered_set</code>,
1184  record call context of the constructor.  Issue the advice only if the
1185  number of <code>find</code>, <code>insert</code> and <code>[]</code>
1186  operations on this <code>unordered_set</code> are small relative to the
1187  number of elements, and methods <code>begin</code> or <code>end</code>
1188  are invoked (suggesting iteration).</para></listitem>
1189  <listitem><para><emphasis>Cost model:</emphasis>
  Number of indirect memory references saved during iteration.</para></listitem>
1191  <listitem><para><emphasis>Example:</emphasis>
1192<programlisting>
11931  unordered_set&lt;int&gt; us;
1194...
11952  int s = 0;
11963  for (unordered_set&lt;int&gt;::iterator it = us.begin(); it != us.end(); ++it) {
11974    s += *it;
11985  }
1199
1200foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
1201indirections and may achieve better data locality.
1202</programlisting>
1203</para></listitem>
1204</itemizedlist>
1205</section>
1206
1207<section xml:id="manual.ext.profile_mode.analysis.vector_to_list" xreflabel="Vector to List"><info><title>Vector to List</title></info>
1208
1209<itemizedlist>
1210  <listitem><para><emphasis>Switch:</emphasis>
1211  <code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>.
1212  </para></listitem>
1213  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1214  <code>vector</code> could be substituted with <code>list</code> for
1215  better performance.
1216  </para></listitem>
1217  <listitem><para><emphasis>Fundamentals:</emphasis>
1218  Inserting in the middle of a vector is expensive compared to inserting in a
1219  list.
1220  </para></listitem>
1221  <listitem><para><emphasis>Sample runtime reduction:</emphasis>factor up to
1222   container size.
1223  </para></listitem>
1224  <listitem><para><emphasis>Recommendation:</emphasis>Replace vector with list
1225  at site S.</para></listitem>
1226  <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>
1227  operations and access methods.</para></listitem>
1228  <listitem><para><emphasis>Analysis:</emphasis>
1229  For each dynamic instance of <code>vector</code>,
1230  record the call context of the constructor.  Record the overhead of each
1231  <code>insert</code> operation based on current size and insert position.
1232  Report instance with high insertion overhead.
1233  </para></listitem>
1234  <listitem><para><emphasis>Cost model:</emphasis>
1235  (Sum(cost(vector::method)) - Sum(cost(list::method)), for
1236  method in [push_back, insert, erase])
1237  + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
1238  <listitem><para><emphasis>Example:</emphasis>
1239<programlisting>
12401  vector&lt;int&gt; v;
12412  for (int i = 0; i &lt; 10000; ++i) {
12423    v.insert(v.begin(), i);
12434  }
1244
1245foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
1246operations.
1247</programlisting>
1248</para></listitem>
1249</itemizedlist>
1250</section>
1251
1252<section xml:id="manual.ext.profile_mode.analysis.list_to_vector" xreflabel="List to Vector"><info><title>List to Vector</title></info>
1253
1254<itemizedlist>
1255  <listitem><para><emphasis>Switch:</emphasis>
1256  <code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>.
1257  </para></listitem>
1258  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1259  <code>list</code> could be substituted with <code>vector</code> for
1260  better performance.
1261  </para></listitem>
1262  <listitem><para><emphasis>Fundamentals:</emphasis>
1263  Iterating through a vector is faster than through a list.
1264  </para></listitem>
1265  <listitem><para><emphasis>Sample runtime reduction:</emphasis>64%.
1266  </para></listitem>
1267  <listitem><para><emphasis>Recommendation:</emphasis>Replace list with vector
1268  at site S.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis><code>list</code>
1270  operations and access methods.</para></listitem>
1271  <listitem><para><emphasis>Analysis:</emphasis>
1272  Issue the advice if there are no <code>insert</code> operations.
1273  </para></listitem>
1274  <listitem><para><emphasis>Cost model:</emphasis>
    (Sum(cost(list::method)) - Sum(cost(vector::method)), for
  method in [push_back, insert, erase])
  + (Cost(iterate list) - Cost(iterate vector))</para></listitem>
1278  <listitem><para><emphasis>Example:</emphasis>
1279<programlisting>
12801  list&lt;int&gt; l;
1281...
12822  int sum = 0;
12833  for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
12844    sum += *it;
12855  }
1286
1287foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
1288memory references.
1289</programlisting>
1290</para></listitem>
1291</itemizedlist>
1292</section>
1293
1294<section xml:id="manual.ext.profile_mode.analysis.list_to_slist" xreflabel="List to Forward List"><info><title>List to Forward List (Slist)</title></info>
1295
1296<itemizedlist>
1297  <listitem><para><emphasis>Switch:</emphasis>
1298  <code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>.
1299  </para></listitem>
1300  <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1301  <code>list</code> could be substituted with <code>forward_list</code> for
1302  better performance.
1303  </para></listitem>
1304  <listitem><para><emphasis>Fundamentals:</emphasis>
1305  The memory footprint of a forward_list is smaller than that of a list.
1306  This has beneficial effects on memory subsystem, e.g., fewer cache misses.
1307  </para></listitem>
1308  <listitem><para><emphasis>Sample runtime reduction:</emphasis>40%.
  Note that the reduction is only noticeable if the size of the forward_list
  node is in fact smaller than that of the list node.  For memory allocators
1311  with size classes, you will only notice an effect when the two node sizes
1312  belong to different allocator size classes.
1313  </para></listitem>
1314  <listitem><para><emphasis>Recommendation:</emphasis>Replace list with
1315  forward_list at site S.</para></listitem>
1316  <listitem><para><emphasis>To instrument:</emphasis><code>list</code>
1317  operations and iteration methods.</para></listitem>
1318  <listitem><para><emphasis>Analysis:</emphasis>
  Issue the advice if there are no backwards traversals
  or insertions before a given node.
1321  </para></listitem>
1322  <listitem><para><emphasis>Cost model:</emphasis>
1323  Always true.</para></listitem>
1324  <listitem><para><emphasis>Example:</emphasis>
1325<programlisting>
13261  list&lt;int&gt; l;
1327...
13282  int sum = 0;
13293  for (list&lt;int&gt;::iterator it = l.begin(); it != l.end(); ++it) {
13304    sum += *it;
13315  }
1332
1333foo.cc:1: advice: Change "list" to "forward_list".
1334</programlisting>
1335</para></listitem>
1336</itemizedlist>
1337</section>
1338
1339<section xml:id="manual.ext.profile_mode.analysis.assoc_ord_to_unord" xreflabel="Ordered to Unordered Associative Container"><info><title>Ordered to Unordered Associative Container</title></info>
1340
1341<itemizedlist>
1342  <listitem><para><emphasis>Switch:</emphasis>
1343  <code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>.
1344  </para></listitem>
1345  <listitem><para><emphasis>Goal:</emphasis>  Detect cases where ordered
1346  associative containers can be replaced with unordered ones.
1347  </para></listitem>
1348  <listitem><para><emphasis>Fundamentals:</emphasis>
1349  Insert and search are quicker in a hashtable than in
1350  a red-black tree.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 52%.
1352  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis>
  Replace set with unordered_set at site S.  (A sketch of the replacement
  follows this list.)</para></listitem>
1355  <listitem><para><emphasis>To instrument:</emphasis>
1356  <code>set</code>, <code>multiset</code>, <code>map</code>,
1357  <code>multimap</code> methods.</para></listitem>
1358  <listitem><para><emphasis>Analysis:</emphasis>
1359  Issue the advice only if we are not using operator <code>++</code> on any
1360  iterator on a particular <code>[multi]set|map</code>.
1361  </para></listitem>
1362  <listitem><para><emphasis>Cost model:</emphasis>
1363  (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for
1364  method in [insert, erase, find])
1365  + (Cost(iterate hashtable) - Cost(iterate rbtree))</para></listitem>
1366  <listitem><para><emphasis>Example:</emphasis>
1367<programlisting>
13681  set&lt;int&gt; s;
13692  for (int i = 0; i &lt; 100000; ++i) {
13703    s.insert(i);
13714  }
13725  int sum = 0;
13736  for (int i = 0; i &lt; 100000; ++i) {
13747    sum += *s.find(i);
13758  }
1376</programlisting>
1377</para></listitem>
1378</itemizedlist>
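  <para>
  For illustration, the example above after applying the advice.  The rewrite
  is valid only because the code never iterates the container in sorted order,
  which is exactly what the analysis checks by looking for <code>++</code> on
  iterators; the function wrapper is ours:
<programlisting>
#include &lt;unordered_set&gt;

int lookup_sum()
{
  std::unordered_set&lt;int&gt; s;        // was: set&lt;int&gt; s;
  for (int i = 0; i &lt; 100000; ++i)
    s.insert(i);                      // average O(1) insert instead of O(log n)
  int sum = 0;
  for (int i = 0; i &lt; 100000; ++i)
    sum += *s.find(i);                // average O(1) lookup
  return sum;
}
</programlisting>
  </para>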
1379</section>
1380
1381</section>
1382
1383
1384
1385<section xml:id="manual.ext.profile_mode.analysis.algorithms" xreflabel="Algorithms"><info><title>Algorithms</title></info>
1386
1387
1388  <para><emphasis>Switch:</emphasis>
1389  <code>_GLIBCXX_PROFILE_ALGORITHMS</code>.
1390  </para>
1391
1392<section xml:id="manual.ext.profile_mode.analysis.algorithms.sort" xreflabel="Sorting"><info><title>Sort Algorithm Performance</title></info>
1393
1394<itemizedlist>
1395  <listitem><para><emphasis>Switch:</emphasis>
1396  <code>_GLIBCXX_PROFILE_SORT</code>.
1397  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Give a measure of sort algorithm
  performance based on the actual input.  For instance, advise Radix Sort over
1400  Quick Sort for a particular call context.
1401  </para></listitem>
1402  <listitem><para><emphasis>Fundamentals:</emphasis>
1403  See papers:
1404  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://dl.acm.org/citation.cfm?doid=1065944.1065981">
1405  A framework for adaptive algorithm selection in STAPL</link> and
1406  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4228227/">
1407  Optimizing Sorting with Machine Learning Algorithms</link>.
1408  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 60%.
1410  </para></listitem>
1411  <listitem><para><emphasis>Recommendation:</emphasis> Change sort algorithm
1412  at site S from X Sort to Y Sort.</para></listitem>
1413  <listitem><para><emphasis>To instrument:</emphasis> <code>sort</code>
1414  algorithm.</para></listitem>
1415  <listitem><para><emphasis>Analysis:</emphasis>
1416  Issue the advice if the cost model tells us that another sort algorithm
1417  would do better on this input.  Requires us to know what algorithm we
1418  are using in our sort implementation in release mode.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Runtime(algo) for algo in [radix, quick, merge, ...].  (A measurement sketch
  follows this list.)</para></listitem>
1421  <listitem><para><emphasis>Example:</emphasis>
1422<programlisting>
1423</programlisting>
1424</para></listitem>
1425</itemizedlist>
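  <para>
  As a rough sketch of the cost model only: time each candidate algorithm on a
  copy of representative input and prefer the cheaper one.  The helper name and
  the two candidates shown are illustrative assumptions, not part of profile
  mode:
<programlisting>
#include &lt;algorithm&gt;
#include &lt;chrono&gt;
#include &lt;vector&gt;

// Run one candidate on a private copy of the recorded input and return the
// elapsed time in seconds.
double measured_runtime(std::vector&lt;int&gt; copy, void (*algo)(std::vector&lt;int&gt;&amp;))
{
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  algo(copy);
  std::chrono::duration&lt;double&gt; elapsed = std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

void quick_like(std::vector&lt;int&gt;&amp; v) { std::sort(v.begin(), v.end()); }
void merge_like(std::vector&lt;int&gt;&amp; v) { std::stable_sort(v.begin(), v.end()); }

// The advice would name whichever candidate measures cheaper on this input:
//   if (measured_runtime(input, merge_like) &lt; measured_runtime(input, quick_like))
//     ... suggest a merge-sort-based algorithm at this call site ...
</programlisting>
  </para>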
1426</section>
1427
1428</section>
1429
1430
1431<section xml:id="manual.ext.profile_mode.analysis.locality" xreflabel="Data Locality"><info><title>Data Locality</title></info>
1432
1433
1434  <para><emphasis>Switch:</emphasis>
1435  <code>_GLIBCXX_PROFILE_LOCALITY</code>.
1436  </para>
1437
1438<section xml:id="manual.ext.profile_mode.analysis.locality.sw_prefetch" xreflabel="Need Software Prefetch"><info><title>Need Software Prefetch</title></info>
1439
1440<itemizedlist>
1441  <listitem><para><emphasis>Switch:</emphasis>
1442  <code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>.
1443  </para></listitem>
1444  <listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect
  memory accesses that are not regular and thus cannot be predicted by
1446  hardware prefetchers.
1447  </para></listitem>
1448  <listitem><para><emphasis>Fundamentals:</emphasis>
1449  Indirect references are hard to predict and are very expensive when they
1450  miss in caches.</para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 25%.
1452  </para></listitem>
1453  <listitem><para><emphasis>Recommendation:</emphasis> Insert prefetch
1454  instruction.</para></listitem>
1455  <listitem><para><emphasis>To instrument:</emphasis> Vector iterator and
1456  access operator [].
1457  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size and page size from the system.
  Then record iterator dereference sequences for which the value is a pointer.
  For each sequence within a container, issue a warning if successive pointer
  addresses are not within cache lines and do not form a linear pattern
  (otherwise they may be prefetched by hardware).
  If they also step across page boundaries, make the warning stronger.
  (A sketch of this regularity test follows the list.)
  </para>
1466  <para>The same analysis applies to containers other than vector.
1467  However, we cannot give the same advice for linked structures, such as list,
1468  as there is no random access to the n-th element.  The user may still be
  able to benefit from this information, for instance by employing frays
  (user-level lightweight threads) to hide the latency of chasing pointers.
1471  </para>
1472  <para>
1473  This analysis is a little oversimplified.  A better cost model could be
1474  created by understanding the capability of the hardware prefetcher.
1475  This model could be trained automatically by running a set of synthetic
1476  cases.
1477  </para>
1478  </listitem>
1479  <listitem><para><emphasis>Cost model:</emphasis>
1480  Total distance between pointer values of successive elements in vectors
1481  of pointers.</para></listitem>
1482  <listitem><para><emphasis>Example:</emphasis>
1483<programlisting>
14841 int zero = 0;
14852 vector&lt;int*&gt; v(10000000, &amp;zero);
14863 for (int k = 0; k &lt; 10000000; ++k) {
14874   v[random() % 10000000] = new int(k);
14885 }
14896 for (int j = 0; j &lt; 10000000; ++j) {
14907   count += (*v[j] == 0 ? 0 : 1);
14918 }
1492
1493foo.cc:7: advice: Insert prefetch instruction.
1494</programlisting>
1495</para></listitem>
1496</itemizedlist>
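  <para>
  A possible sketch of the regularity test described in the analysis above; the
  function and its arguments are illustrative, and it assumes the cache line
  size has already been obtained from the system:
<programlisting>
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Returns true if the recorded pointer values stay within one cache line or
// advance with a constant stride, i.e. patterns the hardware prefetcher is
// assumed to handle.  Only irregular sequences would trigger the advice.
bool hardware_prefetchable(const std::vector&lt;std::uintptr_t&gt;&amp; addrs,
                           std::size_t line_size)
{
  if (addrs.size() &lt; 3)
    return true;
  std::ptrdiff_t stride = static_cast&lt;std::ptrdiff_t&gt;(addrs[1] - addrs[0]);
  for (std::size_t i = 1; i &lt; addrs.size(); ++i)
    {
      std::ptrdiff_t step = static_cast&lt;std::ptrdiff_t&gt;(addrs[i] - addrs[i - 1]);
      bool same_line = addrs[i] / line_size == addrs[i - 1] / line_size;
      if (!same_line &amp;&amp; step != stride)
        return false;   // irregular access: candidate for software prefetch
    }
  return true;
}
</programlisting>
  </para>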
1497</section>
1498
1499<section xml:id="manual.ext.profile_mode.analysis.locality.linked" xreflabel="Linked Structure Locality"><info><title>Linked Structure Locality</title></info>
1500
1501<itemizedlist>
1502  <listitem><para><emphasis>Switch:</emphasis>
1503  <code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
1504  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Give a measure of the locality of
1506  objects stored in linked structures (lists, red-black trees and hashtables)
1507  with respect to their actual traversal patterns.
1508  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Allocation can be tuned
  to a specific traversal pattern, resulting in better data locality.
1511  See paper:
1512  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://parasol.tamu.edu/publications/download.php?file_id=570">
1513  Custom Memory Allocation for Free</link> by Jula and Rauchwerger.
1514  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 30%.
1516  </para></listitem>
1517  <listitem><para><emphasis>Recommendation:</emphasis>
1518  High scatter score N for container built at site S.
1519  Consider changing allocation sequence or choosing a structure conscious
1520  allocator.</para></listitem>
1521  <listitem><para><emphasis>To instrument:</emphasis> Methods of all
1522  containers using linked structures.</para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size and page size from the system.
  Then, for each traversal method such as <code>find</code>, record the number
  of successive elements that fall on a different cache line or page.  Give
  advice only if the ratio between this number and the total number of node
  hops is above a threshold.  (A sketch of this scatter computation follows
  the list.)</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Sum(!same_cache_line(this, previous))</para></listitem>
1531  <listitem><para><emphasis>Example:</emphasis>
1532<programlisting>
1533 1  set&lt;int&gt; s;
1534 2  for (int i = 0; i &lt; 10000000; ++i) {
1535 3    s.insert(i);
1536 4  }
1537 5  set&lt;int&gt; s1, s2;
1538 6  for (int i = 0; i &lt; 10000000; ++i) {
1539 7    s1.insert(i);
1540 8    s2.insert(i);
1541 9  }
1542...
1543      // Fast, better locality.
154410    for (set&lt;int&gt;::iterator it = s.begin(); it != s.end(); ++it) {
154511      sum += *it;
154612    }
1547      // Slow, elements are further apart.
154813    for (set&lt;int&gt;::iterator it = s1.begin(); it != s1.end(); ++it) {
154914      sum += *it;
155015    }
1551
1552foo.cc:5: advice: High scatter score NNN for set built here.  Consider changing
1553the allocation sequence or switching to a structure conscious allocator.
1554</programlisting>
1555</para></listitem>
1556</itemizedlist>
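  <para>
  A simplified sketch of the scatter computation described in the analysis,
  applied to the node addresses recorded during one traversal; the function
  name and the threshold handling are illustrative:
<programlisting>
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Fraction of node hops that cross a cache line boundary; advice is given
// only when this ratio exceeds some threshold.
double scatter_score(const std::vector&lt;std::uintptr_t&gt;&amp; nodes,
                     std::size_t line_size)
{
  if (nodes.size() &lt; 2)
    return 0.0;
  std::size_t crossings = 0;
  for (std::size_t i = 1; i &lt; nodes.size(); ++i)
    if (nodes[i] / line_size != nodes[i - 1] / line_size)
      ++crossings;
  return static_cast&lt;double&gt;(crossings) / (nodes.size() - 1);
}
</programlisting>
  </para>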
1557</section>
1558
1559</section>
1560
1561
1562<section xml:id="manual.ext.profile_mode.analysis.mthread" xreflabel="Multithreaded Data Access"><info><title>Multithreaded Data Access</title></info>
1563
1564
1565  <para>
  The diagnostics in this group are not meant to be implemented in the short term.
1567  They require compiler support to know when container elements are written
1568  to.  Instrumentation can only tell us when elements are referenced.
1569  </para>
1570
1571  <para><emphasis>Switch:</emphasis>
1572  <code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
1573  </para>
1574
1575<section xml:id="manual.ext.profile_mode.analysis.mthread.ddtest" xreflabel="Dependence Violations at Container Level"><info><title>Data Dependence Violations at Container Level</title></info>
1576
1577<itemizedlist>
1578  <listitem><para><emphasis>Switch:</emphasis>
1579  <code>_GLIBCXX_PROFILE_DDTEST</code>.
1580  </para></listitem>
1581  <listitem><para><emphasis>Goal:</emphasis> Detect container elements
1582  that are referenced from multiple threads in the parallel region or
1583  across parallel regions.
1584  </para></listitem>
1585  <listitem><para><emphasis>Fundamentals:</emphasis>
1586  Sharing data between threads requires communication and perhaps locking,
1587  which may be expensive.
1588  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> ?%.
1590  </para></listitem>
1591  <listitem><para><emphasis>Recommendation:</emphasis> Change data
1592  distribution or parallel algorithm.</para></listitem>
1593  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
1594  and iterators.
1595  </para></listitem>
1596  <listitem><para><emphasis>Analysis:</emphasis>
1597  Keep a shadow for each container.  Record iterator dereferences and
1598  container member accesses.  Issue advice for elements referenced by
  multiple threads.  (A sketch of this shadow bookkeeping follows the list.)
1600  See paper: <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://dl.acm.org/citation.cfm?id=207110.207148">
1601  The LRPD test: speculative run-time parallelization of loops with
1602  privatization and reduction parallelization</link>.
1603  </para></listitem>
1604  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to elements referenced from multiple threads.
1606  </para></listitem>
1607  <listitem><para><emphasis>Example:</emphasis>
1608<programlisting>
1609</programlisting>
1610</para></listitem>
1611</itemizedlist>
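  <para>
  As a rough illustration of the shadow bookkeeping described in the analysis
  (the type and function names are ours, not part of profile mode): for each
  element address, remember which threads referenced it, and report elements
  whose set contains more than one thread.
<programlisting>
#include &lt;cstdint&gt;
#include &lt;map&gt;
#include &lt;set&gt;

// Hypothetical shadow for one container: maps each element address to the
// set of thread ids that referenced it.
typedef std::map&lt;std::uintptr_t, std::set&lt;int&gt; &gt; shadow_t;

void record_access(shadow_t&amp; shadow, std::uintptr_t element, int thread_id)
{
  shadow[element].insert(thread_id);
}

bool referenced_by_multiple_threads(const shadow_t&amp; shadow,
                                    std::uintptr_t element)
{
  shadow_t::const_iterator it = shadow.find(element);
  return it != shadow.end() &amp;&amp; it-&gt;second.size() &gt; 1;
}
</programlisting>
  </para>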
1612</section>
1613
1614<section xml:id="manual.ext.profile_mode.analysis.mthread.false_share" xreflabel="False Sharing"><info><title>False Sharing</title></info>
1615
1616<itemizedlist>
1617  <listitem><para><emphasis>Switch:</emphasis>
1618  <code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
1619  </para></listitem>
1620  <listitem><para><emphasis>Goal:</emphasis> Detect elements in the
1621  same container which share a cache line, are written by at least one
1622  thread, and accessed by different threads.
1623  </para></listitem>
1624  <listitem><para><emphasis>Fundamentals:</emphasis> Under these assumptions,
  cache coherence protocols require
1626  communication to invalidate lines, which may be expensive.
1627  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 68%.
1629  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Reorganize container
  or use padding to avoid false sharing.  (One possible padding fix is sketched
  after this list.)</para></listitem>
1632  <listitem><para><emphasis>To instrument:</emphasis> Container access methods
1633  and iterators.
1634  </para></listitem>
1635  <listitem><para><emphasis>Analysis:</emphasis>
1636  First, get the cache line size.
1637  For each shared container, record all the associated iterator dereferences
1638  and member access methods with the thread id.  Compare the address lists
1639  across threads to detect references in two different threads to the same
1640  cache line.  Issue a warning only if the ratio to total references is
1641  significant.  Do the same for iterator dereference values if they are
1642  pointers.</para></listitem>
1643  <listitem><para><emphasis>Cost model:</emphasis>
1644  Number of accesses to same cache line from different threads.
1645  </para></listitem>
1646  <listitem><para><emphasis>Example:</emphasis>
1647<programlisting>
16481     vector&lt;int&gt; v(2, 0);
16492 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3     for (int i = 0; i &lt; SIZE; ++i) {
16514       v[i % 2] += i;
16525     }
1653
1654OMP_NUM_THREADS=2 ./a.out
1655foo.cc:1: advice: Change container structure or padding to avoid false
1656sharing in multithreaded access at foo.cc:4.  Detected N shared cache lines.
1657</programlisting>
1658</para></listitem>
1659</itemizedlist>
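  <para>
  One possible way to apply the advice to the loop above, sketched for
  illustration: pad the two counters so they fall on different cache lines.
  The 64-byte line size is an assumption made here; profile mode obtains the
  real value at run time.  As in the example, this relies on
  <code>schedule(static, 1)</code> with two threads, so each thread only ever
  writes its own element.
<programlisting>
#include &lt;vector&gt;

struct padded_counter { int value; char pad[64 - sizeof(int)]; };

void accumulate(std::vector&lt;padded_counter&gt;&amp; v, int size)
{
  // Each counter now occupies its own cache line, so the two threads no
  // longer invalidate each other's line on every write.
#pragma omp parallel for shared(v) schedule(static, 1)
  for (int i = 0; i &lt; size; ++i)
    v[i % 2].value += i;
}
</programlisting>
  </para>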
1660</section>
1661
1662</section>
1663
1664
1665<section xml:id="manual.ext.profile_mode.analysis.statistics" xreflabel="Statistics"><info><title>Statistics</title></info>
1666
1667
1668<para>
1669<emphasis>Switch:</emphasis>
1670  <code>_GLIBCXX_PROFILE_STATISTICS</code>.
1671</para>
1672
1673<para>
1674  In some cases the cost model may not tell us anything because the costs
1675  appear to offset the benefits.  Consider the choice between a vector and
  a list.  When there are both inserts and iteration, automatic advice
1677  may not be issued.  However, the programmer may still be able to make use
1678  of this information in a different way.
1679</para>
1680<para>
1681  This diagnostic will not issue any advice, but it will print statistics for
1682  each container construction site.  The statistics will contain the cost
1683  of each operation actually performed on the container.
1684</para>
1685
1686</section>
1687
1688
1689</section>
1690
1691
1692<bibliography xml:id="profile_mode.biblio"><info><title>Bibliography</title></info>
1693
1694
1695  <biblioentry>
1696    <citetitle>
1697      Perflint: A Context Sensitive Performance Advisor for C++ Programs
1698    </citetitle>
1699
1700    <author><personname><firstname>Lixia</firstname><surname>Liu</surname></personname></author>
1701    <author><personname><firstname>Silvius</firstname><surname>Rus</surname></personname></author>
1702
1703    <copyright>
1704      <year>2009</year>
1705      <holder/>
1706    </copyright>
1707
1708    <publisher>
1709      <publishername>
1710	Proceedings of the 2009 International Symposium on Code Generation
1711	and Optimization
1712      </publishername>
1713    </publisher>
1714  </biblioentry>
1715</bibliography>
1716
1717
1718</chapter>
1719