<chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
	 xml:id="manual.ext.profile_mode" xreflabel="Profile Mode">
<?dbhtml filename="profile_mode.html"?>

<info><title>Profile Mode</title>
  <keywordset>
    <keyword>C++</keyword>
    <keyword>library</keyword>
    <keyword>profile</keyword>
  </keywordset>
</info>




<section xml:id="manual.ext.profile_mode.intro" xreflabel="Intro"><info><title>Intro</title></info>

  <para>
  <emphasis>Goal: </emphasis>Give performance improvement advice based on
  recognition of suboptimal usage patterns of the standard library.
  </para>

  <para>
  <emphasis>Method: </emphasis>Wrap the standard library code. Insert
  calls to an instrumentation library to record the internal state of
  various components at interesting entry/exit points to/from the standard
  library. Process the trace, recognize suboptimal patterns, and give advice.
  For details, see the
  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4907670/">Perflint
  paper presented at CGO 2009</link>.
  </para>
  <para>
  <emphasis>Strengths: </emphasis>
<itemizedlist>
  <listitem><para>
  Unintrusive solution. The application code does not require any
  modification.
  </para></listitem>
  <listitem><para> The advice is call-context sensitive, and thus capable of
  precisely identifying interesting dynamic performance behavior.
  </para></listitem>
  <listitem><para>
  The overhead model is pay-per-use: when you turn off a diagnostic class
  at compile time, its overhead disappears.
  </para></listitem>
</itemizedlist>
  </para>
  <para>
  <emphasis>Drawbacks: </emphasis>
<itemizedlist>
  <listitem><para>
  You must recompile the application code with custom options.
  </para></listitem>
  <listitem><para>You must run the application on representative input;
  the advice is input dependent.
  </para></listitem>
  <listitem><para>
  The execution time will increase, in some cases by large factors.
  </para></listitem>
</itemizedlist>
  </para>


<section xml:id="manual.ext.profile_mode.using" xreflabel="Using"><info><title>Using the Profile Mode</title></info>


  <para>
  This is the anticipated common workflow for program <code>foo.cc</code>:
<programlisting>
$ cat foo.cc
#include <vector>
int main() {
  std::vector<int> v;
  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);
}

$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
vector-to-list: improvement = 5: call stack = 0x804842c ...
    : advice = change std::vector to std::list
vector-size: improvement = 3: call stack = 0x804842c ...
    : advice = change initial container size from 0 to 1024
</programlisting>
  </para>

  <para>
  Anatomy of a warning:
  <itemizedlist>
  <listitem>
    <para>
    Warning id. This is a short descriptive string for the class
    that this warning belongs to, e.g., "vector-to-list".
    </para>
  </listitem>
  <listitem>
    <para>
    Estimated improvement. This is an approximation of the benefit expected
    from implementing the change suggested by the warning. It is given on
    a log10 scale. Negative values mean that the alternative would actually
    do worse than the current choice.
    In the example above, 5 comes from the fact that the overhead of
    inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,
    which is on the order of 10^5. The improvement from setting the initial
    size to 1024 is in the range of 10^3, since the overhead of dynamic
    resizing is linear in this case.
    </para>
  </listitem>
  <listitem>
    <para>
    Call stack. Currently, the addresses are printed without
    symbol name or code location attribution.
    Users are expected to postprocess the output using, for instance, addr2line.
    </para>
  </listitem>
  <listitem>
    <para>
    The warning message. For some warnings, this is static text, e.g.,
    "change vector to list". For other warnings, such as the one above,
    the message contains numeric advice, e.g., the suggested initial size
    of the vector.
    </para>
  </listitem>
  </itemizedlist>
  </para>

  <para>Three files are generated. <code>libstdcxx-profile.txt</code>
  contains human readable advice. <code>libstdcxx-profile.raw</code>
  contains implementation specific data about each diagnostic.
  Its format is not documented, but it is sufficient to generate
  all the advice given in <code>libstdcxx-profile.txt</code>. The advantage
  of keeping this raw format is that traces from multiple executions can
  be aggregated simply by concatenating the raw traces. We intend to
  offer an external utility program that can issue advice from a trace.
  <code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
  parameters used. To alter parameters, edit this file and rename it to
  <code>libstdcxx-profile.conf</code>.
  </para>

  <para>Advice is given regardless of whether the transformation is valid.
  For instance, we advise changing a map to an unordered_map even if the
  application semantics require that data be ordered.
  We believe such warnings can help users better understand the performance
  behavior of their application, which can lead to changes
  at a higher abstraction level.
  </para>

</section>


<section xml:id="manual.ext.profile_mode.tuning" xreflabel="Tuning"><info><title>Tuning the Profile Mode</title></info>


  <para>Compile time switches and environment variables (see also file
  profiler.h). Unless specified otherwise, they can be set at compile time
  using -D_<name> or by setting variable <name>
  in the environment where the program is run, before starting execution.
  <itemizedlist>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_NO_<diagnostic></code>:
  disable specific diagnostics.
  See section Diagnostics for possible values.
  (Environment variables not supported.)
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
  path for the output files.
  </para></listitem>
  <listitem><para><code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to the
  maximum number of warnings desired. The default value is 10.</para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0,
  the advice will
  be collected and reported for the program as a whole, and not for each
  call context.
  This could also be used in continuous regression tests, where you
  just need to know whether there is a regression or not.
  The default value is 32.
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
  set a limit on how much memory to use for the accounting tables of each
  diagnostic type. When this limit is reached, new events are ignored
  until the memory usage drops below the limit. Generally, this means
  that newly created containers will not be instrumented until some
  live containers are deleted. The default is 128 MB.
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_PROFILE_NO_THREADS</code>:
  make the library not use threads. If thread local storage (TLS) is not
  available, you will get a preprocessor error asking you to set
  -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.
  Multithreaded execution without TLS is not supported.
  (Environment variable not supported.)
  </para></listitem>
  <listitem><para>
  <code>_GLIBCXX_HAVE_EXECINFO_H</code>:
  this name should be defined automatically at library configuration time.
  If your library was configured without <code>execinfo.h</code>, but
  you have it in your include path, you can define it explicitly. Without
  it, advice is collected for the program as a whole, and not for each
  call context.
  (Environment variable not supported.)
  </para></listitem>
  </itemizedlist>
  </para>

</section>

</section>


<section xml:id="manual.ext.profile_mode.design" xreflabel="Design"><info><title>Design</title></info>
<?dbhtml filename="profile_mode_design.html"?>


<para>
</para>
<table frame="all" xml:id="table.profile_code_loc">
<title>Profile Code Location</title>

<tgroup cols="2" align="left" colsep="1" rowsep="1">
<colspec colname="c1"/>
<colspec colname="c2"/>

<thead>
  <row>
    <entry>Code Location</entry>
    <entry>Use</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><code>libstdc++-v3/include/std/*</code></entry>
    <entry>Preprocessor code to redirect to profile extension headers.</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/*</code></entry>
    <entry>Profile extension public headers (map, vector, ...).</entry>
  </row>
  <row>
    <entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
    <entry>Profile extension internals. Implementation files are
    only included from <code>impl/profiler.h</code>, which is the only
    file included from the public headers.</entry>
  </row>
</tbody>
</tgroup>
</table>

<para>
</para>

<section xml:id="manual.ext.profile_mode.design.wrapper" xreflabel="Wrapper"><info><title>Wrapper Model</title></info>

  <para>
  In order to get our instrumented library version included instead of the
  release one,
  we use the same wrapper model as the debug mode.
  We subclass entities from the release version.
  Wherever
  <code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
  <code>std::__norm</code>, whereas the profile namespace is
  <code>std::__profile</code>. Using plain <code>std</code> translates
  into <code>std::__profile</code>.
  </para>
  <para>
  Whenever possible, we try to wrap at the public interface level, e.g.,
  in <code>unordered_set</code> rather than in <code>hashtable</code>,
  in order not to depend on the implementation.
  </para>
  <para>
  Mixing object files built with and without the profile mode must
  not affect the program execution. However, there are no guarantees of
  the accuracy of diagnostics when using even a single object not built with
  <code>-D_GLIBCXX_PROFILE</code>.
  Currently, mixing the profile mode with the debug and parallel extensions is
  not allowed. Mixing them at compile time will result in preprocessor errors.
  Mixing them at link time is undefined.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.design.instrumentation" xreflabel="Instrumentation"><info><title>Instrumentation</title></info>

  <para>
  Instead of instrumenting every public entry and exit point,
  we chose to add instrumentation on demand, as needed
  by individual diagnostics.
  The main reason is that some diagnostics require us to extract bits of
  internal state that are particular only to that diagnostic.
  We plan to formalize this later, after we learn more about the requirements
  of several diagnostics.
  </para>
  <para>
  All the instrumentation points can be switched on and off using
  <code>-D[_NO]_GLIBCXX_PROFILE_<diagnostic></code> options.
  With all the instrumentation calls off, there should be negligible
  overhead over the release version. This property is needed to support
  diagnostics based on timing of internal operations.
  For such diagnostics,
  we anticipate turning most of the instrumentation off in order to prevent
  profiling overhead from polluting time measurements, and thus diagnostics.
  </para>
  <para>
  All the instrumentation on/off compile time switches live in
  <code>include/profile/profiler.h</code>.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.design.rtlib" xreflabel="Run Time Behavior"><info><title>Run Time Behavior</title></info>

  <para>
  For practical reasons, the instrumentation library processes the trace
  partially
  rather than dumping it to disk in raw form. Each event is processed when
  it occurs. A cost is usually attached to it, and it is aggregated into
  the database of a specific diagnostic class. The cost model
  is based largely on the standard's performance guarantees, but in some
  cases we use knowledge about GCC's standard library implementation.
  </para>
  <para>
  Information is indexed by (1) call stack and (2) instance id or address,
  to be able to understand and summarize precise creation-use-destruction
  dynamic chains. Although the analysis is sensitive to dynamic instances,
  the reports are only sensitive to call context. Whenever a dynamic instance
  is destroyed, we accumulate its effect into the corresponding entry for the
  call stack of its constructor location.
  </para>

  <para>
  For details, see the
  <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4907670/">paper presented at
  CGO 2009</link>.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.design.analysis" xreflabel="Analysis and Diagnostics"><info><title>Analysis and Diagnostics</title></info>

  <para>
  Final analysis takes place offline, and it is based entirely on the
  generated trace and debugging info in the application binary.
  See section Diagnostics for a list of analysis types that we plan to support.
  </para>
  <para>
  The input to the analysis is a table indexed by profile type and call stack.
  The data type for each entry depends on the profile type.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.design.cost-model" xreflabel="Cost Model"><info><title>Cost Model</title></info>

  <para>
  While cost models are likely to become complex as we get into
  more sophisticated analysis, we will try to follow a simple set of rules
  at the beginning.
  </para>
<itemizedlist>
  <listitem><para><emphasis>Relative benefit estimation:</emphasis>
  The idea is to estimate or measure the cost of all operations
  in the original scenario versus the scenario we advise switching to.
  For instance, when advising to change a vector to a list, an occurrence
  of the <code>insert</code> method will generally count as a benefit.
  Its magnitude depends on (1) the number of elements that get shifted
  and (2) whether it triggers a reallocation.
  </para></listitem>
  <listitem><para><emphasis>Synthetic measurements:</emphasis>
  We will measure the relative difference between similar operations on
  different containers. We plan to write a battery of small tests that
  compare the execution times of similar methods on different
  containers. The idea is to run these tests on the target machine.
  If this training phase is very quick, we may decide to perform it at
  library initialization time. The results can be cached on disk and reused
  across runs.
  </para></listitem>
  <listitem><para><emphasis>Timers:</emphasis>
  We plan to use timers for operations of larger granularity, such as sort.
  For instance, we can switch between different sort methods on the fly
  and report the one that performs best for each call context.
  </para></listitem>
  <listitem><para><emphasis>Show stoppers:</emphasis>
  We may decide that the presence of an operation nullifies the advice.
  For instance, when considering switching from <code>set</code> to
  <code>unordered_set</code>, if we detect use of operator <code>++</code>,
  we will simply not issue the advice, since this could signal that the use
  case requires a sorted container.</para></listitem>
</itemizedlist>

</section>


<section xml:id="manual.ext.profile_mode.design.reports" xreflabel="Reports"><info><title>Reports</title></info>

  <para>
There are two types of reports. First, if we recognize a pattern for which
we have a substitute that is likely to give better performance, we print
the advice and the estimated performance gain. The advice is usually associated
with a code position and possibly a call stack.
  </para>
  <para>
Second, we report performance characteristics for which we do not have
a clear solution for improvement. For instance, we can point the user to
the top 10 <code>multimap</code> locations
which have the worst data locality in actual traversals.
Although this does not offer a solution,
it helps the user focus on the key problems and ignore the uninteresting ones.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.design.testing" xreflabel="Testing"><info><title>Testing</title></info>

  <para>
  First, we want to make sure we preserve the behavior of the release mode.
  You can just type <code>"make check-profile"</code>, which
  builds and runs the whole test suite in profile mode.
  </para>
  <para>
  Second, we want to test the correctness of each diagnostic.
  We created a <code>profile</code> directory in the test suite.
  Each diagnostic must come with at least two tests, one for false positives
  and one for false negatives.
  </para>
</section>

</section>

<section xml:id="manual.ext.profile_mode.api" xreflabel="API"><info><title>Extensions for Custom Containers</title></info>
<?dbhtml filename="profile_mode_api.html"?>


  <para>
  Many large projects use their own data structures instead of the ones in the
  standard library. If these data structures are similar in functionality
  to the standard library's, they can be instrumented with the same hooks
  that are used to instrument the standard library.
  The instrumentation API is exposed in file
  <code>profiler.h</code> (look for "Instrumentation hooks").
  </para>

</section>


<section xml:id="manual.ext.profile_mode.cost_model" xreflabel="Cost Model"><info><title>Empirical Cost Model</title></info>
<?dbhtml filename="profile_mode_cost_model.html"?>


  <para>
  Currently, the cost model uses formulas with predefined relative weights
  for alternative containers or container implementations. For instance,
  iterating through a vector is X times faster than iterating through a list.
  </para>
  <para>
  (Under development.)
  We are working on customizing this to a particular machine by providing
  an automated way to compute the actual relative weights for operations
  on the given machine.
  </para>
  <para>
  (Under development.)
  We plan to provide a performance parameter database format that can be
  filled in either by hand or by an automated training mechanism.
  The analysis module will then use this database instead of the built-in
  generic parameters.
  </para>

</section>


<section xml:id="manual.ext.profile_mode.implementation" xreflabel="Implementation"><info><title>Implementation Issues</title></info>
<?dbhtml filename="profile_mode_impl.html"?>



<section xml:id="manual.ext.profile_mode.implementation.stack" xreflabel="Stack Traces"><info><title>Stack Traces</title></info>

  <para>
  Accurate stack traces are needed during profiling since we group events by
  call context and dynamic instance. Without accurate traces, diagnostics
  may be hard to interpret. For instance, when giving advice to the user
  it is imperative to reference application code, not library code.
  </para>
  <para>
  Currently we are using the libc <code>backtrace</code> routine to get
  stack traces.
  <code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code> can be set
  to 0 if you are willing to give up call context information, or to a small
  positive value to reduce run time overhead.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.implementation.symbols" xreflabel="Symbolization"><info><title>Symbolization of Instruction Addresses</title></info>

  <para>
  The profiling and analysis phases use only instruction addresses.
  An external utility such as addr2line is needed to postprocess the result.
  We do not plan to add symbolization support in the profile extension.
  This would require access to symbol tables, debug information tables,
  external programs or libraries and other system dependent information.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.implementation.concurrency" xreflabel="Concurrency"><info><title>Concurrency</title></info>

  <para>
  Our current model is simplistic, but precise.
  We cannot afford to approximate because some of our diagnostics require
  precise matching of operations to container instance and call context.
  During profiling, we keep a single information table per diagnostic.
  There is a single lock per information table.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.implementation.stdlib-in-proflib" xreflabel="Using the Standard Library in the Runtime Library"><info><title>Using the Standard Library in the Instrumentation Implementation</title></info>

  <para>
  As much as we would like to avoid uses of libstdc++ within our
  instrumentation library, containers such as unordered_map are very
  appealing. We plan to use them as long as they are named properly
  to avoid ambiguity.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.implementation.malloc-hooks" xreflabel="Malloc Hooks"><info><title>Malloc Hooks</title></info>

  <para>
  User applications/libraries can provide malloc hooks.
  When the implementation of the malloc hooks uses libstdc++, there can
  be an infinite cycle between the profile mode instrumentation and the
  malloc hook code.
  </para>
  <para>
  We protect against reentrance to the profile mode instrumentation code,
  which should avoid this problem in most cases.
  The protection mechanism is thread safe and exception safe.
  This mechanism does not prevent reentrance to the malloc hook itself,
  which could still result in deadlock if, for instance, the malloc hook
  uses non-recursive locks.
  XXX: A definitive solution to this problem would be for the profile extension
  to use a custom allocator internally, and perhaps not to use libstdc++.
  </para>
</section>


<section xml:id="manual.ext.profile_mode.implementation.construction-destruction" xreflabel="Construction and Destruction of Global Objects"><info><title>Construction and Destruction of Global Objects</title></info>

  <para>
  The profiling library state is initialized at the first call to a profiling
  method.
  This allows us to record the construction of all global objects.
  However, we cannot do the same at destruction time. The trace is written
  by a function registered by <code>atexit</code>, and thus invoked by
  <code>exit</code>.
  </para>
</section>

</section>


<section xml:id="manual.ext.profile_mode.developer" xreflabel="Developer Information"><info><title>Developer Information</title></info>
<?dbhtml filename="profile_mode_devel.html"?>


<section xml:id="manual.ext.profile_mode.developer.bigpic" xreflabel="Big Picture"><info><title>Big Picture</title></info>


  <para>The profile mode headers are included with
  <code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
  <code>include/std/*</code>.
  </para>

  <para>Instrumented implementations are provided in
  <code>include/profile/*</code>. All instrumentation hooks are macros
  defined in <code>include/profile/profiler.h</code>.
  </para>

  <para>The entire implementation of the instrumentation hooks is in
  <code>include/profile/impl/*</code>. Although all the code gets included,
  and is thus publicly visible, only a small number of functions are called
  from outside this directory. All calls to hook implementations must be
  made through macros defined in <code>profiler.h</code>. The macro
  must ensure (1) that the call is guarded against reentrance and
  (2) that the call can be turned off at compile time using a
  <code>-D_GLIBCXX_PROFILE_...</code> compiler option.
  </para>

</section>

<section xml:id="manual.ext.profile_mode.developer.howto" xreflabel="How To Add A Diagnostic"><info><title>How To Add A Diagnostic</title></info>


  <para>Let's say the diagnostic name is "magic".
  </para>

  <para>If you need to instrument a header not already under
  <code>include/profile/*</code>, first edit the corresponding header
  under <code>include/std/</code> and add a preprocessor directive such
  as the one in <code>include/std/vector</code>:
<programlisting>
#ifdef _GLIBCXX_PROFILE
# include <profile/vector>
#endif
</programlisting>
  </para>

  <para>If the file you need to instrument is not yet under
  <code>include/profile/</code>, make a copy of the one in
  <code>include/debug</code>, or of the main implementation.
  You'll need to include the main implementation and inherit the classes
  you want to instrument. Then define the methods you want to instrument,
  define the instrumentation hooks and add calls to them.
  Look at <code>include/profile/vector</code> for an example.
  </para>

  <para>Add macros for the instrumentation hooks in
  <code>include/profile/impl/profiler.h</code>.
  Hook names must start with <code>__profcxx_</code>.
  Make sure they expand
  to no code with <code>-D_NO_GLIBCXX_PROFILE_MAGIC</code>.
  Make sure all calls to any method in namespace <code>__gnu_profile</code>
  are protected against reentrance using macro
  <code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
  All names of methods in namespace <code>__gnu_profile</code> called from
  <code>profiler.h</code> must start with <code>__trace_magic_</code>.
  </para>

  <para>Add the implementation of the diagnostic.
  <itemizedlist>
  <listitem><para>
  Create a new file <code>include/profile/impl/profiler_magic.h</code>.
  </para></listitem>
  <listitem><para>
  Define class <code>__magic_info: public __object_info_base</code>.
  This is the representation of a line in the object table.
  The <code>__merge</code> method is used to aggregate information
  across all dynamic instances created at the same call context.
  The <code>__magnitude</code> method must return the estimation of the
  benefit as a number of small operations, e.g., the number of words copied.
  The <code>__write</code> method is used to produce the raw trace.
  The <code>__advice</code> method is used to produce the advice string.
  </para></listitem>
  <listitem><para>
  Define class <code>__magic_stack_info: public __magic_info</code>.
  This defines the content of a line in the stack table.
  </para></listitem>
  <listitem><para>
  Define class <code>__trace_magic: public __trace_base<__magic_info,
  __magic_stack_info></code>.
  It defines the content of the trace associated with this diagnostic.
  </para></listitem>
  </itemizedlist>
  </para>

  <para>Add initialization and reporting calls in
  <code>include/profile/impl/profiler_trace.h</code>. Use
  <code>__trace_vector_to_list</code> as an example.
  </para>

  <para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
  </para>
</section>
</section>

<section xml:id="manual.ext.profile_mode.diagnostics"><info><title>Diagnostics</title></info>
<?dbhtml filename="profile_mode_diagnostics.html"?>


  <para>
  The table below presents all the diagnostics we intend to implement.
  Each diagnostic has a corresponding compile time switch
  <code>-D_GLIBCXX_PROFILE_<diagnostic></code>.
  Groups of related diagnostics can be turned on with a single switch.
  For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
  <code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
  -D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
  </para>

  <para>
  The benefit, cost, expected frequency and accuracy of each diagnostic
  were given a grade from 1 to 10, where 10 is highest.
  A high benefit means that, if the diagnostic is accurate, the expected
  performance improvement is high.
  A high cost means that turning this diagnostic on leads to high slowdown.
  A high frequency means that we expect this to occur relatively often.
  A high accuracy means that the diagnostic is unlikely to be wrong.
  These grades are not perfect. They are just meant to guide users with
  specific needs or time budgets.
  </para>

<table frame="all" xml:id="table.profile_diagnostics">
<title>Profile Diagnostics</title>

<tgroup cols="7" align="left" colsep="1" rowsep="1">
<colspec colname="c1"/>
<colspec colname="c2"/>
<colspec colname="c3"/>
<colspec colname="c4"/>
<colspec colname="c5"/>
<colspec colname="c6"/>
<colspec colname="c7"/>

<thead>
  <row>
    <entry>Group</entry>
    <entry>Flag</entry>
    <entry>Benefit</entry>
    <entry>Cost</entry>
    <entry>Freq.</entry>
    <entry>Accuracy</entry>
    <entry>Implemented</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.containers">
    CONTAINERS</link></entry>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_small">
    HASHTABLE_TOO_SMALL</link></entry>
    <entry>10</entry>
    <entry>1</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_large">
    HASHTABLE_TOO_LARGE</link></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.inefficient_hash">
    INEFFICIENT_HASH</link></entry>
    <entry>7</entry>
    <entry>3</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:href="#manual.ext.profile_mode.analysis.vector_too_small">
    VECTOR_TOO_SMALL</link></entry>
    <entry>8</entry>
    <entry>1</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_too_large">
    VECTOR_TOO_LARGE</link></entry>
    <entry>5</entry>
    <entry>1</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_hashtable">
    VECTOR_TO_HASHTABLE</link></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry/>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_to_vector">
    HASHTABLE_TO_VECTOR</link></entry>
    <entry>7</entry>
    <entry>7</entry>
    <entry/>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_list">
    VECTOR_TO_LIST</link></entry>
    <entry>8</entry>
    <entry>5</entry>
    <entry/>
    <entry>10</entry>
    <entry>yes</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.list_to_vector">
    LIST_TO_VECTOR</link></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry/>
    <entry>10</entry>
    <entry>no</entry>
  </row>
  <row>
    <entry/>
    <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
    ORDERED_TO_UNORDERED</link></entry>
    <entry>10</entry>
    <entry>5</entry>
    <entry/>
    <entry>10</entry>
    <entry>only map/unordered_map</entry>
816 </row> 817 <row> 818 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms"> 819 ALGORITHMS</link></entry> 820 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms.sort"> 821 SORT</link></entry> 822 <entry>7</entry> 823 <entry>8</entry> 824 <entry/> 825 <entry>7</entry> 826 <entry>no</entry> 827 </row> 828 <row> 829 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality"> 830 LOCALITY</link></entry> 831 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.sw_prefetch"> 832 SOFTWARE_PREFETCH</link></entry> 833 <entry>8</entry> 834 <entry>8</entry> 835 <entry/> 836 <entry>5</entry> 837 <entry>no</entry> 838 </row> 839 <row> 840 <entry/> 841 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.linked"> 842 RBTREE_LOCALITY</link></entry> 843 <entry>4</entry> 844 <entry>8</entry> 845 <entry/> 846 <entry>5</entry> 847 <entry>no</entry> 848 </row> 849 <row> 850 <entry/> 851 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.mthread.false_share"> 852 FALSE_SHARING</link></entry> 853 <entry>8</entry> 854 <entry>10</entry> 855 <entry/> 856 <entry>10</entry> 857 <entry>no</entry> 858 </row> 859</tbody> 860</tgroup> 861</table> 862 863<section xml:id="manual.ext.profile_mode.analysis.template" xreflabel="Template"><info><title>Diagnostic Template</title></info> 864 865<itemizedlist> 866 <listitem><para><emphasis>Switch:</emphasis> 867 <code>_GLIBCXX_PROFILE_<diagnostic></code>. 868 </para></listitem> 869 <listitem><para><emphasis>Goal:</emphasis> What problem will it diagnose? 870 </para></listitem> 871 <listitem><para><emphasis>Fundamentals:</emphasis>. 
 872         What is the fundamental reason why this is a problem?</para></listitem> 873   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 874         Percentage reduction in execution time.  When reduction is more than 875         a constant factor, describe the reduction rate formula. 876   </para></listitem> 877   <listitem><para><emphasis>Recommendation:</emphasis> 878         What would the advice look like?</para></listitem> 879   <listitem><para><emphasis>To instrument:</emphasis> 880         What libstdc++ components need to be instrumented?</para></listitem> 881   <listitem><para><emphasis>Analysis:</emphasis> 882         How do we decide when to issue the advice?</para></listitem> 883   <listitem><para><emphasis>Cost model:</emphasis> 884         How do we measure benefits?  Math goes here.</para></listitem> 885   <listitem><para><emphasis>Example:</emphasis> 886 <programlisting> 887 program code 888 ... 889 advice sample 890 </programlisting> 891 </para></listitem> 892 </itemizedlist> 893 </section> 894 895 896 <section xml:id="manual.ext.profile_mode.analysis.containers" xreflabel="Containers"><info><title>Containers</title></info> 897 898 899 <para> 900 <emphasis>Switch:</emphasis> 901   <code>_GLIBCXX_PROFILE_CONTAINERS</code>. 902 </para> 903 904 <section xml:id="manual.ext.profile_mode.analysis.hashtable_too_small" xreflabel="Hashtable Too Small"><info><title>Hashtable Too Small</title></info> 905 906 <itemizedlist> 907   <listitem><para><emphasis>Switch:</emphasis> 908   <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>. 909   </para></listitem> 910   <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many 911   rehash operations, small construction size and large destruction size. 912   </para></listitem> 913   <listitem><para><emphasis>Fundamentals:</emphasis> Rehash is very expensive. 914   Read content, follow chains within bucket, evaluate hash function, place at 915   new location in different order.</para></listitem> 916   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%.
917 Code similar to example below. 918 </para></listitem> 919 <listitem><para><emphasis>Recommendation:</emphasis> 920 Set initial size to N at construction site S. 921 </para></listitem> 922 <listitem><para><emphasis>To instrument:</emphasis> 923 <code>unordered_set, unordered_map</code> constructor, destructor, rehash. 924 </para></listitem> 925 <listitem><para><emphasis>Analysis:</emphasis> 926 For each dynamic instance of <code>unordered_[multi]set|map</code>, 927 record initial size and call context of the constructor. 928 Record size increase, if any, after each relevant operation such as insert. 929 Record the estimated rehash cost.</para></listitem> 930 <listitem><para><emphasis>Cost model:</emphasis> 931 Number of individual rehash operations * cost per rehash.</para></listitem> 932 <listitem><para><emphasis>Example:</emphasis> 933<programlisting> 9341 unordered_set<int> us; 9352 for (int k = 0; k < 1000000; ++k) { 9363 us.insert(k); 9374 } 938 939foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations. 940</programlisting> 941</para></listitem> 942</itemizedlist> 943</section> 944 945 946<section xml:id="manual.ext.profile_mode.analysis.hashtable_too_large" xreflabel="Hashtable Too Large"><info><title>Hashtable Too Large</title></info> 947 948<itemizedlist> 949 <listitem><para><emphasis>Switch:</emphasis> 950 <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>. 951 </para></listitem> 952 <listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are 953 never filled up because fewer elements than reserved are ever 954 inserted. 955 </para></listitem> 956 <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which 957 is good in itself and may also improve memory reference performance through 958 fewer cache and TLB misses.</para></listitem> 959 <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown. 
960 </para></listitem> 961 <listitem><para><emphasis>Recommendation:</emphasis> 962 Set initial size to N at construction site S. 963 </para></listitem> 964 <listitem><para><emphasis>To instrument:</emphasis> 965 <code>unordered_set, unordered_map</code> constructor, destructor, rehash. 966 </para></listitem> 967 <listitem><para><emphasis>Analysis:</emphasis> 968 For each dynamic instance of <code>unordered_[multi]set|map</code>, 969 record initial size and call context of the constructor, and correlate it 970 with its size at destruction time. 971 </para></listitem> 972 <listitem><para><emphasis>Cost model:</emphasis> 973 Number of iteration operations + memory saved.</para></listitem> 974 <listitem><para><emphasis>Example:</emphasis> 975<programlisting> 9761 vector<unordered_set<int>> v(100000, unordered_set<int>(100)) ; 9772 for (int k = 0; k < 100000; ++k) { 9783 for (int j = 0; j < 10; ++j) { 9794 v[k].insert(k + j); 9805 } 9816 } 982 983foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N 984bytes of memory and M iteration steps. 985</programlisting> 986</para></listitem> 987</itemizedlist> 988</section> 989 990<section xml:id="manual.ext.profile_mode.analysis.inefficient_hash" xreflabel="Inefficient Hash"><info><title>Inefficient Hash</title></info> 991 992<itemizedlist> 993 <listitem><para><emphasis>Switch:</emphasis> 994 <code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>. 995 </para></listitem> 996 <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with polarized 997 distribution. 998 </para></listitem> 999 <listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform 1000 distribution may lead to long chains, thus possibly increasing complexity 1001 by a factor up to the number of elements. 1002 </para></listitem> 1003 <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up 1004 to container size. 
 1005   </para></listitem> 1006   <listitem><para><emphasis>Recommendation:</emphasis> Change hash function 1007   for container built at site S.  Distribution score = N.  Access score = S. 1008   Longest chain = C, in bucket B. 1009   </para></listitem> 1010   <listitem><para><emphasis>To instrument:</emphasis> 1011   <code>unordered_set, unordered_map</code> constructor, destructor, [], 1012   insert, iterator. 1013   </para></listitem> 1014   <listitem><para><emphasis>Analysis:</emphasis> 1015   Count the exact number of link traversals. 1016   </para></listitem> 1017   <listitem><para><emphasis>Cost model:</emphasis> 1018   Total number of links traversed.</para></listitem> 1019   <listitem><para><emphasis>Example:</emphasis> 1020 <programlisting> 1021 class dumb_hash { 1022  public: 1023   size_t operator() (int i) const { return 0; } 1024 }; 1025 ... 1026   unordered_set<int, dumb_hash> hs; 1027   ... 1028   for (int i = 0; i < COUNT; ++i) { 1029     hs.find(i); 1030   } 1031 </programlisting> 1032 </para></listitem> 1033 </itemizedlist> 1034 </section> 1035 1036 <section xml:id="manual.ext.profile_mode.analysis.vector_too_small" xreflabel="Vector Too Small"><info><title>Vector Too Small</title></info> 1037 1038 <itemizedlist> 1039   <listitem><para><emphasis>Switch:</emphasis> 1040   <code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>. 1041   </para></listitem> 1042   <listitem><para><emphasis>Goal:</emphasis> Detect vectors with many 1043   resize operations, small construction size and large destruction size. 1044   </para></listitem> 1045   <listitem><para><emphasis>Fundamentals:</emphasis> Resizing can be expensive. 1046   Copying large amounts of data takes time.  Resizing many small vectors may 1047   have allocation overhead and affect locality.</para></listitem> 1048   <listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
1049 </para></listitem> 1050 <listitem><para><emphasis>Recommendation:</emphasis> 1051 Set initial size to N at construction site S.</para></listitem> 1052 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>. 1053 </para></listitem> 1054 <listitem><para><emphasis>Analysis:</emphasis> 1055 For each dynamic instance of <code>vector</code>, 1056 record initial size and call context of the constructor. 1057 Record size increase, if any, after each relevant operation such as 1058 <code>push_back</code>. Record the estimated resize cost. 1059 </para></listitem> 1060 <listitem><para><emphasis>Cost model:</emphasis> 1061 Total number of words copied * time to copy a word.</para></listitem> 1062 <listitem><para><emphasis>Example:</emphasis> 1063<programlisting> 10641 vector<int> v; 10652 for (int k = 0; k < 1000000; ++k) { 10663 v.push_back(k); 10674 } 1068 1069foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves 1070copying 4000000 bytes and 20 memory allocations and deallocations. 1071</programlisting> 1072</para></listitem> 1073</itemizedlist> 1074</section> 1075 1076<section xml:id="manual.ext.profile_mode.analysis.vector_too_large" xreflabel="Vector Too Large"><info><title>Vector Too Large</title></info> 1077 1078<itemizedlist> 1079 <listitem><para><emphasis>Switch:</emphasis> 1080 <code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code> 1081 </para></listitem> 1082 <listitem><para><emphasis>Goal:</emphasis>Detect vectors which are 1083 never filled up because fewer elements than reserved are ever 1084 inserted. 1085 </para></listitem> 1086 <listitem><para><emphasis>Fundamentals:</emphasis>Save memory, which 1087 is good in itself and may also improve memory reference performance through 1088 fewer cache and TLB misses.</para></listitem> 1089 <listitem><para><emphasis>Sample runtime reduction:</emphasis>%. 
 1090   </para></listitem> 1091   <listitem><para><emphasis>Recommendation:</emphasis> 1092   Set initial size to N at construction site S.</para></listitem> 1093   <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code>. 1094   </para></listitem> 1095   <listitem><para><emphasis>Analysis:</emphasis> 1096   For each dynamic instance of <code>vector</code>, 1097   record initial size and call context of the constructor, and correlate it 1098   with its size at destruction time.</para></listitem> 1099   <listitem><para><emphasis>Cost model:</emphasis> 1100   Total amount of memory saved.</para></listitem> 1101   <listitem><para><emphasis>Example:</emphasis> 1102 <programlisting> 1103 1  vector<vector<int>> v(100000, vector<int>(100)); 1104 2  for (int k = 0; k < 100000; ++k) { 1105 3    for (int j = 0; j < 10; ++j) { 1106 4      v[k].push_back(k + j); 1107 5    } 1108 6  } 1109 1110 foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N 1111 bytes of memory and may reduce the number of cache and TLB misses. 1112 </programlisting> 1113 </para></listitem> 1114 </itemizedlist> 1115 </section> 1116 1117 <section xml:id="manual.ext.profile_mode.analysis.vector_to_hashtable" xreflabel="Vector to Hashtable"><info><title>Vector to Hashtable</title></info> 1118 1119 <itemizedlist> 1120   <listitem><para><emphasis>Switch:</emphasis> 1121   <code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>. 1122   </para></listitem> 1123   <listitem><para><emphasis>Goal:</emphasis> Detect uses of 1124   <code>vector</code> that can be substituted with <code>unordered_set</code> 1125   to reduce execution time. 1126   </para></listitem> 1127   <listitem><para><emphasis>Fundamentals:</emphasis> 1128   Linear search in a vector is very expensive, whereas searching in a hashtable 1129   is very quick.</para></listitem> 1130   <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up 1131   to container size.
 1132   </para></listitem> 1133   <listitem><para><emphasis>Recommendation:</emphasis> Replace 1134   <code>vector</code> with <code>unordered_set</code> at site S. 1135   </para></listitem> 1136   <listitem><para><emphasis>To instrument:</emphasis> <code>vector</code> 1137   operations and access methods.</para></listitem> 1138   <listitem><para><emphasis>Analysis:</emphasis> 1139   For each dynamic instance of <code>vector</code>, 1140   record call context of the constructor.  Issue the advice only if the 1141   only methods called on this <code>vector</code> are <code>push_back</code>, 1142   <code>insert</code> and <code>find</code>. 1143   </para></listitem> 1144   <listitem><para><emphasis>Cost model:</emphasis> 1145   (Cost(vector::push_back) + cost(vector::insert) + cost(find, vector)) - 1146   (cost(unordered_set::insert) + cost(unordered_set::find)). 1147   </para></listitem> 1148   <listitem><para><emphasis>Example:</emphasis> 1149 <programlisting> 1150 1  vector<int> v; 1151 ... 1152 2  for (int i = 0; i < 1000; ++i) { 1153 3    find(v.begin(), v.end(), i); 1154 4  } 1155 1156 foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000 1157 comparisons. 1158 </programlisting> 1159 </para></listitem> 1160 </itemizedlist> 1161 </section> 1162 1163 <section xml:id="manual.ext.profile_mode.analysis.hashtable_to_vector" xreflabel="Hashtable to Vector"><info><title>Hashtable to Vector</title></info> 1164 1165 <itemizedlist> 1166   <listitem><para><emphasis>Switch:</emphasis> 1167   <code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>. 1168   </para></listitem> 1169   <listitem><para><emphasis>Goal:</emphasis> Detect uses of 1170   <code>unordered_set</code> that can be substituted with <code>vector</code> 1171   to reduce execution time. 1172   </para></listitem> 1173   <listitem><para><emphasis>Fundamentals:</emphasis> 1174   Hashtable iterator is slower than vector iterator.</para></listitem> 1175   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 95%.
 1176   </para></listitem> 1177   <listitem><para><emphasis>Recommendation:</emphasis> Replace 1178   <code>unordered_set</code> with <code>vector</code> at site S. 1179   </para></listitem> 1180   <listitem><para><emphasis>To instrument:</emphasis> <code>unordered_set</code> 1181   operations and access methods.</para></listitem> 1182   <listitem><para><emphasis>Analysis:</emphasis> 1183   For each dynamic instance of <code>unordered_set</code>, 1184   record call context of the constructor.  Issue the advice only if the 1185   number of <code>find</code>, <code>insert</code> and <code>[]</code> 1186   operations on this <code>unordered_set</code> are small relative to the 1187   number of elements, and methods <code>begin</code> or <code>end</code> 1188   are invoked (suggesting iteration).</para></listitem> 1189   <listitem><para><emphasis>Cost model:</emphasis> 1190   Number of indirections saved by iterating over contiguous storage 1191   instead of chasing node pointers.</para></listitem> 1192   <listitem><para><emphasis>Example:</emphasis> 1193 <programlisting> 1194 1  unordered_set<int> us; 1195 ... 1196 2  int s = 0; 1197 3  for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) { 1198 4    s += *it; 1199 5  } 1200 1201 foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N 1202 indirections and may achieve better data locality. 1203 </programlisting> 1204 </para></listitem> 1205 </itemizedlist> 1206 </section> 1207 1208 <section xml:id="manual.ext.profile_mode.analysis.vector_to_list" xreflabel="Vector to List"><info><title>Vector to List</title></info> 1209 1210 <itemizedlist> 1211   <listitem><para><emphasis>Switch:</emphasis> 1212   <code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>. 1213   </para></listitem> 1214   <listitem><para><emphasis>Goal:</emphasis> Detect cases where 1215   <code>vector</code> could be substituted with <code>list</code> for 1216   better performance. 1217   </para></listitem> 1218   <listitem><para><emphasis>Fundamentals:</emphasis> 1219   Inserting in the middle of a vector is expensive compared to inserting in a 1220   list.
1220 </para></listitem> 1221 <listitem><para><emphasis>Sample runtime reduction:</emphasis>factor up to 1222 container size. 1223 </para></listitem> 1224 <listitem><para><emphasis>Recommendation:</emphasis>Replace vector with list 1225 at site S.</para></listitem> 1226 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code> 1227 operations and access methods.</para></listitem> 1228 <listitem><para><emphasis>Analysis:</emphasis> 1229 For each dynamic instance of <code>vector</code>, 1230 record the call context of the constructor. Record the overhead of each 1231 <code>insert</code> operation based on current size and insert position. 1232 Report instance with high insertion overhead. 1233 </para></listitem> 1234 <listitem><para><emphasis>Cost model:</emphasis> 1235 (Sum(cost(vector::method)) - Sum(cost(list::method)), for 1236 method in [push_back, insert, erase]) 1237 + (Cost(iterate vector) - Cost(iterate list))</para></listitem> 1238 <listitem><para><emphasis>Example:</emphasis> 1239<programlisting> 12401 vector<int> v; 12412 for (int i = 0; i < 10000; ++i) { 12423 v.insert(v.begin(), i); 12434 } 1244 1245foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000 1246operations. 1247</programlisting> 1248</para></listitem> 1249</itemizedlist> 1250</section> 1251 1252<section xml:id="manual.ext.profile_mode.analysis.list_to_vector" xreflabel="List to Vector"><info><title>List to Vector</title></info> 1253 1254<itemizedlist> 1255 <listitem><para><emphasis>Switch:</emphasis> 1256 <code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>. 1257 </para></listitem> 1258 <listitem><para><emphasis>Goal:</emphasis> Detect cases where 1259 <code>list</code> could be substituted with <code>vector</code> for 1260 better performance. 1261 </para></listitem> 1262 <listitem><para><emphasis>Fundamentals:</emphasis> 1263 Iterating through a vector is faster than through a list. 
 1264   </para></listitem> 1265   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 64%. 1266   </para></listitem> 1267   <listitem><para><emphasis>Recommendation:</emphasis> Replace list with vector 1268   at site S.</para></listitem> 1269   <listitem><para><emphasis>To instrument:</emphasis> <code>list</code> 1270   operations and access methods.</para></listitem> 1271   <listitem><para><emphasis>Analysis:</emphasis> 1272   Issue the advice if there are no <code>insert</code> operations. 1273   </para></listitem> 1274   <listitem><para><emphasis>Cost model:</emphasis> 1275   (Sum(cost(vector::method)) - Sum(cost(list::method)), for 1276   method in [push_back, insert, erase]) 1277   + (Cost(iterate vector) - Cost(iterate list))</para></listitem> 1278   <listitem><para><emphasis>Example:</emphasis> 1279 <programlisting> 1280 1  list<int> l; 1281 ... 1282 2  int sum = 0; 1283 3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) { 1284 4    sum += *it; 1285 5  } 1286 1287 foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect 1288 memory references. 1289 </programlisting> 1290 </para></listitem> 1291 </itemizedlist> 1292 </section> 1293 1294 <section xml:id="manual.ext.profile_mode.analysis.list_to_slist" xreflabel="List to Forward List"><info><title>List to Forward List (Slist)</title></info> 1295 1296 <itemizedlist> 1297   <listitem><para><emphasis>Switch:</emphasis> 1298   <code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>. 1299   </para></listitem> 1300   <listitem><para><emphasis>Goal:</emphasis> Detect cases where 1301   <code>list</code> could be substituted with <code>forward_list</code> for 1302   better performance. 1303   </para></listitem> 1304   <listitem><para><emphasis>Fundamentals:</emphasis> 1305   The memory footprint of a forward_list is smaller than that of a list. 1306   This has beneficial effects on the memory subsystem, e.g., fewer cache misses. 1307   </para></listitem> 1308   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 40%.
 1309   Note that the reduction is only noticeable if the size of the list 1310   node is in fact larger than that of the forward_list node.  For memory allocators 1311   with size classes, you will only notice an effect when the two node sizes 1312   belong to different allocator size classes. 1313   </para></listitem> 1314   <listitem><para><emphasis>Recommendation:</emphasis> Replace list with 1315   forward_list at site S.</para></listitem> 1316   <listitem><para><emphasis>To instrument:</emphasis> <code>list</code> 1317   operations and iteration methods.</para></listitem> 1318   <listitem><para><emphasis>Analysis:</emphasis> 1319   Issue the advice if there are no <code>backwards</code> traversals 1320   or insertion before a given node. 1321   </para></listitem> 1322   <listitem><para><emphasis>Cost model:</emphasis> 1323   Always true.</para></listitem> 1324   <listitem><para><emphasis>Example:</emphasis> 1325 <programlisting> 1326 1  list<int> l; 1327 ... 1328 2  int sum = 0; 1329 3  for (list<int>::iterator it = l.begin(); it != l.end(); ++it) { 1330 4    sum += *it; 1331 5  } 1332 1333 foo.cc:1: advice: Change "list" to "forward_list". 1334 </programlisting> 1335 </para></listitem> 1336 </itemizedlist> 1337 </section> 1338 1339 <section xml:id="manual.ext.profile_mode.analysis.assoc_ord_to_unord" xreflabel="Ordered to Unordered Associative Container"><info><title>Ordered to Unordered Associative Container</title></info> 1340 1341 <itemizedlist> 1342   <listitem><para><emphasis>Switch:</emphasis> 1343   <code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>. 1344   </para></listitem> 1345   <listitem><para><emphasis>Goal:</emphasis> Detect cases where ordered 1346   associative containers can be replaced with unordered ones. 1347   </para></listitem> 1348   <listitem><para><emphasis>Fundamentals:</emphasis> 1349   Insert and search are quicker in a hashtable than in 1350   a red-black tree.</para></listitem> 1351   <listitem><para><emphasis>Sample runtime reduction:</emphasis> 52%.
1352 </para></listitem> 1353 <listitem><para><emphasis>Recommendation:</emphasis> 1354 Replace set with unordered_set at site S.</para></listitem> 1355 <listitem><para><emphasis>To instrument:</emphasis> 1356 <code>set</code>, <code>multiset</code>, <code>map</code>, 1357 <code>multimap</code> methods.</para></listitem> 1358 <listitem><para><emphasis>Analysis:</emphasis> 1359 Issue the advice only if we are not using operator <code>++</code> on any 1360 iterator on a particular <code>[multi]set|map</code>. 1361 </para></listitem> 1362 <listitem><para><emphasis>Cost model:</emphasis> 1363 (Sum(cost(hashtable::method)) - Sum(cost(rbtree::method)), for 1364 method in [insert, erase, find]) 1365 + (Cost(iterate hashtable) - Cost(iterate rbtree))</para></listitem> 1366 <listitem><para><emphasis>Example:</emphasis> 1367<programlisting> 13681 set<int> s; 13692 for (int i = 0; i < 100000; ++i) { 13703 s.insert(i); 13714 } 13725 int sum = 0; 13736 for (int i = 0; i < 100000; ++i) { 13747 sum += *s.find(i); 13758 } 1376</programlisting> 1377</para></listitem> 1378</itemizedlist> 1379</section> 1380 1381</section> 1382 1383 1384 1385<section xml:id="manual.ext.profile_mode.analysis.algorithms" xreflabel="Algorithms"><info><title>Algorithms</title></info> 1386 1387 1388 <para><emphasis>Switch:</emphasis> 1389 <code>_GLIBCXX_PROFILE_ALGORITHMS</code>. 1390 </para> 1391 1392<section xml:id="manual.ext.profile_mode.analysis.algorithms.sort" xreflabel="Sorting"><info><title>Sort Algorithm Performance</title></info> 1393 1394<itemizedlist> 1395 <listitem><para><emphasis>Switch:</emphasis> 1396 <code>_GLIBCXX_PROFILE_SORT</code>. 1397 </para></listitem> 1398 <listitem><para><emphasis>Goal:</emphasis> Give measure of sort algorithm 1399 performance based on actual input. For instance, advise Radix Sort over 1400 Quick Sort for a particular call context. 
1401 </para></listitem> 1402 <listitem><para><emphasis>Fundamentals:</emphasis> 1403 See papers: 1404 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://dl.acm.org/citation.cfm?doid=1065944.1065981"> 1405 A framework for adaptive algorithm selection in STAPL</link> and 1406 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://ieeexplore.ieee.org/document/4228227/"> 1407 Optimizing Sorting with Machine Learning Algorithms</link>. 1408 </para></listitem> 1409 <listitem><para><emphasis>Sample runtime reduction:</emphasis>60%. 1410 </para></listitem> 1411 <listitem><para><emphasis>Recommendation:</emphasis> Change sort algorithm 1412 at site S from X Sort to Y Sort.</para></listitem> 1413 <listitem><para><emphasis>To instrument:</emphasis> <code>sort</code> 1414 algorithm.</para></listitem> 1415 <listitem><para><emphasis>Analysis:</emphasis> 1416 Issue the advice if the cost model tells us that another sort algorithm 1417 would do better on this input. Requires us to know what algorithm we 1418 are using in our sort implementation in release mode.</para></listitem> 1419 <listitem><para><emphasis>Cost model:</emphasis> 1420 Runtime(algo) for algo in [radix, quick, merge, ...]</para></listitem> 1421 <listitem><para><emphasis>Example:</emphasis> 1422<programlisting> 1423</programlisting> 1424</para></listitem> 1425</itemizedlist> 1426</section> 1427 1428</section> 1429 1430 1431<section xml:id="manual.ext.profile_mode.analysis.locality" xreflabel="Data Locality"><info><title>Data Locality</title></info> 1432 1433 1434 <para><emphasis>Switch:</emphasis> 1435 <code>_GLIBCXX_PROFILE_LOCALITY</code>. 1436 </para> 1437 1438<section xml:id="manual.ext.profile_mode.analysis.locality.sw_prefetch" xreflabel="Need Software Prefetch"><info><title>Need Software Prefetch</title></info> 1439 1440<itemizedlist> 1441 <listitem><para><emphasis>Switch:</emphasis> 1442 <code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>. 
1443 </para></listitem> 1444 <listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect 1445 memory accesses that are not regular, thus cannot be predicted by 1446 hardware prefetchers. 1447 </para></listitem> 1448 <listitem><para><emphasis>Fundamentals:</emphasis> 1449 Indirect references are hard to predict and are very expensive when they 1450 miss in caches.</para></listitem> 1451 <listitem><para><emphasis>Sample runtime reduction:</emphasis>25%. 1452 </para></listitem> 1453 <listitem><para><emphasis>Recommendation:</emphasis> Insert prefetch 1454 instruction.</para></listitem> 1455 <listitem><para><emphasis>To instrument:</emphasis> Vector iterator and 1456 access operator []. 1457 </para></listitem> 1458 <listitem><para><emphasis>Analysis:</emphasis> 1459 First, get cache line size and page size from system. 1460 Then record iterator dereference sequences for which the value is a pointer. 1461 For each sequence within a container, issue a warning if successive pointer 1462 addresses are not within cache lines and do not form a linear pattern 1463 (otherwise they may be prefetched by hardware). 1464 If they also step across page boundaries, make the warning stronger. 1465 </para> 1466 <para>The same analysis applies to containers other than vector. 1467 However, we cannot give the same advice for linked structures, such as list, 1468 as there is no random access to the n-th element. The user may still be 1469 able to benefit from this information, for instance by employing frays (user 1470 level light weight threads) to hide the latency of chasing pointers. 1471 </para> 1472 <para> 1473 This analysis is a little oversimplified. A better cost model could be 1474 created by understanding the capability of the hardware prefetcher. 1475 This model could be trained automatically by running a set of synthetic 1476 cases. 
1477 </para> 1478 </listitem> 1479 <listitem><para><emphasis>Cost model:</emphasis> 1480 Total distance between pointer values of successive elements in vectors 1481 of pointers.</para></listitem> 1482 <listitem><para><emphasis>Example:</emphasis> 1483<programlisting> 14841 int zero = 0; 14852 vector<int*> v(10000000, &zero); 14863 for (int k = 0; k < 10000000; ++k) { 14874 v[random() % 10000000] = new int(k); 14885 } 14896 for (int j = 0; j < 10000000; ++j) { 14907 count += (*v[j] == 0 ? 0 : 1); 14918 } 1492 1493foo.cc:7: advice: Insert prefetch instruction. 1494</programlisting> 1495</para></listitem> 1496</itemizedlist> 1497</section> 1498 1499<section xml:id="manual.ext.profile_mode.analysis.locality.linked" xreflabel="Linked Structure Locality"><info><title>Linked Structure Locality</title></info> 1500 1501<itemizedlist> 1502 <listitem><para><emphasis>Switch:</emphasis> 1503 <code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>. 1504 </para></listitem> 1505 <listitem><para><emphasis>Goal:</emphasis> Give measure of locality of 1506 objects stored in linked structures (lists, red-black trees and hashtables) 1507 with respect to their actual traversal patterns. 1508 </para></listitem> 1509 <listitem><para><emphasis>Fundamentals:</emphasis>Allocation can be tuned 1510 to a specific traversal pattern, to result in better data locality. 1511 See paper: 1512 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://parasol.tamu.edu/publications/download.php?file_id=570"> 1513 Custom Memory Allocation for Free</link> by Jula and Rauchwerger. 1514 </para></listitem> 1515 <listitem><para><emphasis>Sample runtime reduction:</emphasis>30%. 1516 </para></listitem> 1517 <listitem><para><emphasis>Recommendation:</emphasis> 1518 High scatter score N for container built at site S. 
1519 Consider changing allocation sequence or choosing a structure conscious 1520 allocator.</para></listitem> 1521 <listitem><para><emphasis>To instrument:</emphasis> Methods of all 1522 containers using linked structures.</para></listitem> 1523 <listitem><para><emphasis>Analysis:</emphasis> 1524 First, get cache line size and page size from system. 1525 Then record the number of successive elements that are on different line 1526 or page, for each traversal method such as <code>find</code>. Give advice 1527 only if the ratio between this number and the number of total node hops 1528 is above a threshold.</para></listitem> 1529 <listitem><para><emphasis>Cost model:</emphasis> 1530 Sum(same_cache_line(this,previous))</para></listitem> 1531 <listitem><para><emphasis>Example:</emphasis> 1532<programlisting> 1533 1 set<int> s; 1534 2 for (int i = 0; i < 10000000; ++i) { 1535 3 s.insert(i); 1536 4 } 1537 5 set<int> s1, s2; 1538 6 for (int i = 0; i < 10000000; ++i) { 1539 7 s1.insert(i); 1540 8 s2.insert(i); 1541 9 } 1542... 1543 // Fast, better locality. 154410 for (set<int>::iterator it = s.begin(); it != s.end(); ++it) { 154511 sum += *it; 154612 } 1547 // Slow, elements are further apart. 154813 for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) { 154914 sum += *it; 155015 } 1551 1552foo.cc:5: advice: High scatter score NNN for set built here. Consider changing 1553the allocation sequence or switching to a structure conscious allocator. 1554</programlisting> 1555</para></listitem> 1556</itemizedlist> 1557</section> 1558 1559</section> 1560 1561 1562<section xml:id="manual.ext.profile_mode.analysis.mthread" xreflabel="Multithreaded Data Access"><info><title>Multithreaded Data Access</title></info> 1563 1564 1565 <para> 1566 The diagnostics in this group are not meant to be implemented short term. 1567 They require compiler support to know when container elements are written 1568 to. Instrumentation can only tell us when elements are referenced. 
  </para>

  <para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
  </para>

<section xml:id="manual.ext.profile_mode.analysis.mthread.ddtest" xreflabel="Dependence Violations at Container Level"><info><title>Data Dependence Violations at Container Level</title></info>

<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_DDTEST</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect container elements
  that are referenced from multiple threads within a parallel region or
  across parallel regions.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis>
  Sharing data between threads requires communication and perhaps locking,
  which may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> ?%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Change the data
  distribution or the parallel algorithm.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access
  methods and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  Keep a shadow for each container. Record iterator dereferences and
  container member accesses. Issue advice for elements referenced by
  multiple threads.
  See paper: <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://dl.acm.org/citation.cfm?id=207110.207148">
  The LRPD test: speculative run-time parallelization of loops with
  privatization and reduction parallelization</link>.
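  As a hypothetical sketch of the pattern this analysis would flag, the two
  threads below write disjoint halves of a shared vector but both reference
  element 0 (plain <code>std::thread</code> is used purely for illustration):
<programlisting>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
  std::vector<long> v(100, 0);
  // Both threads reference v[0]: an element accessed from multiple
  // threads, which this diagnostic would report for the site where
  // v was constructed.
  std::thread t1([&v] { for (int i = 1;  i < 50;  ++i) v[i] = v[0] + i; });
  std::thread t2([&v] { for (int i = 50; i < 100; ++i) v[i] = v[0] + i; });
  t1.join();
  t2.join();
  assert(v[0] == 0 && v[99] == 99);
  std::cout << "ok\n";
  return 0;
}
</programlisting>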
  </para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to elements referenced from multiple threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
</programlisting>
</para></listitem>
</itemizedlist>
</section>

<section xml:id="manual.ext.profile_mode.analysis.mthread.false_share" xreflabel="False Sharing"><info><title>False Sharing</title></info>

<itemizedlist>
  <listitem><para><emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
  </para></listitem>
  <listitem><para><emphasis>Goal:</emphasis> Detect elements in the
  same container which share a cache line, are written by at least one
  thread, and accessed by different threads.
  </para></listitem>
  <listitem><para><emphasis>Fundamentals:</emphasis> Under these conditions,
  cache coherence protocols require communication to invalidate lines,
  which may be expensive.
  </para></listitem>
  <listitem><para><emphasis>Sample runtime reduction:</emphasis> 68%.
  </para></listitem>
  <listitem><para><emphasis>Recommendation:</emphasis> Reorganize the
  container or use padding to avoid false sharing.</para></listitem>
  <listitem><para><emphasis>To instrument:</emphasis> Container access
  methods and iterators.
  </para></listitem>
  <listitem><para><emphasis>Analysis:</emphasis>
  First, get the cache line size.
  For each shared container, record all the associated iterator dereferences
  and member access methods together with the thread id. Compare the address
  lists across threads to detect references in two different threads to the
  same cache line. Issue a warning only if the ratio to total references is
  significant.
  Do the same for iterator dereference values if they are
  pointers.</para></listitem>
  <listitem><para><emphasis>Cost model:</emphasis>
  Number of accesses to the same cache line from different threads.
  </para></listitem>
  <listitem><para><emphasis>Example:</emphasis>
<programlisting>
1 vector<int> v(2, 0);
2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
3 for (int i = 0; i < SIZE; ++i) {
4   v[i % 2] += i;
5 }

OMP_NUM_THREADS=2 ./a.out
foo.cc:1: advice: Change container structure or padding to avoid false
sharing in multithreaded access at foo.cc:4. Detected N shared cache lines.
</programlisting>
</para></listitem>
</itemizedlist>
</section>

</section>


<section xml:id="manual.ext.profile_mode.analysis.statistics" xreflabel="Statistics"><info><title>Statistics</title></info>

<para>
<emphasis>Switch:</emphasis>
  <code>_GLIBCXX_PROFILE_STATISTICS</code>.
</para>

<para>
  In some cases the cost model may not tell us anything because the costs
  appear to offset the benefits. Consider the choice between a vector and
  a list. When there are both inserts and iteration, automatic advice
  may not be issued. However, the programmer may still be able to make use
  of this information in a different way.
</para>
<para>
  This diagnostic will not issue any advice, but it will print statistics
  for each container construction site. The statistics will contain the
  cost of each operation actually performed on the container.
</para>

</section>


</section>


<bibliography xml:id="profile_mode.biblio"><info><title>Bibliography</title></info>

  <biblioentry>
    <citetitle>
      Perflint: A Context Sensitive Performance Advisor for C++ Programs
    </citetitle>

    <author><personname><firstname>Lixia</firstname><surname>Liu</surname></personname></author>
    <author><personname><firstname>Silvius</firstname><surname>Rus</surname></personname></author>

    <copyright>
      <year>2009</year>
      <holder/>
    </copyright>

    <publisher>
      <publishername>
        Proceedings of the 2009 International Symposium on Code Generation
        and Optimization
      </publishername>
    </publisher>
  </biblioentry>
</bibliography>


</chapter>