# Machine code analysis tools

## The microarchitecture of modern CPUs

While you might have heard of Instruction Set Architectures, such as `x86`,
`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_)
refers to the internal details of an actual family of CPUs, such as Intel's
_Haswell_ or AMD's _Jaguar_.

Replacing scalar code with SIMD code will usually improve performance on CPUs
supporting the required vector extensions.
However, due to microarchitectural differences, the actual speed-up at
runtime can vary from one CPU family to another.

**Example**: a simple case arises when optimizing for AMD K8 CPUs.
One would expect the assembly generated for an empty function to look like this:

```asm
nop
ret
```

The `nop` is used to align the `ret` instruction for better performance.
However, the compiler will actually generate the following code:

```asm
repz ret
```

`repz` is a prefix which normally repeats the following string instruction
until a certain condition is met. In this situation it has no effect on `ret`:
the function will simply return immediately, and the `ret` instruction is
still aligned.
However, AMD K8's branch predictor performs better with the latter code.

If you are looking to absolutely maximize performance for a certain target µ-arch,
you will have to read some CPU manuals, or ask the compiler to do it for you
with `-C target-cpu`.

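One visible effect of `-C target-cpu` is the set of target features the compiler is allowed to assume. The following is a minimal sketch, not part of the original text, which reports a few x86 SIMD extensions enabled at compile time. The function name `compile_time_simd_features` and the selection of features are assumptions made for illustration.

```rust
/// Hypothetical helper: lists a few x86 SIMD extensions that were
/// enabled when this code was compiled (for example through
/// `-C target-cpu` or `-C target-feature`).
pub fn compile_time_simd_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // `cfg!` is resolved by the compiler, so this reflects the build
    // configuration, not the CPU the program happens to run on.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "sse4.2") {
        enabled.push("sse4.2");
    }
    if cfg!(target_feature = "avx") {
        enabled.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    enabled
}
```

Building the same source with and without `-C target-cpu=native` and comparing the returned list should show which extensions the compiler was allowed to use in each build.
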
### Summary of CPU internals

Modern processors are able to execute instructions out-of-order for better performance,
and combine this with tricks such as [branch prediction], [instruction pipelining],
or [superscalar execution].

[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor

SIMD instructions are also subject to these optimizations, which can make it quite
difficult to determine where a slowdown actually happens.
For example, if the profiler reports that a store operation is slow, one of two things
could be happening:

- the store is limited by the CPU's memory bandwidth, which is actually an ideal
  scenario, all things considered;

- memory bandwidth is nowhere near its peak, but the value to be stored is at the
  end of a long chain of operations, and this store is where the profiler
  encountered the pipeline stall.

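The second case can be illustrated with a small sketch, which is not from the original text. Both functions below are hypothetical examples: the first builds one long dependency chain, so each addition has to wait for the previous one, while the second keeps four independent chains that an out-of-order, superscalar core may be able to overlap. Whether the second version is actually faster depends on the CPU and on what the compiler does with the code, and because the additions are reordered, the floating-point results can differ slightly.

```rust
/// Sums the slice with a single accumulator: every addition depends on
/// the result of the previous one, forming one long dependency chain.
pub fn sum_single_chain(values: &[f32]) -> f32 {
    let mut acc = 0.0;
    for &v in values {
        acc += v;
    }
    acc
}

/// Sums the slice with four independent accumulators, so that four
/// shorter dependency chains exist instead of one long chain.
pub fn sum_four_chains(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let mut chunks = values.chunks_exact(4);
    for chunk in &mut chunks {
        for lane in 0..4 {
            acc[lane] += chunk[lane];
        }
    }
    // Elements that did not fill a whole chunk of four.
    let mut tail = 0.0;
    for &v in chunks.remainder() {
        tail += v;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

If the result of `sum_single_chain` is written to memory, a profiler may attribute the waiting time to that final store, even though the time was spent in the chain of additions leading up to it.
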
Since most profilers are simple tools which don't understand the subtleties of
instruction scheduling, you

## Analyzing the machine code

Certain tools have knowledge of internal CPU microarchitecture, i.e. they know

- how many physical [register files] a CPU actually has

- what the latency / throughput of an instruction is

- what [µ-ops] are generated for a set of instructions

and many other architectural details.

[register files]: https://en.wikipedia.org/wiki/Register_file
[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation

These tools are therefore able to provide accurate information as to why some
instructions are inefficient, and where the bottleneck is.

The disadvantage is that the output of these tools requires advanced knowledge
of the target architecture to understand, i.e. they **cannot** explicitly point
out the cause of the issue.

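Such tools work on the machine code of a specific kernel, so that code has to be obtained first. The following is a minimal sketch under stated assumptions, not a prescribed workflow: the `dot` function is a hypothetical kernel, and `#[inline(never)]` is used so that it is kept as a separate function in the compiler's output.

```rust
/// Hypothetical kernel to be analyzed. Keeping it out-of-line makes it
/// appear as its own function in the generated assembly instead of
/// being merged into its callers.
#[inline(never)]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    // Multiply the elements pairwise and add up the products.
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```

Assuming the function lives in a file compiled as a library, `rustc -C opt-level=3 --crate-type=lib --emit asm` writes the assembly to a `.s` file, in which the kernel can then be located by its (mangled) symbol name.
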
## Intel's Architecture Code Analyzer (IACA)

[IACA] is a free tool offered by Intel for analyzing the performance of various
computational kernels.

Being a proprietary, closed-source tool, it _only_ supports Intel's µ-arches.

[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

## llvm-mca

<!--
TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
with SIMD disassembly.
-->
