# Machine code analysis tools

## The microarchitecture of modern CPUs

While you might have heard of Instruction Set Architectures, such as `x86` or
`arm` or `mips`, the term _microarchitecture_ (also written here as _µ-arch_)
refers to the internal details of an actual family of CPUs, such as Intel's
_Haswell_ or AMD's _Jaguar_.

Replacing scalar code with SIMD code will generally improve performance on CPUs
supporting the required vector extensions.
However, due to microarchitectural differences, the actual speed-up at
runtime might vary from one CPU to another.

**Example**: a simple example arises when optimizing for AMD K8 CPUs.
The assembly generated for an empty function should look like this:

```asm
nop
ret
```

The `nop` is used to align the `ret` instruction for better performance.
However, the compiler will actually generate the following code:

```asm
repz ret
```

`repz` is a prefix that normally repeats the instruction following it until a
certain condition is met. Of course, in this situation, the function will simply
return immediately, and the `ret` instruction is still aligned.
However, AMD K8's branch predictor performs better with the latter code.

If you want to absolutely maximize performance for a certain target µ-arch,
you will have to read some CPU manuals, or ask the compiler to do it for you
with `-C target-cpu`.

### Summary of CPU internals

Modern processors are able to execute instructions out of order for better
performance, and combine this with tricks such as [branch prediction],
[instruction pipelining], or [superscalar execution].

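
To make the effect of superscalar, out-of-order execution a bit more concrete,
here is a minimal sketch comparing two ways of summing a slice. The function
names `sum_single_chain` and `sum_four_chains` are made up for this
illustration. The first version forms one long dependency chain, while the
second keeps four independent chains that the CPU is free to overlap. Whether
the second version is actually faster depends on the target µ-arch and on what
the compiler does with the loop, so treat it as something to measure, not a
guarantee. Also note that floating-point addition is not associative, so the
two functions may differ in the last bits for general inputs.

```rust
/// Sums `xs` with a single accumulator. Every addition depends on the
/// result of the previous one, so the additions form one serial chain.
pub fn sum_single_chain(xs: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Sums `xs` with four independent accumulators. The four chains do not
/// depend on each other, so an out-of-order, superscalar core may keep
/// several additions in flight at the same time.
pub fn sum_four_chains(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0];
        acc[1] += c[1];
        acc[2] += c[2];
        acc[3] += c[3];
    }
    // Elements left over when the length is not a multiple of four.
    let tail: f64 = chunks.remainder().iter().sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```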

[branch prediction]: https://en.wikipedia.org/wiki/Branch_predictor
[instruction pipelining]: https://en.wikipedia.org/wiki/Instruction_pipelining
[superscalar execution]: https://en.wikipedia.org/wiki/Superscalar_processor

SIMD instructions are also subject to these optimizations, meaning it can get pretty
difficult to determine where a slowdown happens.
For example, if the profiler reports that a store operation is slow, one of two things
could be happening:

- the store is limited by the CPU's memory bandwidth, which is actually an ideal
  scenario, all things considered;

- memory bandwidth is nowhere near its peak, but the value to be stored is at the
  end of a long chain of operations, and this store is where the profiler
  encountered the pipeline stall.

Since most profilers are simple tools which don't understand the subtleties of
instruction scheduling, you need a different kind of tool to tell these cases apart.

## Analyzing the machine code

Certain tools have knowledge of internal CPU microarchitecture, i.e. they know

- how many physical [register files] a CPU actually has

- what the latency and throughput of an instruction are

- what [µ-ops] are generated for a set of instructions

and many other architectural details.

[register files]: https://en.wikipedia.org/wiki/Register_file
[µ-ops]: https://en.wikipedia.org/wiki/Micro-operation

These tools are therefore able to provide accurate information as to why some
instructions are inefficient, and where the bottleneck is.

The disadvantage is that the output of these tools requires advanced knowledge
of the target architecture to understand, i.e. they **cannot** explicitly point
out what the cause of the issue is.

## Intel's Architecture Code Analyzer (IACA)

[IACA] is a free tool offered by Intel for analyzing the performance of various
computational kernels.

Being a proprietary, closed-source tool, it _only_ supports Intel's µ-arches.


[IACA]: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

## llvm-mca

<!--
TODO: once LLVM 7 gets released, write a chapter on using llvm-mca
with SIMD disassembly.
-->
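
Until that chapter exists, here is a rough sketch of how a region of Rust code
could be marked for analysis. llvm-mca recognizes the assembly comments
`# LLVM-MCA-BEGIN` and `# LLVM-MCA-END` as region markers, and one way to get
them into the generated assembly is an inline `asm!` block containing only that
comment. The `dot` kernel below is a hypothetical example, and the markers are
only emitted on x86 targets, where `#` starts an assembly comment. The compiler
is still free to move some code across the markers, so the marked region is
best checked by reading the emitted assembly.

```rust
use std::arch::asm;

/// Hypothetical kernel used only for illustration: a scalar dot product
/// over the common prefix of `a` and `b`.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    // Assembly comment that llvm-mca treats as the start of a region
    // named "dot". It emits no instructions.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    unsafe {
        asm!("# LLVM-MCA-BEGIN dot");
    }

    let mut acc = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        acc += x * y;
    }

    // Assembly comment that closes the region opened above.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    unsafe {
        asm!("# LLVM-MCA-END");
    }

    acc
}
```

Assuming the function lives in a file compiled with `rustc -O --emit asm`, the
resulting `.s` file can then be passed to `llvm-mca`, selecting the µ-arch to
model with `-mcpu` (for example `-mcpu=haswell`).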