# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

- **code size** (to reduce instruction cache pressure),
- **Profile Guided Optimization** friendliness,
- **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles):

1. **Relevance**: Benchmarks should measure relatively vital features.
2. **Representativeness**: Benchmark performance metrics should be broadly
   accepted by industry and academia.
3. **Equity**: All systems should be fairly compared.
4. **Repeatability**: Benchmark results can be verified.
5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from a single server to
   multiple servers.
7. **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in the [README.md](README.md#benchmarking-regimes), the
microbenchmarking facility should focus on measuring **low latency code**. If
copying a few bytes takes on the order of a few cycles, the benchmark should be
able to **measure accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

- [Performance
  Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used
  to introspect the internals of the CPU,
- [High Precision Event
  Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
  trigger short-lived actions,
- [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
  to keep track of the computer's time.

In theory **Performance Counters** provide cycle-accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.

### Performance counter semantics inconsistencies and availability

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight** while others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantics inconsistencies and availability

Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and what exactly
they mean. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**
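For instance, on Linux the only way to find out whether the `cpu cycles` event
is usable is to try to open it. The sketch below is purely illustrative and not
part of the benchmarking framework; note that `perf_event_open` has no libc
wrapper and must be invoked through `syscall(2)`:

```cpp
#include <cstring>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

// Illustrative, Linux-only probe: returns true if the hardware cpu-cycles
// counter can be opened for the calling process.
bool isCpuCyclesCounterAvailable() {
  perf_event_attr Attr;
  std::memset(&Attr, 0, sizeof(Attr));
  Attr.size = sizeof(Attr);
  Attr.type = PERF_TYPE_HARDWARE;
  Attr.config = PERF_COUNT_HW_CPU_CYCLES;
  Attr.disabled = 1;       // Don't start counting right away.
  Attr.exclude_kernel = 1; // Count user-space cycles only.
  // A negative return value means the event is unavailable: unsupported
  // hardware, a virtualized environment, or missing permissions.
  const long Fd = syscall(SYS_perf_event_open, &Attr, /*pid=*/0, /*cpu=*/-1,
                          /*group_fd=*/-1, /*flags=*/0UL);
  if (Fd < 0)
    return false;
  close(static_cast<int>(Fd));
  return true;
}
```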
### Additional imprecisions

- Reading performance counters is done through kernel [system
  calls](https://en.wikipedia.org/wiki/System_call). The system call itself
  is costly (hundreds of cycles) and will perturb the counter's value.
- [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
  can occur during measurement.
- If the system is already under monitoring (virtual machines or system-wide
  profiling) the kernel can decide to multiplex the performance counters,
  leading to lower precision or even completely missing the measurement.
- The kernel can decide to [migrate the
  process](https://en.wikipedia.org/wiki/Process_migration) to a different
  core.
- [Dynamic frequency
  scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
  in during the measurement and change the ticking duration. **Ultimately we
  care about the amount of work done over a period of time.** This removes
  some legitimacy from measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not widely available, are
semantically inconsistent across micro-architectures, and are imprecise on
modern CPUs for small snippets of code.

## Design decisions

To achieve the needed precision we have to rely on more widely available
counters and derive the time from a high number of runs: going from a single
deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, the function under
test needs to run for more than 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).

### Repeating code N times until precision is sufficient

The algorithm is as follows (a sketch is given after this list):

- We measure the time it takes to run the code _N_ times (initially _N_ is 10,
  for instance).
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
  _N_).
- We increase _N_ by _X%_ and repeat the measurement (geometric progression).
- We keep track of the _one-iteration runtime approximation_ and build a
  weighted mean of all the samples so far (the weight is proportional to _N_).
- We stop the process when the difference between the weighted mean and the
  last estimation is smaller than _ε_, or when other stopping conditions are
  met (total runtime, maximum iterations, or maximum sample count).

This method allows us to be as precise as needed, provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.

Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing iterations: the whole benchmarking process becomes
prohibitively slow. In this case the algorithm is limited to a single sample
that is repeated several times to get a decent 95% confidence interval.
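The following is a minimal sketch of that adaptive loop, not the actual
`llvm-libc` implementation; the function name, the growth factor, and the
stopping threshold are all hypothetical, and the timer is assumed to be a
steady clock:

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical sketch of the adaptive sampling loop described above. A real
// implementation would also bound total runtime, iterations and sample count.
template <typename Function> double estimateSecondsPerCall(Function &&F) {
  using Clock = std::chrono::steady_clock;
  const double Epsilon = 0.01; // Relative stopping threshold (the "ε" above).
  const double Growth = 1.4;   // Geometric progression of N.
  double WeightedSum = 0;      // Sum of per-call estimates, weighted by N.
  double TotalWeight = 0;      // Sum of all Ns seen so far.
  double Mean = 0;             // Current weighted mean.
  for (uint64_t N = 10;; N = static_cast<uint64_t>(N * Growth)) {
    const auto Start = Clock::now();
    for (uint64_t I = 0; I < N; ++I)
      F(); // In a real benchmark, keep the result observable so the
           // compiler cannot elide the call.
    const std::chrono::duration<double> Elapsed = Clock::now() - Start;
    const double SecondsPerCall = Elapsed.count() / N;
    WeightedSum += SecondsPerCall * N; // Weight is proportional to N.
    TotalWeight += N;
    const double NewMean = WeightedSum / TotalWeight;
    // Stop once the estimate is stable within Epsilon.
    if (Mean != 0 && std::fabs(NewMean - Mean) < Epsilon * Mean)
      return NewMean;
    Mean = NewMean;
  }
}
```

Because the weight is proportional to _N_, the later (longer, less noisy)
samples dominate the weighted mean, which is what drives the estimate toward
the true per-call runtime.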
### Effect of branch prediction

When measuring code with branches, repeating the same call again and again will
allow the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters should
be randomized between calls to prevent perfect branch prediction.**

### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So to respect the **Equity** and **Repeatability** principles we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**

### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed in
a random fashion to lower the impact of prefetching between calls.**

### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In the
so-called `performance` mode the CPU will increase its frequency and run faster
than usual within [some
limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The increased clock
rate is limited by the processor's power, current, and thermal limits, the
number of cores currently in use, and the maximum frequency of the active
cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time based
events are not impacted by frequency scaling.**

See [README.md](README.md) on how to set this up.

### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process):
it removes a set of perturbation sources such as process migration, context
switches, and interrupts. When a core is hyperthreaded, both sibling
hyperthreads should be reserved.
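For example, on Linux the benchmarking process can pin itself to such a
reserved core (typically one isolated through the kernel's `isolcpus=` boot
parameter). This is a minimal sketch; the core id to use is machine-specific:

```cpp
#include <sched.h> // Linux-specific; the CPU_* macros may require _GNU_SOURCE.

// Pins the calling thread to the given core so the scheduler cannot migrate
// it. Returns true on success. The core id is machine-specific and would
// normally come from the benchmark's configuration.
bool pinToCore(int CoreId) {
  cpu_set_t Set;
  CPU_ZERO(&Set);
  CPU_SET(CoreId, &Set);
  // A pid of 0 designates the calling thread.
  return sched_setaffinity(/*pid=*/0, sizeof(Set), &Set) == 0;
}
```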
## Microbenchmark limitations

As stated in the Foreword section, a number of effects play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit as much instruction cache pressure**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning goes for the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)), which can
incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

1. Why don't you use Google Benchmark directly?

   We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
   cache hierarchy information), but when it comes to measuring memory
   functions Google Benchmark has a few issues:

   - Google Benchmark privileges code-based configuration via macros and
     builders, which is typically done in a static manner. In our case the
     parameters we need to set up are a mix of what's usually controlled by
     the framework (number of trials, maximum number of iterations, size
     ranges) and parameters that are more tied to the function under test
     (randomization strategies, custom values). Achieving this with Google
     Benchmark is cumbersome, as it involves templated benchmarks and
     duplicated code. In the end, the configuration would be spread across
     command line flags (via the framework's options or custom flags) and
     code constants.
   - Output of the measurements is done through a `BenchmarkReporter` class,
     which makes it hard to access the parameters discussed above.