# Benchmarking `llvm-libc`'s memory functions

## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:

-   **code size** (to reduce instruction cache pressure),
-   **Profile Guided Optimization** friendliness,
-   **hyperthreading / multithreading** friendliness.

## Rationale

The goal here is to satisfy the [Benchmarking
Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles):

1.  **Relevance**: Benchmarks should measure relatively vital features.
2.  **Representativeness**: Benchmark performance metrics should be broadly
    accepted by industry and academia.
3.  **Equity**: All systems should be fairly compared.
4.  **Repeatability**: Benchmark results can be verified.
5.  **Cost-effectiveness**: Benchmark tests are economical.
6.  **Scalability**: Benchmark tests should measure from a single server to
    multiple servers.
7.  **Transparency**: Benchmark metrics should be easy to understand.

Benchmarking is a [subtle
art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
benchmarking memory functions is no exception. Here we'll dive into the
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.

## Challenges

As seen in the [README.md](README.md#benchmarking-regimes), the
microbenchmarking facility should focus on measuring **low latency code**. If
copying a few bytes takes on the order of a few cycles, the benchmark should be
able to **measure accurately down to the cycle**.

### Measuring instruments

There are different sources of time in a computer (ordered from high to low
resolution):

 - [Performance
   Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used
   to introspect the internals of the CPU,
 - [High Precision Event
   Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
   trigger short-lived actions,
 - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
   to keep track of the computer's time.

In theory **Performance Counters** provide cycle-accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.
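
To illustrate what reading such a counter involves on Linux, here is a minimal
sketch using the `perf_event_open` system call (Linux-specific, error handling
reduced to a minimum; not part of the actual framework):

```cpp
#include <linux/perf_event.h> // perf_event_attr, PERF_COUNT_HW_CPU_CYCLES
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// Opens a `cpu cycles` counter restricted to the calling thread.
static int OpenCycleCounter() {
  perf_event_attr Attr;
  std::memset(&Attr, 0, sizeof(Attr));
  Attr.size = sizeof(Attr);
  Attr.type = PERF_TYPE_HARDWARE;
  Attr.config = PERF_COUNT_HW_CPU_CYCLES;
  Attr.disabled = 1;
  Attr.exclude_kernel = 1;
  // There is no libc wrapper for perf_event_open, hence the raw syscall.
  return syscall(SYS_perf_event_open, &Attr, /*pid=*/0, /*cpu=*/-1,
                 /*group_fd=*/-1, /*flags=*/0);
}

int main() {
  const int Fd = OpenCycleCounter();
  if (Fd < 0) {
    std::fprintf(stderr, "cycle counter unavailable\n");
    return 1;
  }
  ioctl(Fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(Fd, PERF_EVENT_IOC_ENABLE, 0);
  // ... code under measurement goes here ...
  ioctl(Fd, PERF_EVENT_IOC_DISABLE, 0);
  uint64_t Cycles = 0;
  read(Fd, &Cycles, sizeof(Cycles)); // This read is itself a system call.
  std::printf("cycles: %llu\n", static_cast<unsigned long long>(Cycles));
  close(Fd);
  return 0;
}
```

Note that enabling, disabling, and reading the counter are all system calls,
which already hints at the imprecisions discussed below.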

### Performance counters and modern processor architecture

Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor). As a
consequence it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**

### Performance counter semantic inconsistencies and availability

Although they bear the same names, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare two
micro-architectures exposing the same performance counters.**

Each vendor decides which performance counters to implement and what they mean
exactly. Although we want to benchmark `llvm-libc` memory functions for all
available [target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
are **no guarantees that the counter we're interested in is available.**

### Additional imprecisions

-   Reading performance counters is done through kernel [system
    calls](https://en.wikipedia.org/wiki/System_call). The system call itself
    is costly (hundreds of cycles) and will perturb the counter's value.
-   [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
    can occur during measurement.
-   If the system is already under monitoring (virtual machines or system-wide
    profiling) the kernel can decide to multiplex the performance counters,
    leading to lower precision or even to completely missing measurements.
-   The kernel can decide to [migrate the
    process](https://en.wikipedia.org/wiki/Process_migration) to a different
    core.
-   [Dynamic frequency
    scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
    in during the measurement and change the duration of a tick. **Ultimately
    we care about the amount of work done over a period of time**, which
    undermines the case for measuring cycles rather than **raw time**.

### Cycle accuracy conclusion

We have seen that performance counters are not widely available, are
semantically inconsistent across micro-architectures, and are imprecise on
modern CPUs for small snippets of code.

## Design decisions

To achieve the needed precision, we resort to more widely available counters
and derive the time from a high number of runs: going from a single
deterministic measurement to a probabilistic one.

**To get a good signal-to-noise ratio we need the running time of the piece of
code to be orders of magnitude greater than the measurement precision.**

For instance, if the measurement precision is 10 cycles, the function needs to
run for more than 1000 cycles to achieve a 1%
[SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).

### Repeating code N-times until precision is sufficient

The algorithm is as follows:

-   We measure the time it takes to run the code _N_ times (initially _N_ is
    10, for instance).
-   We deduce an approximation of the runtime of one iteration (= _runtime_ /
    _N_).
-   We increase _N_ by _X%_ and repeat the measurement (geometric progression).
-   We keep track of the _one iteration runtime approximation_ and build a
    weighted mean of all the samples so far (the weight is proportional to
    _N_).
-   We stop the process when the difference between the weighted mean and the
    last estimation is smaller than _ε_, or when other stopping conditions are
    met (total runtime, maximum iterations, or maximum sample count).

This method allows us to be as precise as needed provided that the measured
runtime is proportional to _N_. Longer run times also smooth out imprecision
related to _interrupts_ and _context switches_.
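
A sketch of this refinement loop is shown below (the names and stopping
constants are illustrative, not those of the actual implementation; a real
harness would also keep the compiler from optimizing the measured call away):

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Grows the batch size N geometrically and refines a weighted mean of the
// per-iteration estimates until it stabilizes. Illustrative sketch only.
template <typename Fn> double EstimateIterationTime(Fn &&Function) {
  constexpr double Epsilon = 0.01;   // Relative precision target.
  constexpr double Growth = 1.2;     // Increase N by X% between samples.
  constexpr size_t MaxSamples = 100; // One of the stopping conditions.

  double WeightedSum = 0.0; // Sum of (estimate * N) over all samples.
  double TotalWeight = 0.0; // Sum of N over all samples.
  double N = 10.0;          // Initial number of repetitions.

  for (size_t Sample = 0; Sample < MaxSamples; ++Sample) {
    const size_t Repetitions = static_cast<size_t>(N);
    const auto Start = std::chrono::steady_clock::now();
    for (size_t I = 0; I < Repetitions; ++I)
      Function();
    const auto Stop = std::chrono::steady_clock::now();
    const double Runtime = std::chrono::duration<double>(Stop - Start).count();
    const double Estimate = Runtime / Repetitions;

    WeightedSum += Estimate * Repetitions; // Weight is proportional to N.
    TotalWeight += Repetitions;
    const double Mean = WeightedSum / TotalWeight;
    if (std::abs(Mean - Estimate) < Epsilon * Mean)
      return Mean; // Converged within epsilon.
    N *= Growth;   // Geometric progression.
  }
  return WeightedSum / TotalWeight;
}
```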

Note: When measuring longer runtimes (e.g. copying several megabytes of data)
the above assumption doesn't hold anymore and the _ε_ precision cannot be
reached by increasing the number of iterations: the whole benchmarking process
becomes prohibitively slow. In this case the algorithm is limited to a single
sample, repeated several times to get a decent 95% confidence interval.

### Effect of branch prediction

When measuring code with branches, repeating the same call again and again
allows the processor to learn the branching patterns and perfectly predict all
the branches, leading to unrealistic results.

**Decision: When benchmarking small buffer sizes, the function parameters
should be randomized between calls to prevent perfect branch prediction.**
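
As an illustration, a harness might pre-generate a batch of random sizes and
replay them during the measurement (a hypothetical sketch, not the framework's
actual API):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Pre-generates randomized copy sizes so the branch predictor cannot lock
// onto a single control-flow pattern. Illustrative sketch only.
void CallMemcpyWithRandomSizes(char *Dst, const char *Src, size_t MaxSize,
                               size_t Calls) {
  std::mt19937 Rng(12345); // Fixed seed keeps the benchmark repeatable.
  std::uniform_int_distribution<size_t> SizeDist(0, MaxSize);
  std::vector<size_t> Sizes(Calls);
  for (size_t &Size : Sizes)
    Size = SizeDist(Rng);
  // Only this loop would sit inside the measured region; the randomization
  // above happens beforehand.
  for (size_t Size : Sizes)
    std::memcpy(Dst, Src, Size);
}
```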

### Effect of the memory subsystem

The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
`L2` and `L3` data caches.

We may be tempted to randomize data accesses widely to exercise all the caching
layers down to RAM, but the [cost of accessing lower layers of
memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
completely dominates the runtime for small sizes.

So, to respect the **Equity** and **Repeatability** principles, we should make
sure we **do not** depend on the memory subsystem.

**Decision: When benchmarking small buffer sizes, the data accessed by the
function should stay in `L1`.**
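
Concretely, this means confining every access to an arena no larger than the
`L1` data cache, along these lines (the 32KiB value is a typical size and is
assumed here; in practice it should be queried from the host):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Keeps both source and destination inside an L1-sized arena so the
// measurement does not depend on deeper cache levels. Illustrative sketch.
void CopyWithinL1(size_t CopySize, size_t Calls) {
  constexpr size_t L1Size = 32 * 1024; // Typical L1D size, assumed here.
  const size_t HalfSize = L1Size / 2;  // Requires CopySize <= HalfSize.
  std::vector<char> Arena(L1Size);
  std::mt19937 Rng(12345);
  std::uniform_int_distribution<size_t> OffsetDist(0, HalfSize - CopySize);
  for (size_t I = 0; I < Calls; ++I) {
    // A real harness would pre-generate the offsets outside the timed region.
    const size_t Offset = OffsetDist(Rng);
    std::memcpy(Arena.data() + HalfSize + Offset, Arena.data() + Offset,
                CopySize);
  }
}
```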

### Effect of prefetching

For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.

**Decision: When benchmarking large buffer sizes, the data should be accessed
in a random fashion to lower the impact of prefetching between calls.**
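
For instance, instead of sweeping a large buffer linearly, the harness can
visit its chunks in a pre-shuffled order (hypothetical sketch):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <numeric>
#include <random>
#include <vector>

// Touches the chunks of a large buffer in shuffled order so the hardware
// prefetcher cannot anticipate the next access. Illustrative sketch only.
void TouchChunksInRandomOrder(char *Buffer, size_t BufferSize,
                              size_t ChunkSize) {
  std::vector<size_t> Indices(BufferSize / ChunkSize);
  std::iota(Indices.begin(), Indices.end(), 0); // 0, 1, 2, ...
  std::shuffle(Indices.begin(), Indices.end(), std::mt19937(12345));
  for (size_t Index : Indices)
    std::memset(Buffer + Index * ChunkSize, 0, ChunkSize);
}
```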

### Effect of dynamic frequency scaling

Modern processors implement [dynamic frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
`performance` mode the CPU will increase its frequency and run faster than
usual within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost):
_"The increased clock rate is limited by the processor's power, current, and
thermal limits, the number of cores currently in use, and the maximum frequency
of the active cores."_

**Decision: When benchmarking we want to make sure the dynamic frequency
scaling is always set to `performance`. We also want to make sure that the time
based events are not impacted by frequency scaling.**

See [README.md](README.md) on how to set this up.
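
On Linux, the active policy can be sanity-checked by reading sysfs (a minimal
sketch, assuming the conventional cpufreq path and checking `cpu0` only):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
  // Conventional Linux cpufreq path; each core exposes its own governor.
  std::ifstream File("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  std::string Governor;
  File >> Governor;
  if (Governor != "performance")
    std::cerr << "Warning: scaling governor is '" << Governor
              << "', measurements may be unstable.\n";
  return 0;
}
```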

### Reserved and pinned cores

Some operating systems allow [core
reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a number of perturbation sources such as process migration,
context switches, and interrupts. When a core is hyperthreaded, both logical
cores should be reserved.
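
Pinning the benchmark process to such a reserved core can then be done with
`sched_setaffinity` on Linux (sketch; the core id is arbitrary):

```cpp
#include <sched.h> // sched_setaffinity, CPU_ZERO, CPU_SET (Linux-specific).

#include <cstdio>

// Pins the calling process to a single (ideally reserved) core so the kernel
// will not migrate it during measurement. Illustrative sketch only.
bool PinToCore(int CoreId) {
  cpu_set_t Set;
  CPU_ZERO(&Set);
  CPU_SET(CoreId, &Set);
  // A pid of 0 designates the calling process.
  if (sched_setaffinity(/*pid=*/0, sizeof(Set), &Set) != 0) {
    std::perror("sched_setaffinity");
    return false;
  }
  return true;
}
```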

## Microbenchmark limitations

As stated in the Foreword section, a number of effects play a role in
production but are not directly measurable through microbenchmarks. The code
size of the benchmark is (much) smaller than the hot code of real applications
and **doesn't exhibit as much instruction cache pressure**.

### iCache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the
memory functions should be as small as possible.

### iTLB pressure

The same reasoning applies to the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): functions
that span too many pages incur [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ

1.  Why don't you use Google Benchmark directly?

    We reuse some parts of Google Benchmark (detection of frequency scaling,
    CPU cache hierarchy information) but when it comes to measuring memory
    functions Google Benchmark has a few issues:

    -   Google Benchmark favors code-based configuration via macros and
        builders, which is typically done statically. In our case the
        parameters we need to set up are a mix of what's usually controlled by
        the framework (number of trials, maximum number of iterations, size
        ranges) and parameters that are more tied to the function under test
        (randomization strategies, custom values). Achieving this with Google
        Benchmark is cumbersome as it involves templated benchmarks and
        duplicated code. In the end, the configuration would be spread across
        command line flags (via the framework's options or custom flags) and
        code constants.
    -   Output of the measurements is done through a `BenchmarkReporter` class,
        which makes it hard to access the parameters discussed above.