1# BeamAsm, the Erlang JIT
2
3BeamAsm provides load-time conversion of Erlang BEAM instructions into
4native code on x86-64. This allows the loader to eliminate any instruction
5dispatching overhead and also specialize each instruction on their argument types.
6
7BeamAsm does hardly any cross instruction optimizations and the x and y
8register arrays work the same as when interpreting BEAM instructions.
9This allows the Erlang run-time system to be largely unchanged except for
10places that need to work with loaded BEAM instructions like code loading,
11tracing, and a few others.
12
13BeamAsm uses [asmjit](https://github.com/asmjit/asmjit) to generate native code
14in run-time. Only small parts of the
15[Assembler API](https://asmjit.com/doc/group__asmjit__assembler.html) of
16[asmjit](https://github.com/asmjit/asmjit) is used. At the moment
17[asmjit](https://github.com/asmjit/asmjit) only supports x86 32/64 bit assembler.
18
19## Loading Code
20
21The code is loaded very similarly to how it is loaded for the interpreter. Each beam
22file is parsed and then optimized through the transformations described in
23[beam_makeops](beam_makeops#defining-transformation-rules). The transformations
24used in BeamAsm are much simpler than the interpreter's, as most of the
25transformations for the interpreter are done only to eliminate the instruction
26dispatch overhead.
27
28Then each instruction is encoded using the C++ functions in the jit/instr_*.cpp files.
29Example:
30
31    void BeamModuleAssembler::emit_is_nonempty_list(const ArgVal &Fail, const ArgVal &Src) {
32      a.test(getArgRef(Src), imm(_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST));
33      a.jne(labels[Fail.getValue()]);
34    }
35
36[asmjit](https://github.com/asmjit/asmjit) provides a fairly straightforward
37mapping from a C++ function call to the x86 assembly instruction. The above
38instruction tests if the value in the `Src` register is a non-empty list and if
39it is not then it jumps to the fail label.
40
41For comparison, the interpreter has 8 combinations and specializations of
42this implementation to minimize the instruction dispatch overhead for
43common patterns.
44
45The original register allocation done by the Erlang compiler is used to manage the
46liveness of values and the physical registers are statically allocated to keep
47the necessary process state. At the moment this is the static register allocation:
48
49    rbx: ErtsSchedulerRegisters struct (contains x/float registers and some metadata)
50    rbp: Active code index
51    r12: Optional Save slot for the Erlang stack pointer when executing C code
52    r13: Current running process
53    r14: Remaining reductions
54    r15: Erlang heap pointer
55
56Note that all of these are callee save registers under the System V and Windows
57ABIs which means that BeamAsm never has to spill any of these when making C
58function calls.
59
60The caller save registers are used as scratch registers within instructions but
61usually do not carry information between them. For some frequent instruction
62sequences such as tuple matching cross instruction optimization *are* done to avoid
63fetching the base address of the tuple in every `get_tuple_element` instruction.
64
65
66### Reducing code size and load time
67
68One of the strengths of the interpreter is that it uses relatively little memory
69for loaded code. This is because the implementation of each loaded instruction is
70shared and only the arguments to the instructions vary. Using as little memory as
71possible has many advantages; less memory is used, loading time decreases,
72higher cache hit-rate.
73
74In BeamAsm we need to achieve something similar since the load-time of a module
75scales almost linearly with the amount of memory it uses. Early BeamAsm prototypes
76used about double the amount of memory for code as the interpreter, while current
77versions use about 10% more. How was this achieved?
78
79In BeamAsm we heavily use shared code fragments to try to emit as much code as
80possible as global shared fragments instead of duplicating the code unnecessarily.
81For instance, the return instruction looks something like this:
82
83    Label yield = a.newLabel();
84
85    a.dec(FCALLS);           /* Decrement reduction counter */
86    a.jl(dispatch_return);   /* Check if we should yield */
87    a.ret();
88
89    a.bind(yield);
90    abs_jmp(ga->get_dispatch_return());
91
92The code above is not exactly what is emitted, but close enough. The thing to note
93is that the code for doing the context switch is never emitted. Instead, we jump
94to a global fragment that all return instructions share. This greatly reduces
95the amount of code that has to be emitted for each module.
96
97## Running Erlang code
98
99Running BeamAsm code is very similar to running the interpreter, except that
100native code is executed instead of interpreted code.
101
102We had to tweak the way the Erlang stack works in order to execute native
103instructions on it. While the interpreter uses a stack slot for
104the current frame's return address (setting it to `[]` when unused), the
105native code merely reserves enough space for it as the x86 `call` and `ret`
106instructions bump the stack pointer when executed.
107
108This only affects the _current stack frame_, and is functionally identical
109aside from two caveats:
110
1111. Exceptions must not be thrown when the return address is reserved.
112
113    It's hard to tell where the stack will end up after an exception; the return
114    address won't be on the stack if we crash in the _current stack frame_, but
115    will be if we crash in a function we call. Telling these apart turned out to
116    rather complicated, so we decided to require the return address to be used
117    when an exception is thrown.
118
119    `emit_handle_error` handles this for you, and shared fragments that have been
120    called (rather than jumped to) satisfy this requirement by default.
121
1222. Garbage collection needs to take return addresses into account.
123
124    If we're about to create a term we have to make sure that there's enough
125    space for this term _and_ a potential return address, or else the next
126    `call` will clobber said term. This is taken care of in `emit_gc_test` and
127    you generally don't need to think about it.
128
129In addition to the above, we ensure that there's always at least `S_REDZONE`
130free words on the stack so we can make calls to shared fragments or trace
131handlers even when we lack a stack frame. This is merely a reservation and has
132no effect on how the stack works, and all values stored there must be valid
133Erlang terms in case of a garbage collection.
134
135## Running C code
136
137As Erlang stacks can be very small, we have to switch over to a different stack
138when we need to execute C code (which may expect a much larger stack). This is
139done through `emit_enter_runtime` and `emit_leave_runtime`, for example:
140
141    mov_arg(ARG4, NumFree);
142
143    /* Move to the C stack and swap out our current reductions, stack-, and
144     * heap pointer to the process structure. */
145    emit_enter_runtime<Update::eReductions | Update::eStack | Update::eHeap>();
146
147    a.mov(ARG1, c_p);
148    load_x_reg_array(ARG2);
149    make_move_patch(ARG3, lambdas[Fun.getValue()].patches);
150
151    /* Call `new_fun`, asserting that we're on the C stack. */
152    runtime_call<4>(new_fun);
153
154    /* Move back to the C stack, and read the updated values from the process
155     * structure */
156    emit_leave_runtime<Update::eReductions | Update::eStack | Update::eHeap>();
157
158    a.mov(getXRef(0), RET);
159
160All combinations of the `Update` constants are legal, but the ones given to
161`emit_leave_runtime` _must_ be the same as those given to `emit_enter_runtime`.
162
163## Tracing and NIF Loading
164
165To make tracing and NIF loading work there needs to be a way to intercept
166any function call. In the interpreter, this is done by rewriting the loaded BEAM code,
167but this is more complicated in BeamAsm as we want to have a fast and compact way to
168do this. This is solved by emitting the code below at the start of each function:
169
170    0x0: jmp 6
171    0x2: ERTS_ASM_BP_FLAG_NONE
172    0x3: relative near call
173    0x4: &genericBPTramp
174    0x8: actual code for the function
175
176When code starts to execute it will simply see the `jmp 6` instruction
177which skips the prologue and starts to execute the code directly.
178
179When we want to enable a certain break point we set the `jmp` target to
180be 1 (which means it will land on the call instruction) and will call
181genericBPTramp. genericBPTramp is a label at the top of each module
182that contains [trampolines][1] for all flag combinations.
183
184[1]: https://en.wikipedia.org/wiki/Trampoline_(computing)
185
186    genericBPTramp:
187
188    0x0: ret
189    0x10: jmp call_nif_early
190    0x20: call generic_bp_local
191    0x30: call generic_bp_local
192    0x35: jmp call_nif_early
193
194Note that each target is 16 byte aligned. This is because the call target
195in the function prologue is updated to target the correct place when a flag
196is updated. So if CALL\_NIF\_EARLY is set, then it is updated to be
197genericBPTramp + 0x10. If BP is set, it is updated to genericBPTramp + 0x20
198and the combination makes it to be genericBPTramp + 0x30.
199
200### Updating code
201
202Because many environments enforce [W^X] it's not always possible to write
203directly to the code pages. Because of this we map code twice: once with an
204executable page and once with a writable page. Since they're backed by the
205same memory, writes to the writable page appear magically in the executable
206one.
207
208The `erts_writable_code_ptr` function can be used to get writable pointers,
209given a module instance:
210
211    for (i = 0; i < n; ++i) {
212        ErtsCodeInfo* ci;
213        void *w_ptr;
214
215        w_ptr = erts_writable_code_ptr(&modp->curr,
216                                       code_hdr->functions[i]);
217        ci = (ErtsCodeInfo*)w_ptr;
218
219        uninstall_breakpoint(ci);
220        consolidate_bp_data(modp, ci, 1);
221        ASSERT(ci->u.gen_bp == NULL);
222    }
223
224Without the module instance there's no reliable way to figure out the writable
225address of a code page, and we rely on _address space layout randomization_
226(ASLR) to make it difficult to guess.
227
228### Export tracing
229
230Unlike the interpreter, we don't execute code inside export entries as that's
231very annoying to do in the face of [W^X]. When tracing is enabled, we instead
232point to a fragment that looks at the current export entry and decides what to
233do.
234
235This fragment is shared between all export entries, and the export entry to
236operate on is assumed to be in a certain register (`RET` as of writing). This
237means that all remote calls _must_ place the export entry in said register,
238even when we don't know beforehand that the call is remote, such as when
239calling a fun.
240
241This is pretty easy to do in assembler and the `emit_setup_export_call` helper
242handles it nicely for us, but we can't set registers when trapping out from C
243code. When trapping to an export entry from C code one must set `c_p->current`
244to the `ErtsCodeMFA` inside the export entry in question, and then set `c_p->i`
245to `beam_bif_export_trap`.
246
247The `BIF_TRAP` macros handle this for you, so you generally don't need to
248think about it.
249
250[W^X]: https://en.wikipedia.org/wiki/W%5EX
251
252## Description of each file
253
254The BeamAsm implementation resides in the `$ERL_TOP/erts/emulator/beam/jit` folder.
255The files are:
256
257* `load.h`
258    * BeamAsm specific header for loading code
259* `asm_load.c`
260    * BeamAsm specific functions for loading code
261* `generators.tab`, `predicates.tab`, `ops.tab`
262    * BeamAsm specific transformations for instructions. See [beam_makeops](beam_makeops) for
263      more details.
264* `beam_asm.h`
265    * Header file describing the C -> C++ api
266* `beam_asm.hpp`
267    * Header file describing the structs and classes used by BeamAsm.
268* `beam_asm.cpp`
269    * Implementation of the main process loop
270    * The BeamAsm initialization code
271    * The C -> C++ interface functions.
272* `beam_asm_module.cpp`
273    * The code for the BeamAsm module code generator logic
274* `beam_asm_global.cpp`
275    * Global code fragments that are used by multiple instructions, e.g. error handling code.
276* `instr_*.cpp`
277    * Implementation of individual instructions grouped into files by area
278* `beam_asm_perf.cpp`
279    * The linux perf support for BeamAsm
280
281## Linux perf support
282
283perf can also be instrumented using BeamAsm symbols to provide more information. As with
284gdb, only the currently executing function will show up in the stack trace, which means
285that perf provides functionality similar to that of [eprof](https://erlang.org/doc/man/eprof.html).
286
287You can run perf on BeamAsm like this:
288
289    perf record erl +JPperf true
290
291and then look at the results using `perf report` as you normally would with perf.
292
293If you want to get some context to you calls you cann use the [lbr](https://lwn.net/Articles/680985/)
294call-graph option to `perf record`. Using `lbr` is not perfect (for instance you
295do not get any syscalls in the context), but it work well enough.
296For example, you can run perf to analyze dialyzer building a PLT like this:
297
298     ERL_FLAGS="+JPperf true +S 1" perf record --call-graph lbr \
299       dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
300       syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
301       sasl runtime_tools snmp ssl tftp wx xmerl tools
302
303The above code is run using `+S 1` to make the perf output easier to understand.
304If you then run `perf report -f --no-children` you may get something similar to this:
305
306![Linux Perf report: dialyzer PLT build](figures/perf-beamasm.png)
307
308Any Erlang function in the report is prefixed with a `$` and all C functions have
309their normal names. Any Erlang function that has the prefix `$global::` refers
310to a global shared fragment.
311
312So in the above, we can see that we spend the most time doing `eq`, i.e. comparing two terms.
313By expanding it and looking at its parents we can see that it is the function
314`erl_types:t_is_equal/2` that contributes the most to this value. Go and have a look
315at it in the source code to see if you can figure out why so much time is spent there.
316
317After `eq` we see the function `erl_types:t_has_var/1` where we spend almost
3186% of the entire execution in. A while further down you can see `copy_struct` which
319is the function used to copy terms. If we expand it to view the parents we find that
320it is mostly `ets:lookup_element/3` that contributes to this time via the Erlang
321function `dialyzer_plt:ets_table_lookup/2`.
322
323### Flame Graph
324
325You can also create a Flame Graph from the perf output. Flame Graphs are basically
326just another way to look at the same data as the `perf report` output, but can
327be more easily shared with others and manipulated to give a graph tailor-made for
328your needs. For instance, if we run dialyzer with all schedulers:
329
330    ## Run dialyzer with multiple schedulers
331    ERL_FLAGS="+JPperf true" perf record --call-graph lbr \
332      dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
333      syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
334      sasl runtime_tools snmp ssl tftp wx xmerl tools --statistics
335
336And then use the scripts found at Brendan Gregg's [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs)
337web page as follows:
338
339    ## Collect the results
340    perf script > out.perf
341    ## run stackcollapse
342    stackcollapse-perf.pl out.perf > out.folded
343    ## Create the svg
344    flamegraph.pl out.folded > out.svg
345
346We get a graph that would look something like this:
347
348![Linux Perf FlameGraph: dialyzer PLT build](figures/perf-beamasm.svg)
349
350You can view a larger version [here](seefile/figures/perf-beamasm.svg). It contains
351the same information, but it is easier to share with others as it does
352not need the symbols in the executable.
353
354Using the same data we can also produce a graph where the scheduler profile data
355has been merged by using `sed`:
356
357    ## Strip [0-9]+_ from all scheduler names
358    sed -e 's/^[0-9]\+_//' out.folded > out.folded_sched
359    ## Create the svg
360    flamegraph.pl out.folded_sched > out_sched.svg
361
362![Linux Perf FlameGraph: dialyzer PLT build](figures/perf-beamasm-merged.svg)
363
364You can view a larger version [here](seefile/figures/perf-beamasm-merged.svg).
365There are many different transformations that you can do to make the graph show
366you what you want.
367
368### Annotate perf functions
369
370If you want to be able to use the `perf annotate` functionality (and in extention
371the annotate functionality in the `perf report` gui) you need to use a monotonic
372clock when calling `perf record`, i.e. `perf record -k mono`. So for a dialyzer
373run you would do this:
374
375    ERL_FLAGS="+JPperf true +S 1" perf record -k mono --call-graph lbr \
376      dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
377      syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
378      sasl runtime_tools snmp ssl tftp wx xmerl tools
379
380In order to use the `perf.data` produced by this record you need to first call
381`perf inject --jit` like this:
382
383    perf inject --jit -i perf.data -o perf.jitted.data
384
385and then you can view an annotated function like this:
386
387    perf annotate -M intel -i perf.jitted.data erl_types:t_has_var/1
388
389or by pressing `a` in the `perf report` ui. Then you get something like this:
390
391![Linux Perf FlameGraph: dialyzer PLT build](figures/beamasm-perf-annotate.png)
392
393> *WARNING*: Calling `perf inject --jit` will create a lot of files in `/tmp/`
394> and in `~/.debug/tmp/`. So make sure to cleanup in those directories from time to
395> time or you may run out of inodes.
396
397### perf tips and tricks
398
399You can do a lot of neat things with `perf`. Below is a list of some of the options
400we have found useful:
401
402* `perf report --no-children`
403    Do not include the accumulation of all children in a call.
404* `perf report  --call-graph callee`
405    Show the callee rather than the caller when expanding a function call.
406* `perf archive`
407    Create an archive with all the artifacts needed to inspect the data
408    on another host. In early version of perf this command does not work,
409    instead you can use [this bash script](https://github.com/torvalds/linux/blob/master/tools/perf/perf-archive.sh).
410* `perf report` gives "failed to process sample" and/or "failed to process type: 68"
411    This probably means that you are running a bugge version of perf. We have
412    seen this when running Ubuntu 18.04 with kernel version 4. If you update
413    to Ubuntu 20.04 or use Ubuntu 18.04 with kernel version 5 the problem
414    should go away.
415
416## FAQ
417
418### How do I know that I'm running a JIT enabled Erlang?
419
420You will see a banner containing `[jit]` shell when you start. You can also use
421`erlang:system_info(emu_flavor)` to check the flavor and it should be `jit`.
422
423There are three major reasons why when building Erlang/OTP you would not get the JIT.
424
425* You are not building x86 64-bit
426* You do not have a C++ compiler that supports C++-17
427* You do not have an OS that supports executable *and* writable memory
428
429If you run `./configure --enable-jit` configure will abort when it discovers that
430your system cannot build the JIT.
431
432### Is the interpreter still available?
433
434Yes, you can still build the interpreter if you want to. In fact, it is what is used
435on platforms where BeamAsm does not yet work. You can either completely disable
436BeamAsm by passing `--disable-jit` to configure. Or you can build the
437interpreter using `make FLAVOR=emu` and then run it using `erl -emu_flavor emu`.
438
439It is possible to have both the JIT and interpreter available at the same time.
440
441### How much of a speedup should I expect from BeamAsm compared to the interpreter?
442
443It depends a lot on what your application does. Anything from no difference to up to
444four times as fast is possible.
445
446BeamAsm tries very hard to not be slower than the interpreter, but there can be cases
447when that happens. One such could be very short-lived small scripts. If you come across
448any scenarios when this happens, please open a bug report at
449[the Erlang/OTP bug tracker](https://github.com/erlang/otp/issues).
450
451### Would it be possible to add support for BeamAsm on ARM?
452
453Any new architecture needs support in the assembler as well. Since we use
454[asmjit](https://github.com/asmjit/asmjit) for this, that means we need support
455in [asmjit](https://github.com/asmjit/asmjit). BeamAsm uses relatively few
456instructions (mostly, `mov`, `jmp`, `cmp`, `sub`, `add`), so it would not need to have
457full support of all ARM instructions.
458
459Another approach would be to not use [asmjit](https://github.com/asmjit/asmjit)
460for ARM, but instead, use something different to assemble code during load-time.
461
462### Would it be possible to add support for BeamAsm on another OS?
463
464Adding a new OS that runs x86-64 should not need any large changes if
465the OS supports mapping of memory as executable. If the ABI used by the
466OS is not supported changes related to calling C-functions also have to
467be made.
468
469As a reference, it took us about 2-3 weeks to implement support for Windows.
470
471### Would it be possible to add support in perf to better crawl the Erlang stack?
472
473Yes, though not easily.
474
475Using `perf --call-graph lbr` works for Erlang, but it does not give a
476perfect record as the buffer has a limited size.
477