1# BeamAsm, the Erlang JIT 2 3BeamAsm provides load-time conversion of Erlang BEAM instructions into 4native code on x86-64. This allows the loader to eliminate any instruction 5dispatching overhead and also specialize each instruction on their argument types. 6 7BeamAsm does hardly any cross instruction optimizations and the x and y 8register arrays work the same as when interpreting BEAM instructions. 9This allows the Erlang run-time system to be largely unchanged except for 10places that need to work with loaded BEAM instructions like code loading, 11tracing, and a few others. 12 13BeamAsm uses [asmjit](https://github.com/asmjit/asmjit) to generate native code 14in run-time. Only small parts of the 15[Assembler API](https://asmjit.com/doc/group__asmjit__assembler.html) of 16[asmjit](https://github.com/asmjit/asmjit) is used. At the moment 17[asmjit](https://github.com/asmjit/asmjit) only supports x86 32/64 bit assembler. 18 19## Loading Code 20 21The code is loaded very similarly to how it is loaded for the interpreter. Each beam 22file is parsed and then optimized through the transformations described in 23[beam_makeops](beam_makeops#defining-transformation-rules). The transformations 24used in BeamAsm are much simpler than the interpreter's, as most of the 25transformations for the interpreter are done only to eliminate the instruction 26dispatch overhead. 27 28Then each instruction is encoded using the C++ functions in the jit/instr_*.cpp files. 29Example: 30 31 void BeamModuleAssembler::emit_is_nonempty_list(const ArgVal &Fail, const ArgVal &Src) { 32 a.test(getArgRef(Src), imm(_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)); 33 a.jne(labels[Fail.getValue()]); 34 } 35 36[asmjit](https://github.com/asmjit/asmjit) provides a fairly straightforward 37mapping from a C++ function call to the x86 assembly instruction. The above 38instruction tests if the value in the `Src` register is a non-empty list and if 39it is not then it jumps to the fail label. 40 41For comparison, the interpreter has 8 combinations and specializations of 42this implementation to minimize the instruction dispatch overhead for 43common patterns. 44 45The original register allocation done by the Erlang compiler is used to manage the 46liveness of values and the physical registers are statically allocated to keep 47the necessary process state. At the moment this is the static register allocation: 48 49 rbx: ErtsSchedulerRegisters struct (contains x/float registers and some metadata) 50 rbp: Active code index 51 r12: Optional Save slot for the Erlang stack pointer when executing C code 52 r13: Current running process 53 r14: Remaining reductions 54 r15: Erlang heap pointer 55 56Note that all of these are callee save registers under the System V and Windows 57ABIs which means that BeamAsm never has to spill any of these when making C 58function calls. 59 60The caller save registers are used as scratch registers within instructions but 61usually do not carry information between them. For some frequent instruction 62sequences such as tuple matching cross instruction optimization *are* done to avoid 63fetching the base address of the tuple in every `get_tuple_element` instruction. 64 65 66### Reducing code size and load time 67 68One of the strengths of the interpreter is that it uses relatively little memory 69for loaded code. This is because the implementation of each loaded instruction is 70shared and only the arguments to the instructions vary. Using as little memory as 71possible has many advantages; less memory is used, loading time decreases, 72higher cache hit-rate. 73 74In BeamAsm we need to achieve something similar since the load-time of a module 75scales almost linearly with the amount of memory it uses. Early BeamAsm prototypes 76used about double the amount of memory for code as the interpreter, while current 77versions use about 10% more. How was this achieved? 78 79In BeamAsm we heavily use shared code fragments to try to emit as much code as 80possible as global shared fragments instead of duplicating the code unnecessarily. 81For instance, the return instruction looks something like this: 82 83 Label yield = a.newLabel(); 84 85 a.dec(FCALLS); /* Decrement reduction counter */ 86 a.jl(dispatch_return); /* Check if we should yield */ 87 a.ret(); 88 89 a.bind(yield); 90 abs_jmp(ga->get_dispatch_return()); 91 92The code above is not exactly what is emitted, but close enough. The thing to note 93is that the code for doing the context switch is never emitted. Instead, we jump 94to a global fragment that all return instructions share. This greatly reduces 95the amount of code that has to be emitted for each module. 96 97## Running Erlang code 98 99Running BeamAsm code is very similar to running the interpreter, except that 100native code is executed instead of interpreted code. 101 102We had to tweak the way the Erlang stack works in order to execute native 103instructions on it. While the interpreter uses a stack slot for 104the current frame's return address (setting it to `[]` when unused), the 105native code merely reserves enough space for it as the x86 `call` and `ret` 106instructions bump the stack pointer when executed. 107 108This only affects the _current stack frame_, and is functionally identical 109aside from two caveats: 110 1111. Exceptions must not be thrown when the return address is reserved. 112 113 It's hard to tell where the stack will end up after an exception; the return 114 address won't be on the stack if we crash in the _current stack frame_, but 115 will be if we crash in a function we call. Telling these apart turned out to 116 rather complicated, so we decided to require the return address to be used 117 when an exception is thrown. 118 119 `emit_handle_error` handles this for you, and shared fragments that have been 120 called (rather than jumped to) satisfy this requirement by default. 121 1222. Garbage collection needs to take return addresses into account. 123 124 If we're about to create a term we have to make sure that there's enough 125 space for this term _and_ a potential return address, or else the next 126 `call` will clobber said term. This is taken care of in `emit_gc_test` and 127 you generally don't need to think about it. 128 129In addition to the above, we ensure that there's always at least `S_REDZONE` 130free words on the stack so we can make calls to shared fragments or trace 131handlers even when we lack a stack frame. This is merely a reservation and has 132no effect on how the stack works, and all values stored there must be valid 133Erlang terms in case of a garbage collection. 134 135## Running C code 136 137As Erlang stacks can be very small, we have to switch over to a different stack 138when we need to execute C code (which may expect a much larger stack). This is 139done through `emit_enter_runtime` and `emit_leave_runtime`, for example: 140 141 mov_arg(ARG4, NumFree); 142 143 /* Move to the C stack and swap out our current reductions, stack-, and 144 * heap pointer to the process structure. */ 145 emit_enter_runtime<Update::eReductions | Update::eStack | Update::eHeap>(); 146 147 a.mov(ARG1, c_p); 148 load_x_reg_array(ARG2); 149 make_move_patch(ARG3, lambdas[Fun.getValue()].patches); 150 151 /* Call `new_fun`, asserting that we're on the C stack. */ 152 runtime_call<4>(new_fun); 153 154 /* Move back to the C stack, and read the updated values from the process 155 * structure */ 156 emit_leave_runtime<Update::eReductions | Update::eStack | Update::eHeap>(); 157 158 a.mov(getXRef(0), RET); 159 160All combinations of the `Update` constants are legal, but the ones given to 161`emit_leave_runtime` _must_ be the same as those given to `emit_enter_runtime`. 162 163## Tracing and NIF Loading 164 165To make tracing and NIF loading work there needs to be a way to intercept 166any function call. In the interpreter, this is done by rewriting the loaded BEAM code, 167but this is more complicated in BeamAsm as we want to have a fast and compact way to 168do this. This is solved by emitting the code below at the start of each function: 169 170 0x0: jmp 6 171 0x2: ERTS_ASM_BP_FLAG_NONE 172 0x3: relative near call 173 0x4: &genericBPTramp 174 0x8: actual code for the function 175 176When code starts to execute it will simply see the `jmp 6` instruction 177which skips the prologue and starts to execute the code directly. 178 179When we want to enable a certain break point we set the `jmp` target to 180be 1 (which means it will land on the call instruction) and will call 181genericBPTramp. genericBPTramp is a label at the top of each module 182that contains [trampolines][1] for all flag combinations. 183 184[1]: https://en.wikipedia.org/wiki/Trampoline_(computing) 185 186 genericBPTramp: 187 188 0x0: ret 189 0x10: jmp call_nif_early 190 0x20: call generic_bp_local 191 0x30: call generic_bp_local 192 0x35: jmp call_nif_early 193 194Note that each target is 16 byte aligned. This is because the call target 195in the function prologue is updated to target the correct place when a flag 196is updated. So if CALL\_NIF\_EARLY is set, then it is updated to be 197genericBPTramp + 0x10. If BP is set, it is updated to genericBPTramp + 0x20 198and the combination makes it to be genericBPTramp + 0x30. 199 200### Updating code 201 202Because many environments enforce [W^X] it's not always possible to write 203directly to the code pages. Because of this we map code twice: once with an 204executable page and once with a writable page. Since they're backed by the 205same memory, writes to the writable page appear magically in the executable 206one. 207 208The `erts_writable_code_ptr` function can be used to get writable pointers, 209given a module instance: 210 211 for (i = 0; i < n; ++i) { 212 ErtsCodeInfo* ci; 213 void *w_ptr; 214 215 w_ptr = erts_writable_code_ptr(&modp->curr, 216 code_hdr->functions[i]); 217 ci = (ErtsCodeInfo*)w_ptr; 218 219 uninstall_breakpoint(ci); 220 consolidate_bp_data(modp, ci, 1); 221 ASSERT(ci->u.gen_bp == NULL); 222 } 223 224Without the module instance there's no reliable way to figure out the writable 225address of a code page, and we rely on _address space layout randomization_ 226(ASLR) to make it difficult to guess. 227 228### Export tracing 229 230Unlike the interpreter, we don't execute code inside export entries as that's 231very annoying to do in the face of [W^X]. When tracing is enabled, we instead 232point to a fragment that looks at the current export entry and decides what to 233do. 234 235This fragment is shared between all export entries, and the export entry to 236operate on is assumed to be in a certain register (`RET` as of writing). This 237means that all remote calls _must_ place the export entry in said register, 238even when we don't know beforehand that the call is remote, such as when 239calling a fun. 240 241This is pretty easy to do in assembler and the `emit_setup_export_call` helper 242handles it nicely for us, but we can't set registers when trapping out from C 243code. When trapping to an export entry from C code one must set `c_p->current` 244to the `ErtsCodeMFA` inside the export entry in question, and then set `c_p->i` 245to `beam_bif_export_trap`. 246 247The `BIF_TRAP` macros handle this for you, so you generally don't need to 248think about it. 249 250[W^X]: https://en.wikipedia.org/wiki/W%5EX 251 252## Description of each file 253 254The BeamAsm implementation resides in the `$ERL_TOP/erts/emulator/beam/jit` folder. 255The files are: 256 257* `load.h` 258 * BeamAsm specific header for loading code 259* `asm_load.c` 260 * BeamAsm specific functions for loading code 261* `generators.tab`, `predicates.tab`, `ops.tab` 262 * BeamAsm specific transformations for instructions. See [beam_makeops](beam_makeops) for 263 more details. 264* `beam_asm.h` 265 * Header file describing the C -> C++ api 266* `beam_asm.hpp` 267 * Header file describing the structs and classes used by BeamAsm. 268* `beam_asm.cpp` 269 * Implementation of the main process loop 270 * The BeamAsm initialization code 271 * The C -> C++ interface functions. 272* `beam_asm_module.cpp` 273 * The code for the BeamAsm module code generator logic 274* `beam_asm_global.cpp` 275 * Global code fragments that are used by multiple instructions, e.g. error handling code. 276* `instr_*.cpp` 277 * Implementation of individual instructions grouped into files by area 278* `beam_asm_perf.cpp` 279 * The linux perf support for BeamAsm 280 281## Linux perf support 282 283perf can also be instrumented using BeamAsm symbols to provide more information. As with 284gdb, only the currently executing function will show up in the stack trace, which means 285that perf provides functionality similar to that of [eprof](https://erlang.org/doc/man/eprof.html). 286 287You can run perf on BeamAsm like this: 288 289 perf record erl +JPperf true 290 291and then look at the results using `perf report` as you normally would with perf. 292 293If you want to get some context to you calls you cann use the [lbr](https://lwn.net/Articles/680985/) 294call-graph option to `perf record`. Using `lbr` is not perfect (for instance you 295do not get any syscalls in the context), but it work well enough. 296For example, you can run perf to analyze dialyzer building a PLT like this: 297 298 ERL_FLAGS="+JPperf true +S 1" perf record --call-graph lbr \ 299 dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \ 300 syntax_tools asn1 edoc et ftp inets mnesia observer public_key \ 301 sasl runtime_tools snmp ssl tftp wx xmerl tools 302 303The above code is run using `+S 1` to make the perf output easier to understand. 304If you then run `perf report -f --no-children` you may get something similar to this: 305 306![Linux Perf report: dialyzer PLT build](figures/perf-beamasm.png) 307 308Any Erlang function in the report is prefixed with a `$` and all C functions have 309their normal names. Any Erlang function that has the prefix `$global::` refers 310to a global shared fragment. 311 312So in the above, we can see that we spend the most time doing `eq`, i.e. comparing two terms. 313By expanding it and looking at its parents we can see that it is the function 314`erl_types:t_is_equal/2` that contributes the most to this value. Go and have a look 315at it in the source code to see if you can figure out why so much time is spent there. 316 317After `eq` we see the function `erl_types:t_has_var/1` where we spend almost 3186% of the entire execution in. A while further down you can see `copy_struct` which 319is the function used to copy terms. If we expand it to view the parents we find that 320it is mostly `ets:lookup_element/3` that contributes to this time via the Erlang 321function `dialyzer_plt:ets_table_lookup/2`. 322 323### Flame Graph 324 325You can also create a Flame Graph from the perf output. Flame Graphs are basically 326just another way to look at the same data as the `perf report` output, but can 327be more easily shared with others and manipulated to give a graph tailor-made for 328your needs. For instance, if we run dialyzer with all schedulers: 329 330 ## Run dialyzer with multiple schedulers 331 ERL_FLAGS="+JPperf true" perf record --call-graph lbr \ 332 dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \ 333 syntax_tools asn1 edoc et ftp inets mnesia observer public_key \ 334 sasl runtime_tools snmp ssl tftp wx xmerl tools --statistics 335 336And then use the scripts found at Brendan Gregg's [CPU Flame Graphs](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs) 337web page as follows: 338 339 ## Collect the results 340 perf script > out.perf 341 ## run stackcollapse 342 stackcollapse-perf.pl out.perf > out.folded 343 ## Create the svg 344 flamegraph.pl out.folded > out.svg 345 346We get a graph that would look something like this: 347 348![Linux Perf FlameGraph: dialyzer PLT build](figures/perf-beamasm.svg) 349 350You can view a larger version [here](seefile/figures/perf-beamasm.svg). It contains 351the same information, but it is easier to share with others as it does 352not need the symbols in the executable. 353 354Using the same data we can also produce a graph where the scheduler profile data 355has been merged by using `sed`: 356 357 ## Strip [0-9]+_ from all scheduler names 358 sed -e 's/^[0-9]\+_//' out.folded > out.folded_sched 359 ## Create the svg 360 flamegraph.pl out.folded_sched > out_sched.svg 361 362![Linux Perf FlameGraph: dialyzer PLT build](figures/perf-beamasm-merged.svg) 363 364You can view a larger version [here](seefile/figures/perf-beamasm-merged.svg). 365There are many different transformations that you can do to make the graph show 366you what you want. 367 368### Annotate perf functions 369 370If you want to be able to use the `perf annotate` functionality (and in extention 371the annotate functionality in the `perf report` gui) you need to use a monotonic 372clock when calling `perf record`, i.e. `perf record -k mono`. So for a dialyzer 373run you would do this: 374 375 ERL_FLAGS="+JPperf true +S 1" perf record -k mono --call-graph lbr \ 376 dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \ 377 syntax_tools asn1 edoc et ftp inets mnesia observer public_key \ 378 sasl runtime_tools snmp ssl tftp wx xmerl tools 379 380In order to use the `perf.data` produced by this record you need to first call 381`perf inject --jit` like this: 382 383 perf inject --jit -i perf.data -o perf.jitted.data 384 385and then you can view an annotated function like this: 386 387 perf annotate -M intel -i perf.jitted.data erl_types:t_has_var/1 388 389or by pressing `a` in the `perf report` ui. Then you get something like this: 390 391![Linux Perf FlameGraph: dialyzer PLT build](figures/beamasm-perf-annotate.png) 392 393> *WARNING*: Calling `perf inject --jit` will create a lot of files in `/tmp/` 394> and in `~/.debug/tmp/`. So make sure to cleanup in those directories from time to 395> time or you may run out of inodes. 396 397### perf tips and tricks 398 399You can do a lot of neat things with `perf`. Below is a list of some of the options 400we have found useful: 401 402* `perf report --no-children` 403 Do not include the accumulation of all children in a call. 404* `perf report --call-graph callee` 405 Show the callee rather than the caller when expanding a function call. 406* `perf archive` 407 Create an archive with all the artifacts needed to inspect the data 408 on another host. In early version of perf this command does not work, 409 instead you can use [this bash script](https://github.com/torvalds/linux/blob/master/tools/perf/perf-archive.sh). 410* `perf report` gives "failed to process sample" and/or "failed to process type: 68" 411 This probably means that you are running a bugge version of perf. We have 412 seen this when running Ubuntu 18.04 with kernel version 4. If you update 413 to Ubuntu 20.04 or use Ubuntu 18.04 with kernel version 5 the problem 414 should go away. 415 416## FAQ 417 418### How do I know that I'm running a JIT enabled Erlang? 419 420You will see a banner containing `[jit]` shell when you start. You can also use 421`erlang:system_info(emu_flavor)` to check the flavor and it should be `jit`. 422 423There are three major reasons why when building Erlang/OTP you would not get the JIT. 424 425* You are not building x86 64-bit 426* You do not have a C++ compiler that supports C++-17 427* You do not have an OS that supports executable *and* writable memory 428 429If you run `./configure --enable-jit` configure will abort when it discovers that 430your system cannot build the JIT. 431 432### Is the interpreter still available? 433 434Yes, you can still build the interpreter if you want to. In fact, it is what is used 435on platforms where BeamAsm does not yet work. You can either completely disable 436BeamAsm by passing `--disable-jit` to configure. Or you can build the 437interpreter using `make FLAVOR=emu` and then run it using `erl -emu_flavor emu`. 438 439It is possible to have both the JIT and interpreter available at the same time. 440 441### How much of a speedup should I expect from BeamAsm compared to the interpreter? 442 443It depends a lot on what your application does. Anything from no difference to up to 444four times as fast is possible. 445 446BeamAsm tries very hard to not be slower than the interpreter, but there can be cases 447when that happens. One such could be very short-lived small scripts. If you come across 448any scenarios when this happens, please open a bug report at 449[the Erlang/OTP bug tracker](https://github.com/erlang/otp/issues). 450 451### Would it be possible to add support for BeamAsm on ARM? 452 453Any new architecture needs support in the assembler as well. Since we use 454[asmjit](https://github.com/asmjit/asmjit) for this, that means we need support 455in [asmjit](https://github.com/asmjit/asmjit). BeamAsm uses relatively few 456instructions (mostly, `mov`, `jmp`, `cmp`, `sub`, `add`), so it would not need to have 457full support of all ARM instructions. 458 459Another approach would be to not use [asmjit](https://github.com/asmjit/asmjit) 460for ARM, but instead, use something different to assemble code during load-time. 461 462### Would it be possible to add support for BeamAsm on another OS? 463 464Adding a new OS that runs x86-64 should not need any large changes if 465the OS supports mapping of memory as executable. If the ABI used by the 466OS is not supported changes related to calling C-functions also have to 467be made. 468 469As a reference, it took us about 2-3 weeks to implement support for Windows. 470 471### Would it be possible to add support in perf to better crawl the Erlang stack? 472 473Yes, though not easily. 474 475Using `perf --call-graph lbr` works for Erlang, but it does not give a 476perfect record as the buffer has a limited size. 477