= Byte Code Format
If I use a byte code interpreter, what's the instruction set and architecture?

Register based is higher performance than stack based, and not any more
difficult to implement. Eg, the Add opcode has 3 immediate operands,
which are stack frame slot indexes.

Opcodes are integers. The main interpreter loop is switch(opcode) inside
of a for(;;). The opcodes are sequential from 0, defined by enum OpCode, and
the default case uses a "notreached" compiler directive to guide optimization:
this is `__assume(false);` on MSC++ and `__builtin_unreachable();` in gcc.
ChakraCore uses this to optimize its interpreter loop; not sure if it helps
on gcc or clang.
* Direct threaded code is a popular alternative.
  In GCC, `goto *ip++;` is faster than a branch to the top of the loop plus
  a switch. It takes up more space (64 bits per opcode), so worse cache
  locality. The `labels as values` gcc extension isn't available in MSC++.
  In 2014, it was alleged that Clang didn't adequately optimize
  this type of interpreter loop: the IP wasn't stored in a register,
  and the code for a goto to a label value was slower. I dunno if this is
  fixed, but I need clang for MacOS.
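For concreteness, here is a minimal sketch of the switch-inside-for(;;) dispatch described above, with a hypothetical three-operand register-style Add. The opcode names, encoding, and fixed slot count are illustrative, not Curv's actual design:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opcode set; the real enum OpCode will differ.
enum OpCode : uint16_t { opLoadConst, opAdd, opReturn };

#if defined(_MSC_VER)
  #define NOTREACHED() __assume(false)
#else
  #define NOTREACHED() __builtin_unreachable()
#endif

// Minimal register-style evaluator: slots are the stack frame;
// Add names two source slots and a destination slot.
double run(const std::vector<uint16_t>& code, const std::vector<double>& consts)
{
    double slots[16] = {0};
    const uint16_t* ip = code.data();
    for (;;) {
        switch (*ip++) {
        case opLoadConst: { uint16_t dst = *ip++, k = *ip++;
                            slots[dst] = consts[k]; break; }
        case opAdd:       { uint16_t dst = *ip++, a = *ip++, b = *ip++;
                            slots[dst] = slots[a] + slots[b]; break; }
        case opReturn:    return slots[*ip];
        default:          NOTREACHED();  // guides optimization of the dispatch
        }
    }
}
```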

An operation consists of a 16 bit opcode, followed by zero or more
16 bit operands. Constants are loaded from a constant table.
Relative branches use a signed 16 bit offset into an array of shorts,
so 32767 ops could be an upper limit, unless I provide an extended branch
with a 32 bit operand for those special cases.
This seems good and simple: no memory alignment issues, and I won't run
out of registers (8 bit slot ids are limiting).
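The signed relative-branch arithmetic reduces to round-tripping the offset through int16_t; a sketch, with helper names of my own invention:

```cpp
#include <cassert>
#include <cstdint>

// A relative branch operand is a signed 16 bit offset, measured in shorts.
// It travels in the uint16_t instruction stream; converting back through
// int16_t recovers the sign (implementation-defined before C++20, but
// modular and reliable on mainstream compilers).
inline uint16_t encode_offset(int16_t off)
{
    return static_cast<uint16_t>(off);
}

inline const uint16_t* branch_target(const uint16_t* ip, uint16_t word)
{
    // ip points just past the offset operand; offsets are relative to that.
    return ip + static_cast<int16_t>(word);
}
```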

Rationale:
* Compact is good: fewer cache misses, better cache locality.
* Opcodes can be 1 byte; I won't likely need more than 256 opcodes.
  But immediate operands need to be more than 1 byte,
  and that raises the issue of unaligned memory access.
  * Eg, relative branch.
  * I might need stack frames with more than 256 value slots?
  * Eg, immediate 8 byte values?
* Intel and ARM support unaligned memory access, but on ARMv7 only the
  ordinary load/store instructions (LDR, LDRH) tolerate unaligned addresses;
  others (LDRD, LDM) don't, so the compiler needs to know.
  MSC++ has an `__unaligned` pointer type decoration, g++ doesn't
  (the documentation for `__attribute__((aligned(1)))` doesn't give hope).
  If I have to use memcpy() to fix an alignment problem in the interpreter
  main loop, my performance will be crap.
* The clang mailing list describes using
  `typedef int32_t unaligned_int32_t __attribute__((aligned(1)));`
  to make unaligned pointers safe to use. So maybe I'm good?
  The same post gives reasons not to rely on this: the attribute isn't
  part of the type system, and could be dropped in various contexts like
  template parameters and the ?: operator.
* I could use 'short codes' instead of 'byte codes'.
  An instruction array is an array of uint16_t. Immediate operands are 16 bit.
  Constants are loaded from a constant table.
  A few operations might need rarely used variants with 32 bit
  immediate operands (eg, branch and load constant). I won't support stack
  frames with more than 65536 value slots.
* This seems good and simple. No alignment issues. I won't run out of registers,
  and I won't need to consider 8 and 16 bit versions of every arithmetic opcode.
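If byte codes were chosen anyway, the standard portable idiom for an unaligned load is a fixed-size memcpy(); despite the worry above, gcc, clang and MSVC typically lower it to a single (unaligned) load instruction at normal optimization levels, which is worth verifying before assuming the worst. A sketch, with a function name of my own:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable unaligned load. The fixed-size memcpy is usually compiled to a
// single load instruction, so this need not be a performance disaster;
// check the generated code on the target compiler to be sure.
inline int32_t load_i32_unaligned(const void* p)
{
    int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```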

Problem: I need 4 operands for Add: the 4th is a source pointer
`(Phrase*)` for reporting an error. This is kind of like how Python and Ruby
apparently have line numbers embedded in their operations.
* I don't really want to embed a 64 bit pointer in the instruction stream
  that is almost never used: poor cache locality.
* I could have a location table (array of `Phrase*`) and embed a 16 bit
  location index into the instruction. 25% more storage, but better
  cache locality. Easy to implement.
* When emitting an Add operation, instead of putting the location into the
  location table and emitting the location index inline, I could instead
  add an (opcode index, location) pair to an external table. Then, when an
  error occurs, use the opcode index to look up the location. Not intrinsically
  more difficult, improves cache locality, and the external table could later
  be extended or rearranged to support a reverse map from source to
  instructions.
* The opcode index could be stored in the Phrase. Then we exhaustively search
  the source tree for the index when an error occurs. This would speed up
  compilation and slow down error handling, which is fine.
* I could have a location map that maps opcode ranges onto source pointers.
  The opcode array is split into a sequence of contiguous ranges.
  For each range, we store the length of the range (16 bits)
  and a source pointer. We use two arrays to eliminate alignment gaps.
  More work. Could be used to set a breakpoint at a source location by
  pointing to code in the editor? Single stepping?
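A sketch of the external (opcode index, location) table idea, looked up by binary search on error. `LocEntry` and `find_location` are hypothetical names, and `Phrase` is an empty stand-in for the real source-tree node:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Phrase {};  // stand-in for the real source-location node

// Entries are emitted in increasing opcode-index order during compilation.
struct LocEntry { uint32_t op_index; const Phrase* where; };

// On error, find the last entry at or before the failing instruction.
// This is off the hot path, so O(log n) search costs nothing at run time.
inline const Phrase* find_location(const std::vector<LocEntry>& tab, uint32_t op)
{
    auto it = std::upper_bound(tab.begin(), tab.end(), op,
        [](uint32_t v, const LocEntry& e) { return v < e.op_index; });
    return it == tab.begin() ? nullptr : (it - 1)->where;
}
```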

It's unlikely I need more than 256 opcodes, which will fit in a byte.
What about literal operands, and their alignment?
A curv::Value is 8 bytes. I'll also need offsets or pointers for jumps,
and offsets relative to frame pointers for loading variables.
I could use 32 bit opcodes with an 8 bit operator field and 24 bits of
immediate operand. Constant Values are stored in a literal table.
* That's a lot of empty operand fields. Poor cache locality.
  Ela uses two arrays of the same size, an ops array and an int32 opData array.
* Lua uses 32 bit opcodes with a 6 bit operation field; register machine.
* CPython 3.3:
  * 101 opcodes: stack, flow ctrl, arithmetic, constructors, details.
  * 1 byte operation; an optional unaligned 2 byte argument may follow.
  * registers: IP, stack, fast_locals (local variables and parameters)
  * load_global is an index into an array of names
* Ruby

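The 8 bit operation / 24 bit operand split considered above reduces to plain bit-field extraction. A sketch; the field widths are illustrative (not Lua's actual layout) and the helper names are mine:

```cpp
#include <cassert>
#include <cstdint>

// One 32 bit instruction word: low 8 bits are the operation,
// high 24 bits an unsigned immediate operand.
inline uint32_t make_insn(uint8_t op, uint32_t operand)
{
    return op | (operand << 8);   // operand must fit in 24 bits
}
inline uint8_t  insn_op(uint32_t w)      { return w & 0xFF; }
inline uint32_t insn_operand(uint32_t w) { return w >> 8; }
```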
**Function Calls.**
Several cases:
* **Call to an unknown dynamic function value.**
  Here we are computing a function value at run time, which is stored in a
  value slot. With the current simplified function design, we compare the
  provided # of arguments with the nargs value in the function value, and
  abort if they are different. So: push the arguments,
  then `opCall(funSlot,nargs)`. Abort if fun->nargs != nargs,
  otherwise push the return address and branch to the function.
* **Call to a known static function.** In the machine code interpreter, this
  would be optimized in the same way as in C, with a call to a static address.
  Also, nargs checking happens at compile time.
  In the bytecode interpreter, the function is in the constant table,
  so `opCallConstant(funSlot)`. [builtin functions are accessed via the
  constant table, apparently.]
* ...

I could abstract the difference between different bytecode representations
using an abstract interface for generating and executing code.

I will separate the control stack from the argument stack.
One holds pointers and one holds doubles: potentially different alignment.

Registers:
* SP: stack pointer (value stack)
* FP: frame pointer (value stack)
* KP: constants
* IP: instruction pointer
* CS: control stack pointer
Function call:
* validate length of argument list (compile or run time)
* push arguments. SP points one past the last argument.
* call function. CS points one past the return address. IP points to the
  function body.
* return: remember the top of stack value; pop all temporaries and the original
  arguments; push the return value; IP = pop the return address.
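The call/return steps above can be sketched with the two stacks held separately. This assumes temporaries have already been popped before return, and the names (`Machine`, `Addr`) are mine:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Addr = int;  // stand-in for an instruction pointer

struct Machine {
    std::vector<double> value;   // value stack: SP is value.size()
    std::vector<Addr>   control; // control stack: CS is control.size()

    // Arguments are already pushed; record where to resume, jump to body.
    Addr call(Addr ret_addr, Addr fun_body) {
        control.push_back(ret_addr);  // CS points one past the return address
        return fun_body;              // new IP
    }

    // Assumes temporaries are popped: the result sits just above the args.
    Addr ret(std::size_t nargs) {
        double result = value.back();            // remember top of stack
        value.resize(value.size() - nargs - 1);  // pop result + arguments
        value.push_back(result);                 // push the return value
        Addr ra = control.back();                // IP = pop return address
        control.pop_back();
        return ra;
    }
};
```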