= Byte Code Format
If I use a byte code interpreter, what's the instruction set and architecture?

Register based is higher performance than stack based, and not any more
difficult to implement. Eg, the Add opcode has 3 immediate operands,
which are stack frame slot indexes.

Opcodes are integers. The main interpreter loop is switch(opcode) inside
of a for(;;). The opcodes are sequential from 0, defined by enum OpCode, and
the default case uses a "notreached" compiler directive to guide optimization:
this is `__assume(false);` on MSC++ and `__builtin_unreachable();` in gcc.
ChakraCore uses this to optimize its interpreter loop; not sure if it helps
on gcc or clang.
* Direct threaded code is a popular alternative.
  In GCC, `goto *ip++;` is faster than a branch to the top of the loop plus
  a switch. It takes up more space (64 bits per opcode), so worse cache
  locality. The `labels as values` gcc extension isn't available in MSC++.
  In 2014, it was alleged that Clang didn't adequately optimize
  this type of interpreter loop: the IP wasn't stored in a register,
  and the code for a goto to a label value was slower. I dunno if this is
  fixed, but I need clang for MacOS.
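For concreteness, here is a minimal sketch of the switch-inside-for(;;) dispatch described above, with a hypothetical three-operand register-style Add. The opcode names, encoding, and fixed slot count are illustrative, not Curv's actual design:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical opcode set; the real enum OpCode will differ.
enum OpCode : uint16_t { opLoadConst, opAdd, opReturn };

#if defined(_MSC_VER)
  #define NOTREACHED() __assume(false)
#else
  #define NOTREACHED() __builtin_unreachable()
#endif

// Minimal register-style evaluator: slots are the stack frame;
// Add names two source slots and a destination slot.
double run(const std::vector<uint16_t>& code, const std::vector<double>& consts)
{
    double slots[16] = {0};
    const uint16_t* ip = code.data();
    for (;;) {
        switch (*ip++) {
        case opLoadConst: { uint16_t dst = *ip++, k = *ip++;
                            slots[dst] = consts[k]; break; }
        case opAdd:       { uint16_t dst = *ip++, a = *ip++, b = *ip++;
                            slots[dst] = slots[a] + slots[b]; break; }
        case opReturn:    return slots[*ip];
        default:          NOTREACHED();  // guides optimization of the dispatch
        }
    }
}
```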

An operation consists of a 16 bit opcode, followed by zero or more
16 bit operands. Constants are loaded from a constant table.
Relative branches use a signed 16 bit offset into an array of shorts,
so 32767 ops could be an upper limit, unless I provide an extended branch
with a 32 bit operand for those special cases.
This seems good and simple: no memory alignment issues, and I won't run
out of registers (8 bit slot ids are limiting).
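The signed relative-branch arithmetic reduces to round-tripping the offset through int16_t; a sketch, with helper names of my own invention:

```cpp
#include <cassert>
#include <cstdint>

// A relative branch operand is a signed 16 bit offset, measured in shorts.
// It travels in the uint16_t instruction stream; converting back through
// int16_t recovers the sign (implementation-defined before C++20, but
// modular and reliable on mainstream compilers).
inline uint16_t encode_offset(int16_t off)
{
    return static_cast<uint16_t>(off);
}

inline const uint16_t* branch_target(const uint16_t* ip, uint16_t word)
{
    // ip points just past the offset operand; offsets are relative to that.
    return ip + static_cast<int16_t>(word);
}
```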

Rationale:
* Compact is good: fewer cache misses, better cache locality.
* Opcodes can be 1 byte; I won't likely need more than 256 opcodes.
  But immediate operands need to be more than 1 byte,
  and that raises the issue of unaligned memory access.
  * Eg, relative branch.
  * I might need stack frames with more than 256 value slots?
  * Eg, immediate 8 byte values?
* Intel and ARM support unaligned memory access, but on ARMv7 only the
  ordinary load/store instructions (LDR, LDRH) tolerate unaligned addresses;
  others (LDRD, LDM) don't, so the compiler needs to know.
  MSC++ has an `__unaligned` pointer type decoration, g++ doesn't
  (the documentation for `__attribute__((aligned(1)))` doesn't give hope).
  If I have to use memcpy() to fix an alignment problem in the interpreter
  main loop, my performance will be crap.
* The clang mailing list describes using
  `typedef int32_t unaligned_int32_t __attribute__((aligned(1)));`
  to make unaligned pointers safe to use. So maybe I'm good?
  The same post gives reasons not to rely on this: the attribute isn't
  part of the type system, and could be dropped in various contexts like
  template parameters and the ?: operator.
* I could use 'short codes' instead of 'byte codes'.
  An instruction array is an array of uint16_t. Immediate operands are 16 bit.
  Constants are loaded from a constant table.
  A few operations might need rarely used variants with 32 bit
  immediate operands (eg, branch and load constant). I won't support stack
  frames with more than 65536 value slots.
* This seems good and simple. No alignment issues. I won't run out of registers,
  and I won't need to consider 8 and 16 bit versions of every arithmetic opcode.
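If byte codes were chosen anyway, the standard portable idiom for an unaligned load is a fixed-size memcpy(); despite the worry above, gcc, clang and MSVC typically lower it to a single (unaligned) load instruction at normal optimization levels, which is worth verifying before assuming the worst. A sketch, with a function name of my own:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable unaligned load. The fixed-size memcpy is usually compiled to a
// single load instruction, so this need not be a performance disaster;
// check the generated code on the target compiler to be sure.
inline int32_t load_i32_unaligned(const void* p)
{
    int32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```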

Problem: I need 4 operands for Add: the 4th is a source pointer
`(Phrase*)` for reporting an error. This is kind of like how Python and Ruby
apparently have line numbers embedded in their operations.
* I don't really want to embed a 64 bit pointer in the instruction stream
  that is almost never used: poor cache locality.
* I could have a location table (array of `Phrase*`) and embed a 16 bit
  location index into the instruction. 25% more storage, but better
  cache locality. Easy to implement.
* When emitting an Add operation, instead of putting the location into the
  location table and emitting the location index inline, I could instead
  add an (opcode index, location) pair to an external table. Then, when an
  error occurs, use the opcode index to look up the location. Not intrinsically
  more difficult, improves cache locality, and the external table could later
  be extended or rearranged to support a reverse map from source to
  instructions.
* The opcode index could be stored in the Phrase. Then we exhaustively search
  the source tree for the index when an error occurs. This would speed up
  compilation and slow down error handling, which is fine.
* I could have a location map that maps opcode ranges onto source pointers.
  The opcode array is split into a sequence of contiguous ranges.
  For each range, we store the length of the range (16 bits)
  and a source pointer. We use two arrays to eliminate alignment gaps.
  More work. Could be used to set a breakpoint at a source location by
  pointing to code in the editor? Single stepping?
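A sketch of the external (opcode index, location) table idea, looked up by binary search on error. `LocEntry` and `find_location` are hypothetical names, and `Phrase` is an empty stand-in for the real source-tree node:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Phrase {};  // stand-in for the real source-location node

// Entries are emitted in increasing opcode-index order during compilation.
struct LocEntry { uint32_t op_index; const Phrase* where; };

// On error, find the last entry at or before the failing instruction.
// This is off the hot path, so O(log n) search costs nothing at run time.
inline const Phrase* find_location(const std::vector<LocEntry>& tab, uint32_t op)
{
    auto it = std::upper_bound(tab.begin(), tab.end(), op,
        [](uint32_t v, const LocEntry& e) { return v < e.op_index; });
    return it == tab.begin() ? nullptr : (it - 1)->where;
}
```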

It's unlikely I need more than 256 opcodes, which will fit in a byte.
What about literal operands, and their alignment?
A curv::Value is 8 bytes. I'll also need offsets or pointers for jumps,
and offsets relative to frame pointers for loading variables.
I could use 32 bit opcodes with an 8 bit operator field and 24 bits of
immediate operand. Constant Values are stored in a literal table.
* That's a lot of empty operand fields. Poor cache locality.
  Ela uses two arrays of the same size, an ops array and an int32 opData array.
* Lua uses 32 bit opcodes with a 6 bit operation field; register machine.
* CPython 3.3:
  * 101 opcodes: stack, flow ctrl, arithmetic, constructors, details.
  * 1 byte operation; an optional unaligned 2 byte argument may follow.
  * registers: IP, stack, fast_locals (local variables and parameters)
  * load_global is an index into an array of names
* Ruby

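The 8 bit operation / 24 bit operand split considered above reduces to plain bit-field extraction. A sketch; the field widths are illustrative (not Lua's actual layout) and the helper names are mine:

```cpp
#include <cassert>
#include <cstdint>

// One 32 bit instruction word: low 8 bits are the operation,
// high 24 bits an unsigned immediate operand.
inline uint32_t make_insn(uint8_t op, uint32_t operand)
{
    return op | (operand << 8);   // operand must fit in 24 bits
}
inline uint8_t  insn_op(uint32_t w)      { return w & 0xFF; }
inline uint32_t insn_operand(uint32_t w) { return w >> 8; }
```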
**Function Calls.**
Several cases:
* **Call to an unknown dynamic function value.**
  Here we are computing a function value at run time, which is stored in a
  value slot. With the current simplified function design, we compare the
  provided # of arguments with the nargs value in the function value, and
  abort if they are different. So: push the arguments,
  then `opCall(funSlot,nargs)`. Abort if fun->nargs != nargs,
  otherwise push the return address and branch to the function.
* **Call to a known static function.** In the machine code interpreter, this
  would be optimized in the same way as in C, with a call to a static address.
  Also, nargs checking happens at compile time.
  In the bytecode interpreter, the function is in the constant table,
  so `opCallConstant(funSlot)`. [builtin functions are accessed via the
  constant table, apparently.]
* ...

I could abstract the difference between different bytecode representations
using an abstract interface for generating and executing code.

I will separate the control stack from the argument stack.
One holds pointers and one holds doubles: potentially different alignment.

Registers:
* SP: stack pointer (value stack)
* FP: frame pointer (value stack)
* KP: constants
* IP: instruction pointer
* CS: control stack pointer
Function call:
* validate length of argument list (compile or run time)
* push arguments. SP points one past the last argument.
* call function. CS points one past the return address. IP points to the
  function body.
* return: remember the top of stack value; pop all temporaries and the original
  arguments; push the return value; IP = pop the return address.
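The call/return steps above can be sketched with the two stacks held separately. This assumes temporaries have already been popped before return, and the names (`Machine`, `Addr`) are mine:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Addr = int;  // stand-in for an instruction pointer

struct Machine {
    std::vector<double> value;   // value stack: SP is value.size()
    std::vector<Addr>   control; // control stack: CS is control.size()

    // Arguments are already pushed; record where to resume, jump to body.
    Addr call(Addr ret_addr, Addr fun_body) {
        control.push_back(ret_addr);  // CS points one past the return address
        return fun_body;              // new IP
    }

    // Assumes temporaries are popped: the result sits just above the args.
    Addr ret(std::size_t nargs) {
        double result = value.back();            // remember top of stack
        value.resize(value.size() - nargs - 1);  // pop result + arguments
        value.push_back(result);                 // push the return value
        Addr ra = control.back();                // IP = pop return address
        control.pop_back();
        return ra;
    }
};
```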