= Byte Code Format
If I use a byte code interpreter, what's the instruction set and architecture?

Register based is higher performance than stack based, and not any more
difficult to implement. Eg, the Add opcode has 3 immediate operands,
which are stack frame slot indexes.

Opcodes are integers. The main interpreter loop is switch(opcode) inside
of a for(;;). The opcodes are sequential from 0, defined by enum OpCode, and
the default case uses a "notreached" compiler directive to guide optimization:
this is `__assume(false);` on MSC++ and `__builtin_unreachable();` in gcc.
ChakraCore uses this to optimize its interpreter loop; not sure if it helps
on gcc or clang.
* Direct threaded code is a popular alternative.
  In GCC, `goto *ip++;` is faster than a branch to the top of the loop
  plus a switch.
  It takes up more space (64 bits per opcode), so worse cache locality.
  The "labels as values" gcc extension isn't available in MSC++.
  In 2014, it was alleged that Clang didn't adequately optimize
  this type of interpreter loop: the IP wasn't stored in a register,
  and the code for a goto to a label value was slower. I dunno if this is
  fixed, but I need clang for macOS.

An operation consists of a 16 bit opcode, followed by zero or more
16 bit operands. Constants are loaded from a constant table.
Relative branches use a signed 16 bit offset into an array of shorts,
so 32767 ops could be an upper limit, unless I provide an extended branch
with a 32 bit operand for those special cases.
This seems good and simple: no memory alignment issues, and I won't run
out of registers (8 bit slot ids are limiting).

Rationale:
* Compact is good: fewer cache misses, better cache locality.
* Opcodes can be 1 byte; I won't likely need more than 256 opcodes.
  But immediate operands need to be more than 1 byte,
  and that raises the issue of unaligned memory access.
  * Eg, relative branch.
  * I might need stack frames with more than 256 value slots?
  * Eg, immediate 8 byte values?
* Intel and ARM support unaligned memory access. ARMv7 needs to use
  special unaligned instructions: LDRD vs LDR. The compiler needs to know.
  MSC++ has an `__unaligned` pointer type qualifier; g++ doesn't
  (the documentation for `__attribute__((aligned(1)))` doesn't give hope).
  If I have to use memcpy() to fix an alignment problem in the interpreter
  main loop, my performance will be crap.
* The clang mailing list describes using
  `typedef int32_t unaligned_int32_t __attribute__((aligned(1)));`
  to make unaligned pointers safe to use. So maybe I'm good?
  The same post gives reasons not to rely on this: the attribute isn't
  part of the type system, and could be dropped in various contexts like
  template parameters and the ?: operator.
* I could use "short codes" instead of "byte codes".
  An instruction array is an array of uint16_t. Immediate operands are 16 bit.
  Constants are loaded from a constant table.
  A few operations might need rarely used variants with 32 bit
  immediate operands (eg, branch and load constant). I won't support stack
  frames with more than 65536 value slots.
* This seems good and simple. No alignment issues. I won't run out of registers,
  and I won't need to consider 8 and 16 bit versions of every arithmetic opcode.

Problem: I need 4 operands for Add: the 4th is a source pointer
(`Phrase*`) for reporting an error. This is kind of like how Python and Ruby
apparently embed line numbers in their operations.
* I don't really want to embed a 64 bit pointer in the instruction stream
  that is almost never used: poor cache locality.
* I could have a location table (array of `Phrase*`) and embed a 16 bit
  location index into the instruction. 25% more storage, but better
  cache locality. Easy to implement.
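A minimal sketch of that location-table scheme, assuming made-up names: `Phrase` stands in for the source-tree node type mentioned above, and the emitter helper `intern_location` is illustrative, not actual Curv code. The emitter records each source pointer in a side table and embeds the returned 16 bit index inline after the Add opcode's slot operands.

```cpp
// Sketch of the 16-bit location-index scheme: a side table of source
// pointers, indexed by a 16-bit operand embedded in the instruction stream.
#include <cassert>
#include <cstdint>
#include <vector>

struct Phrase;  // source-tree node; an incomplete type suffices here

std::vector<const Phrase*> location_table;

// Record a source pointer and return the 16-bit index that gets emitted
// inline, e.g.: emit(opAdd); emit(dst); emit(a); emit(b);
//               emit(intern_location(phrase));
uint16_t intern_location(const Phrase* source)
{
    assert(location_table.size() <= 0xFFFF);  // 16-bit index limit
    location_table.push_back(source);
    return uint16_t(location_table.size() - 1);
}

// On an error, the thrower recovers the source location with:
//     const Phrase* where = location_table[loc];
```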
* When emitting an Add operation, instead of putting the location into the
  location table and emitting the location index inline, I could instead
  add an (opcode index, location) pair to an external table. Then, when an
  error occurs, use the opcode index to look up the location. Not intrinsically
  more difficult, improves cache locality, and the external table could later
  be extended or rearranged to support a reverse map from source to
  instructions.
* The opcode index could be stored in the Phrase. Then we exhaustively search
  the source tree for the index when an error occurs. This would speed up
  compilation and slow down error handling, which is fine.
* I could have a location map that maps opcode ranges onto source pointers.
  The opcode array is split into a sequence of contiguous ranges.
  For each range, we store the length of the range (16 bits)
  and a source pointer. We use two arrays to eliminate alignment gaps.
  More work. Could be used to set a breakpoint at a source location by
  pointing to code in the editor? Single stepping?

It's unlikely I need more than 256 opcodes, which will fit in a byte.
What about literal operands, and their alignment?
A curv::Value is 8 bytes. I'll also need offsets or pointers for jumps,
and offsets relative to frame pointers for loading variables.
I could use 32 bit opcodes with an 8 bit operator field and 24 bits of
immediate operand. Constant Values are stored in a literal table.
* That's a lot of empty operand fields. Poor cache locality.
  Ela uses two arrays of the same size, an ops array and an int32 opData array.
* Lua uses 32 bit opcodes with a 6 bit operation field; it's a register machine.
* CPython 3.3:
  * 101 opcodes: stack, flow ctrl, arithmetic, constructors, details.
  * 1 byte operation; an optional unaligned 2 byte argument may follow
  * registers: IP, stack, fast_locals (local variables and parameters)
  * load_global is an index into an array of names
* Ruby

**Function Calls.**
Several cases:
* **Call to an unknown dynamic function value.**
  Here we are computing a function value at run time, which is stored in a
  value slot. With the current simplified function design, we compare the
  provided # of arguments with the nargs value in the function value, and
  abort if they are different. So: push the arguments,
  then `opCall(funSlot,nargs)`. Abort if fun->nargs != nargs,
  otherwise push the return address and branch to the function.
* **Call to a known static function.** In the machine code interpreter, this
  would be optimized the same way as in C, with a call to a static address.
  Also, nargs checking happens at compile time.
  In the bytecode interpreter, the function is in the constant table,
  so `opCallConstant(funSlot)`. [Builtin functions are accessed via the
  constant table, apparently.]
* ...

I could abstract the difference between different bytecode representations
using an abstract interface for generating and executing code.

I will separate the control stack from the argument stack.
One holds pointers and one holds doubles: potentially different alignment.

Registers:
* SP: stack pointer (value stack)
* FP: frame pointer (value stack)
* KP: constants
* IP: instruction pointer
* CS: control stack pointer

Function call:
* validate the length of the argument list (at compile or run time)
* push arguments. SP points at 1+the last argument.
* call the function. CS points at 1+the return address. IP points to the
  function body.
* return: remember the top of stack value; pop all temporaries and the original
  arguments; push the return value; IP = pop the return address.