..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

Putting the VM in TVM: The Relay Virtual Machine
================================================

Relay, a new program representation, has enabled the representation and optimization of
a great breadth of machine learning programs.
Unfortunately, by supporting a more expressive set of programs, we have
introduced several new execution challenges.

Relay's interpreter can execute the full language but has notable limitations
that make it unsuited for production deployments. It is structured as an
interpreter that traverses the AST to execute the program. This approach is conceptually
simple but inefficient, as the AST traversal relies heavily on indirection.

There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
for these, but none is sufficiently compelling or optimized.

The second execution mechanism is the existing graph runtime. To target Relay
programs to it, we compile a small subset of them to the old graph format and execute
them on the runtime. The graph runtime provides a fast execution experience, but only for a very limited
subset of Relay programs.

An alternative but non-standard approach is Relay's ahead-of-time compiler,
which compiles a Relay program into a shared library containing an
ahead-of-time implementation. The ahead-of-time compiler provides compelling performance
but is difficult to extend and instrument, since either can only be done by modifying the
code generation and optimization mechanisms.

The Relay virtual machine is intended to be a framework that balances these competing
approaches, providing a dynamic execution environment which can be extended, instrumented,
and integrated with other approaches like ahead-of-time compilation via a flexible extension
mechanism.

The virtual machine is designed to strike a balance between performance and flexibility
when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems,
and there have been various virtual machine designs for both full-fledged
and embedded programming languages.
Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
In the context of machine learning we manipulate primarily tensor values, using a (relatively)
small number of high-level instructions. ML programs' cost centers are expensive operator invocations,
such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
micro-optimizations present in scalar VMs are dramatically less important.

TVM has provided strong support for vision models,
but we want to grow to support a wider variety of models.
The graph runtime is able to utilize the fully static nature of the input graphs to perform
aggressive optimizations such as fully static allocation and optimal memory reuse.
When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
allocation, we must change how execution works. A virtual machine for Relay is a natural choice.

The rest of this document provides a high-level overview of the Relay
virtual machine design and its instruction set.

Design
------

The VM's design is focused on simplicity without sacrificing performance.
In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM.

In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
reuse of static fragments, and the ability to handle dynamic shapes (i.e., jagged tensors).

Instruction Set
~~~~~~~~~~~~~~~

The choices of an instruction set and instruction representation are the most critical design decisions for a VM.
The current representation of the instructions is a tagged union containing the op-code and the data payload.  An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay.
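As a rough illustration of the tagged-union layout described above, the following C++ sketch pairs an op-code with a union of per-instruction payloads. The names `Opcode`, `RetArgs`, `GotoArgs`, `LoadConstiArgs`, `MakeRet`, and `MakeLoadConsti` are hypothetical and exist only for this example, and treating `RegName` as an integer index is an assumption. The sketch is fixed-size and leaves out the shape payload that makes the real encoding variable-length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: registers are named by an integer index.
using RegName = std::int64_t;

// Hypothetical subset of op-codes, used only for this sketch.
enum class Opcode { Ret, Goto, LoadConsti };

// One payload struct per instruction, mirroring the argument lists in this document.
struct RetArgs { RegName dst; RegName result; };
struct GotoArgs { std::size_t pc_offset; };
struct LoadConstiArgs { std::size_t val; RegName dst; };

// Tagged union: `op` records which member of the union is active.
struct Instruction {
  Opcode op;
  union {
    RetArgs ret;
    GotoArgs jump;
    LoadConstiArgs load_consti;
  };
};

// Helper constructors set the tag and the matching payload together,
// so the two can never disagree.
inline Instruction MakeRet(RegName dst, RegName result) {
  Instruction instr{};
  instr.op = Opcode::Ret;
  instr.ret = RetArgs{dst, result};
  return instr;
}

inline Instruction MakeLoadConsti(std::size_t val, RegName dst) {
  Instruction instr{};
  instr.op = Opcode::LoadConsti;
  instr.load_consti = LoadConstiArgs{val, dst};
  return instr;
}
```

A reader of such an instruction must check `op` before touching a union member; reading a member other than the one last written is undefined behavior in C++.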

Ret
^^^
**Arguments**:
::

  RegName dst
  RegName result

Return the object in register `result` to the caller's register `dst`.

InvokePacked
^^^^^^^^^^^^
**Arguments**:
::

  size_t packed_index
  size_t arity
  size_t output_size
  RegName* packed_args

Invoke the packed function denoted by `packed_index`. The `arity`
and `output_size` are used to inform the VM how many inputs and
outputs to expect. `packed_args` stores the list of argument registers.

AllocTensor
^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName shape_register
  size_t ndim
  DLDataType dtype

Allocate a tensor value of the appropriate shape (stored in `shape_register`) and `dtype`. The result
is saved to register `dst`.
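To make the role of the shape and `dtype` concrete, the sketch below computes how many bytes a dense tensor of a given shape would occupy. `DataTypeSketch` is a hypothetical stand-in with the same three fields as DLPack's `DLDataType` (`code`, `bits`, `lanes`), and `TensorBytes` is an illustrative helper rather than a VM function. How the VM actually sizes and places allocations is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in with the same fields as DLPack's DLDataType.
struct DataTypeSketch {
  std::uint8_t code;    // type code, e.g. 2 for floating point in DLPack
  std::uint8_t bits;    // bits per lane
  std::uint16_t lanes;  // number of vector lanes
};

// Bytes needed for a dense tensor: element count times bytes per element.
// An empty shape denotes a rank-0 tensor holding exactly one element.
inline std::size_t TensorBytes(const std::vector<std::int64_t>& shape,
                               DataTypeSketch dtype) {
  std::size_t num_elements = 1;
  for (std::int64_t dim : shape) {
    num_elements *= static_cast<std::size_t>(dim);
  }
  // Round the per-element bit width up to whole bytes.
  std::size_t bytes_per_element =
      (static_cast<std::size_t>(dtype.bits) * dtype.lanes + 7) / 8;
  return num_elements * bytes_per_element;
}
```

Because the shape arrives through a register rather than being baked into the program, the same instruction can serve tensors whose size is only known at run time.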

AllocADT
^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t tag
  size_t num_fields
  RegName* datatype_fields

Allocate a data type with the tag `tag` using the `num_fields` entries
from registers `datatype_fields`. The result is saved to register `dst`.

AllocClosure
^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t clo_index
  size_t num_freevar
  RegName* free_vars

Allocate a closure with the VMFunction at `clo_index` as
its code, and the `num_freevar` entries from registers in
`free_vars`. The result is saved to register `dst`.

GetField
^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName object
  size_t field_index

Get the field value with index `field_index` from `object` and save the result to register `dst`.

If
^^
**Arguments**:
::

  RegName test
  RegName target
  size_t true_offset
  size_t false_offset

Check if the object at register `test` is equal to `target`.
If equal, relative jump by `true_offset`, else relative
jump by `false_offset`.

GetTag
^^^^^^
**Arguments**:
::

  RegName object
  RegName dst

Get the object tag for the ADT object in register `object` and save the result to register `dst`.

Fatal
^^^^^
Fail the virtual machine execution.

Goto
^^^^
**Arguments**:
::

  size_t pc_offset

Relative unconditional jump by `pc_offset`.

Invoke
^^^^^^
**Arguments**:
::

  size_t func_index

Invoke the function at `func_index`, consuming the number of arguments given by the VMFunction's
arity field.

InvokeClosure
^^^^^^^^^^^^^
**Arguments**:
::

  RegName closure
  size_t num_closure_args
  RegName* closure_args

Invoke `closure`, consuming the number of arguments declared in the closure's VMFunction.
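The interplay between `AllocClosure` and `InvokeClosure` can be sketched as follows. This is a simplified model and not the VM's implementation: values are plain integers instead of objects, functions are `std::function` values, and the convention that captured free variables are appended after the explicit arguments is an assumption made for this example. All names ending in `Sketch` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Value = std::int64_t;  // stand-in for a VM object
using FunctionSketch = std::function<Value(const std::vector<Value>&)>;

// What AllocClosure would produce: a function index plus captured values.
struct ClosureSketch {
  std::size_t func_index;
  std::vector<Value> free_vars;
};

// What InvokeClosure would do: pass the explicit arguments followed by the
// captured free variables (this ordering is an assumption of the sketch).
inline Value InvokeClosureSketch(const std::vector<FunctionSketch>& functions,
                                 const ClosureSketch& closure,
                                 std::vector<Value> args) {
  args.insert(args.end(), closure.free_vars.begin(), closure.free_vars.end());
  return functions.at(closure.func_index)(args);
}
```

With a two-parameter addition function at index 0, a closure capturing the value 5 behaves like a one-argument "add 5" function when invoked with a single explicit argument.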

LoadConst
^^^^^^^^^
**Arguments**:
::

  RegName dst
  size_t const_index

Load the constant at `const_index` from the constant pool. The result is saved to register `dst`.

LoadConsti
^^^^^^^^^^
**Arguments**:
::

  size_t val
  RegName dst

Load the constant integer `val` to register `dst`. The result is a rank-0 tensor.

Object Representation
~~~~~~~~~~~~~~~~~~~~~
We use a simple object representation that uses shared pointers and tagging.
There is a huge space of possible object representation trade-offs, but we
believe micro-optimizing this code has little to no effect on the end-to-end performance.

::

    struct ObjectCell {
      ObjectTag tag;
      ...
    };

    struct Object {
      std::shared_ptr<ObjectCell> ptr;
      ...
    };

See `include/tvm/runtime/vm.h` for more details.

Currently, we support 3 types of objects: tensors, data types, and closures.

::

    Object Tensor(const tvm::runtime::NDArray& data);
    Object ADT(size_t tag, const std::vector<Object>& fields);
    Object Closure(size_t func_index, std::vector<Object> free_vars);
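A minimal, self-contained sketch of this shared-pointer-plus-tag scheme is given below, covering only the data type case. The names `DatatypeCell`, `MakeDatatype`, and `AsDatatype`, as well as the enumerator names of `ObjectTag`, are assumptions made for illustration; the actual definitions are in `include/tvm/runtime/vm.h` and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical enumerators for the three supported kinds of object.
enum class ObjectTag { kTensor, kDatatype, kClosure };

struct ObjectCell {
  ObjectTag tag;
  explicit ObjectCell(ObjectTag t) : tag(t) {}
  virtual ~ObjectCell() = default;
};

// Objects are cheap to copy: a copy shares the underlying cell.
struct Object {
  std::shared_ptr<ObjectCell> ptr;
};

// Hypothetical cell for an ADT value: a constructor tag plus its fields.
struct DatatypeCell : ObjectCell {
  std::size_t constructor_tag;
  std::vector<Object> fields;
  DatatypeCell(std::size_t t, std::vector<Object> f)
      : ObjectCell(ObjectTag::kDatatype), constructor_tag(t), fields(std::move(f)) {}
};

inline Object MakeDatatype(std::size_t tag, std::vector<Object> fields) {
  return Object{std::make_shared<DatatypeCell>(tag, std::move(fields))};
}

// Checked downcast: inspect the tag before touching ADT-specific members.
inline const DatatypeCell& AsDatatype(const Object& obj) {
  assert(obj.ptr && obj.ptr->tag == ObjectTag::kDatatype);
  return static_cast<const DatatypeCell&>(*obj.ptr);
}
```

In this model, `GetField` and `GetTag` reduce to reading `fields[i]` and `constructor_tag` after the tag check, and storing an object as a field only bumps a reference count.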

Stack and State
~~~~~~~~~~~~~~~

The Relay VM maintains a stack of frames, each of which contains information about how to resume the
previous call. Registers are allocated in a contiguous space (virtual register file) for each function.

We keep track of the set of Relay functions we have called, a pointer into the current function's bytecode, and an offset into that bytecode (known as the program counter).

::

    struct VirtualMachine {
      ...
      std::vector<VMFrame> frames;
      ...
      // Current function.
      size_t func_index;
      // Pointer into the current function's instructions.
      const Instruction* code;
      // Current program counter relative to the code pointer.
      size_t pc;
      ...
    };
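The way frames let the VM resume a caller can be sketched as below. The contents of `VMFrame` are not spelled out in this document, so the fields of `FrameSketch` (the caller's function index, return program counter, and register file) are assumptions, as are the names `MiniVM`, `PushFrame`, and `PopFrame`. Registers hold plain integers to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Value = std::int64_t;  // stand-in for a VM object

// Saved caller state: enough to resume the caller after the callee returns.
struct FrameSketch {
  std::size_t func_index;
  std::size_t return_pc;
  std::vector<Value> register_file;
};

struct MiniVM {
  std::vector<FrameSketch> frames;
  std::size_t func_index = 0;
  std::size_t pc = 0;
  std::vector<Value> registers;  // register file of the running function

  // Entering a call: save the caller's state, then give the callee a
  // fresh, contiguous register file.
  void PushFrame(std::size_t callee_index, std::size_t callee_register_count) {
    frames.push_back(FrameSketch{func_index, pc + 1, std::move(registers)});
    func_index = callee_index;
    pc = 0;
    registers.assign(callee_register_count, 0);
  }

  // Returning: restore the caller's function, program counter, and registers.
  void PopFrame() {
    assert(!frames.empty());
    FrameSketch frame = std::move(frames.back());
    frames.pop_back();
    func_index = frame.func_index;
    pc = frame.return_pc;
    registers = std::move(frame.register_file);
  }
};
```

Keeping one register file per call means recursion needs no special handling: each activation of a function simply gets its own frame.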

Dispatch Loop
~~~~~~~~~~~~~
A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
virtual machine, but we have experimentally found this not to be the case for Relay. We have implemented
a simple `switch`/`goto` dispatch loop which dispatches based on instruction op code.

This loop is implemented by `VirtualMachine::Run()`.
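The shape of a `switch`-based dispatch loop is illustrated by the toy interpreter below. It is not the code in `VirtualMachine::Run()`: it handles only a handful of instructions, stores integers in registers, and uses a simplified fixed-width `Instr` whose four generic operands are an assumption of this sketch. Jump offsets are taken relative to the jumping instruction, which is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Word = std::int64_t;  // stand-in for both operands and register contents

enum class Opcode { LoadConsti, If, Goto, Ret, Fatal };

// Simplified fixed-width instruction; operand meaning depends on `op`.
struct Instr {
  Opcode op;
  Word a, b, c, d;
};

inline Word Run(const std::vector<Instr>& code, std::size_t num_registers) {
  std::vector<Word> regs(num_registers, 0);
  // Bounds-checked register access.
  auto reg = [&regs](Word r) -> Word& { return regs.at(static_cast<std::size_t>(r)); };
  std::size_t pc = 0;
  while (pc < code.size()) {
    const Instr& instr = code[pc];
    switch (instr.op) {
      case Opcode::LoadConsti:  // a = val, b = dst
        reg(instr.b) = instr.a;
        pc += 1;
        break;
      case Opcode::If: {  // a = test, b = target, c = true_offset, d = false_offset
        bool equal = reg(instr.a) == reg(instr.b);
        pc += static_cast<std::size_t>(equal ? instr.c : instr.d);
        break;
      }
      case Opcode::Goto:  // a = pc_offset
        pc += static_cast<std::size_t>(instr.a);
        break;
      case Opcode::Ret:  // a = result
        return reg(instr.a);
      case Opcode::Fatal:
        throw std::runtime_error("Fatal instruction executed");
    }
  }
  throw std::runtime_error("ran past the end of the bytecode");
}

// Example program: returns 10 if x == y, otherwise 20.
inline std::vector<Instr> BranchProgram(Word x, Word y) {
  return {
      {Opcode::LoadConsti, x, 0, 0, 0},   // r0 = x
      {Opcode::LoadConsti, y, 1, 0, 0},   // r1 = y
      {Opcode::If, 0, 1, 1, 3},           // equal: pc += 1, otherwise pc += 3
      {Opcode::LoadConsti, 10, 2, 0, 0},  // r2 = 10
      {Opcode::Goto, 2, 0, 0, 0},         // skip the else branch
      {Opcode::LoadConsti, 20, 2, 0, 0},  // r2 = 20
      {Opcode::Ret, 2, 0, 0, 0},          // return r2
  };
}
```

Each iteration does a single branch on the op code and then the instruction's work; when that work is a large operator call, the cost of the branch itself is negligible, which matches the observation above.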

VM Compiler
~~~~~~~~~~~

An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a `tvm::relay::Module` into a `tvm::relay::vm::VirtualMachine`. The virtual
machine contains a set of compiled functions, each represented by a `tvm::relay::vm::Function`. Each function contains metadata about the function as well as its compiled bytecode. For full definitions of the data structures see `vm.h`.

Optimizations
~~~~~~~~~~~~~

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to
the new pass manager (#2546) before merging.

Optimizations marked with `TODO` are not implemented yet.

- A-Normal Form
- Lambda Lift (see `src/relay/vm/lambda_lift.cc`)
- Inline Primitives (see `src/relay/vm/inline_primitives.cc`)
- Inliner (see `src/relay/pass/inliner.cc`)
- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`)
- ADT Tag Allocation (see `src/relay/backend/vm/compiler.cc`)
- Tail Call Optimization (TODO)
- Liveness Analysis (TODO)

Serialization
~~~~~~~~~~~~~

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce both the bytecode and its serialization, as well as VM-level serialization. The design premise is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions
~~~~~~~~~~~~~~~~~~~~

How do we handle dynamic shapes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TODO

How can we modify the VM to support JIT compilation of certain code paths?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the code generation space there are still many trade-offs to be analyzed, and the VM is designed
to be very flexible so we can modify it for future experiments.

How do we support heterogeneous execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Heterogeneous execution should work out of the box, assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes.