..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

Putting the VM in TVM: The Relay Virtual Machine
================================================

Relay, a new program representation, has enabled the representation and optimization of
a great breadth of machine learning programs.
Unfortunately, by supporting a more expressive set of programs, we have
introduced several new execution challenges.

Relay's interpreter can execute the full language but has notable limitations
that make it unsuited for production deployments. It executes programs by traversing
the AST, an approach that is conceptually simple but inefficient, as the traversal
heavily relies on indirection.

There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
for these, but none of them is sufficiently compelling or optimized.

The second execution mechanism is the existing graph runtime. In order to target Relay
programs to this, we compile a small subset of them to the old graph format and execute
them on the runtime. The graph runtime provides fast execution, but only for a very limited
subset of Relay programs.

An alternative, but non-standard, approach is Relay's ahead-of-time compiler,
which compiles a Relay program into a shared library containing an ahead-of-time
implementation. The ahead-of-time compiler provides compelling performance
but is difficult to extend and instrument, which can only be done by modifying the
code generation and optimization mechanisms.

The Relay virtual machine is intended to be a framework that balances these competing
approaches, providing a dynamic execution environment which can be extended, instrumented,
and integrated with other approaches like ahead-of-time compilation via a flexible extension
mechanism.

The virtual machine is designed to strike a balance between performance and flexibility
when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems,
and there have been various virtual machine designs for both full-fledged
and embedded programming languages.
Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
In the context of machine learning we manipulate primarily tensor values, using a (relatively)
small number of high-level instructions. ML programs' cost centers are expensive operator invocations,
such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
micro-optimizations present in scalar VMs are dramatically less important.

TVM has provided strong support for vision models,
but we want to grow to support a wider variety of models.
The graph runtime is able to utilize the fully static nature of the input graphs to perform
aggressive optimizations such as fully static allocation and optimal memory reuse.
When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
allocation, we must change how execution works. A virtual machine for Relay is a natural choice.

The rest of this document provides a high-level overview of the Relay
virtual machine design and its instruction set.

Design
------

The VM's design is focused on simplicity without sacrificing performance.
In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM.

In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
reuse of static fragments, and the ability to handle dynamic shapes (e.g., jagged tensors).

Instruction Set
~~~~~~~~~~~~~~~

The choices of an instruction set and instruction representation are the most critical design decisions
for a VM. The current representation of the instructions is a tagged union containing the op-code and
the data payload. An important design decision is the level of abstraction of the instructions
(RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding).
The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length
due to the inclusion of the shape as part of the instruction. The current instruction set is very
high-level and corresponds roughly to high-level operations in Relay.

Ret
^^^
**Arguments**:
::

  RegName dst
  RegName result

Returns the object in register ``result`` to the caller's register ``dst``.

InvokePacked
^^^^^^^^^^^^
**Arguments**:
::

  Index packed_index
  Index arity
  Index output_size
  RegName* packed_args

Invoke the packed function denoted by ``packed_index``. The ``arity``
and ``output_size`` are used to inform the VM how many inputs and
outputs to expect. ``packed_args`` stores the list of argument registers. Note that ``Index``
is an alias of ``int64_t``; it is used in other instructions as well.
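
To make the calling convention concrete, the following is a hedged sketch of how an
``InvokePacked``-style call can be made through TVM's ``PackedFunc`` interface: inputs and
pre-allocated outputs are both passed as arguments, and the kernel writes its results into the
output tensors. This illustrates the destination-passing convention rather than reproducing the
VM's actual implementation.

.. code-block:: c

    #include <tvm/runtime/ndarray.h>
    #include <tvm/runtime/packed_func.h>

    #include <vector>

    // Sketch: call a packed function whose trailing arguments are the outputs.
    void InvokePackedSketch(const tvm::runtime::PackedFunc& pf,
                            const std::vector<tvm::runtime::NDArray>& args) {
      std::vector<TVMValue> values(args.size());
      std::vector<int> codes(args.size());
      tvm::runtime::TVMArgsSetter setter(values.data(), codes.data());
      for (size_t i = 0; i < args.size(); ++i) {
        setter(i, args[i]);  // each argument register holds an NDArray
      }
      tvm::runtime::TVMRetValue rv;
      pf.CallPacked(tvm::runtime::TVMArgs(values.data(), codes.data(),
                                          static_cast<int>(args.size())),
                    &rv);
      // The outputs are written in place; the packed call has no meaningful return value.
    }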

AllocTensor
^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName storage
  uint32_t ndim
  int64_t* shape
  DLDataType dtype

Allocate a tensor value using the constant shape (stored in ``shape``) and ``dtype``
from the given storage block, ``storage``. The result is saved to register ``dst``.

AllocTensorReg
^^^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName storage
  RegName shape_register
  DLDataType dtype

Allocate a tensor value of the appropriate shape (stored in ``shape_register``)
and ``dtype`` from the given storage block (stored in ``storage``). The result is saved to register ``dst``.

AllocStorage
^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName size
  RegName alignment
  DLDataType dtype_hint

Allocate a storage block with the given ``size``, ``alignment``, and data type hint ``dtype_hint``.
The allocated storage block is stored in register ``dst``.

AllocADT
^^^^^^^^
**Arguments**:
::

  RegName dst
  Index tag
  Index num_fields
  RegName* datatype_fields

Allocate an ADT object with tag ``tag``, using the ``num_fields`` entries
from the registers listed in ``datatype_fields``. The result is saved to register ``dst``.

AllocClosure
^^^^^^^^^^^^
**Arguments**:
::

  RegName dst
  Index clo_index
  Index num_freevar
  RegName* free_vars

Allocate a closure with the VMFunction at ``clo_index`` as
its code, capturing the ``num_freevar`` entries from the registers listed in
``free_vars``. The result is saved to register ``dst``.

GetField
^^^^^^^^
**Arguments**:
::

  RegName dst
  RegName object
  Index field_index

Get the field value with index ``field_index`` from ``object`` and save the result to register ``dst``.

If
^^
**Arguments**:
::

  RegName test
  RegName target
  Index true_offset
  Index false_offset

Check if the object at register ``test`` is equal to ``target``.
If they are equal, make a relative jump of ``true_offset``; otherwise, make a relative
jump of ``false_offset``.

GetTag
^^^^^^
**Arguments**:
::

  RegName object
  RegName dst

Get the object tag for the ADT object in register ``object`` and save the result to register ``dst``.

Fatal
^^^^^
Fail the virtual machine execution.

Goto
^^^^
**Arguments**:
::

  Index pc_offset

Relative unconditional jump by ``pc_offset``.

Invoke
^^^^^^
**Arguments**:
::

  Index func_index

Invoke the function at ``func_index``, consuming the number of arguments specified in the
VMFunction's arity field.

InvokeClosure
^^^^^^^^^^^^^
**Arguments**:
::

  RegName closure
  Index num_closure_args
  RegName* closure_args

Invoke ``closure``, consuming the number of arguments declared in the closure's VMFunction.

LoadConst
^^^^^^^^^
**Arguments**:
::

  RegName dst
  Index const_index

Load the constant at ``const_index`` from the constant pool. The result is saved to register ``dst``.

LoadConsti
^^^^^^^^^^
**Arguments**:
::

  Index val
  RegName dst

Load the constant integer ``val`` to register ``dst``. The result is a 0-rank tensor.

Object Representation
~~~~~~~~~~~~~~~~~~~~~
We leverage the object protocol to represent the objects that are used by the
VM.

Currently, three types of objects, ``NDArray``, ``ADT``, and ``Closure``, are used
to represent tensor, tuple/list, and closure data, respectively. More details
for each of them can be found at `include/tvm/runtime/ndarray.h`_,
`include/tvm/runtime/vm/vm.h`_, and `include/tvm/runtime/container.h`_, respectively.

.. _include/tvm/runtime/ndarray.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/ndarray.h

.. _include/tvm/runtime/vm/vm.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/vm/vm.h

.. _include/tvm/runtime/container.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/container.h
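
As a rough illustration of how these objects are created and accessed from C++, the snippet below
allocates an ``NDArray`` and wraps it in an ``ADT`` tuple value using the runtime headers listed
above. This is a minimal sketch; the exact constructors and helpers available may differ between
TVM versions.

.. code-block:: c

    #include <tvm/runtime/container.h>
    #include <tvm/runtime/ndarray.h>

    #include <vector>

    void ObjectSketch() {
      // A 2x3 float32 tensor on CPU, represented as an NDArray object.
      tvm::runtime::NDArray tensor = tvm::runtime::NDArray::Empty(
          {2, 3}, DLDataType{kDLFloat, 32, 1}, DLContext{kDLCPU, 0});

      // A tuple (an ADT with the reserved tuple tag) holding two tensors.
      tvm::runtime::ADT tuple = tvm::runtime::ADT::Tuple(
          std::vector<tvm::runtime::ObjectRef>{tensor, tensor});

      // Fields are accessed by index, mirroring the GetField instruction.
      tvm::runtime::ObjectRef first = tuple[0];
      (void)first;
    }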

Stack and State
~~~~~~~~~~~~~~~

The Relay VM maintains a stack frame, which contains information about how to resume the
previous call. Registers are allocated in a contiguous space (virtual register file) for each function.

We keep track of the set of Relay functions we have called, a pointer into the current function's bytecode,
and an offset into that bytecode (known as the program counter).

.. code-block:: c

    struct VirtualMachine {
      ...
      std::vector<VMFrame> frames;
      ...
      // Current function.
      size_t func_index;
      // Pointer into the current function's instructions.
      const Instruction* code;
      // Current program counter relative to the code pointer.
      size_t pc;
      ...
    };
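
For completeness, a call frame might look roughly like the sketch below. This is an illustration of
the idea rather than the exact definition; consult ``include/tvm/runtime/vm/vm.h`` for the
authoritative fields.

.. code-block:: c

    #include <tvm/runtime/vm/vm.h>

    #include <vector>

    // Sketch only: the bookkeeping needed to resume the caller after a Ret.
    struct VMFrameSketch {
      // Which function we return into, and where in its bytecode.
      size_t func_index;
      size_t pc;
      const tvm::runtime::vm::Instruction* code;
      // The caller's slice of the virtual register file.
      std::vector<tvm::runtime::ObjectRef> register_file;
      // The caller register that receives the callee's return value (see Ret).
      tvm::runtime::vm::RegName caller_return_register;
    };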


Dispatch Loop
~~~~~~~~~~~~~
A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
virtual machine, but we have experimentally found this not to be the case for Relay. For now we have implemented
a simple ``switch``/``goto`` dispatch loop which dispatches based on the instruction op-code.

This loop is implemented by ``VirtualMachine::Run()``.
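
To illustrate the shape of such a loop (and not the actual body of ``VirtualMachine::Run()``), the
following self-contained toy dispatches a scalar bytecode with a ``switch`` over the op-code. The
instruction set here is invented purely for the example.

.. code-block:: c

    #include <cstdint>

    #include <vector>

    // Toy instruction set, unrelated to Relay's: just enough to show dispatch.
    enum class Op { Add, Goto, Ret };

    struct Instr {
      Op op;
      int64_t a, b, dst;  // register operands (meaning depends on op)
      int64_t pc_offset;  // used by Goto: relative unconditional jump
    };

    int64_t Dispatch(const std::vector<Instr>& code, std::vector<int64_t>& regs) {
      size_t pc = 0;
      while (true) {
        const Instr& instr = code[pc];
        switch (instr.op) {
          case Op::Add:  // a stand-in for an expensive operator invocation
            regs[instr.dst] = regs[instr.a] + regs[instr.b];
            ++pc;
            break;
          case Op::Goto:
            pc += instr.pc_offset;
            break;
          case Op::Ret:  // return the value held in register `a`
            return regs[instr.a];
        }
      }
    }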

VM Compiler
~~~~~~~~~~~

An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a ``tvm::relay::Module`` into a ``tvm::relay::vm::Executable``. The executable
contains a set of compiled functions, each represented by a ``tvm::relay::vm::Function``.
The functions contain metadata about the function as well as its compiled bytecode. The emitted executable
object can then be loaded and run by a ``tvm::relay::vm::VirtualMachine`` object. For full definitions of the
data structures, please see `include/tvm/runtime/vm/executable.h`_ and `include/tvm/runtime/vm/vm.h`_.

.. _include/tvm/runtime/vm/executable.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/vm/executable.h
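
At the runtime end, loading and running such an executable looks roughly like the sketch below. The
method names (``LoadExecutable``, ``Init``, ``Invoke``) are taken from the headers linked above but
should be treated as assumptions here, since the exact runtime API may differ between TVM versions.

.. code-block:: c

    #include <tvm/runtime/vm/executable.h>
    #include <tvm/runtime/vm/vm.h>

    #include <vector>

    // Sketch: run an already-compiled Executable on CPU and return the result.
    tvm::runtime::ObjectRef RunExecutableSketch(
        const tvm::runtime::vm::Executable* exec,
        const std::vector<tvm::runtime::ObjectRef>& args) {
      tvm::runtime::vm::VirtualMachine vm;
      vm.LoadExecutable(exec);           // attach the compiled functions and bytecode
      vm.Init({TVMContext{kDLCPU, 0}});  // one context per device used by the program
      return vm.Invoke("main", args);    // invoke the entry VMFunction by name
    }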

Optimizations
~~~~~~~~~~~~~

There are quite a few optimizations required by the VM compiler. Each of them
is implemented as a pass which is managed by the Relay pass manager.

Optimizations marked with ``TODO`` are not implemented yet.

- A-Normal Form
- Lambda Lift (see `src/relay/backend/vm/lambda_lift.cc`_)
- Inline Primitives (see `src/relay/backend/vm/inline_primitives.cc`_)
- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`_)
- Tail Call Optimization (TODO)
- Liveness Analysis (TODO)

.. _src/relay/backend/vm/lambda_lift.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/lambda_lift.cc

.. _src/relay/backend/vm/inline_primitives.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/inline_primitives.cc

.. _src/relay/backend/vm/compiler.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/compiler.cc

Serialization
~~~~~~~~~~~~~

Serializing and deserializing the executable generated by the Relay VM compiler is a must, as
we may want to save the model to disk and perform inference later. Previously, Relay produced
a serialized form in a JSON file for the graph runtime. However, the same format is not directly
applicable to the VM, as it emits bytecode instead of graph-style programs.
Serialization of an executable essentially needs to handle both model-specific data
(i.e. weights and kernels) and VM-related data (i.e. bytecode and global function names).

For kernels, we can conveniently leverage existing TVM infrastructure to save and load
the compiled library module. Here we focus only on serializing the remaining
components in a binary format that is organized with the following sections, in order.

- Global section. This section contains the globals (function names) used by the virtual machine.

- Constant section. This section is used to store the constant pool (i.e. weights of the model)
  for a virtual machine.

- Primitive name section. This section is introduced to accommodate the list of primitive
  operator names that will be invoked by the virtual machine, i.e. the names
  starting with ``fused_``. The primitive names are used as symbols to look up
  function pointers in the compiled kernel library.

- Code section. The VM functions, including bytecode, reside in this section. The dispatch
  loop iterates through this section to fetch instructions for execution.

Hence, unlike the graph runtime artifact that contains the weights (.params), the graph JSON (.json),
and the compiled kernel library (.so), the serialized executable artifact is composed of the Relay
object file (.ro) and the compiled kernel library (.so).
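
Purely as a mental model of what the ``.ro`` file carries (the real format is a binary stream
written by the runtime's save/load code, not a C struct), the four sections can be pictured as
follows:

.. code-block:: c

    #include <cstdint>

    #include <string>
    #include <vector>

    // Conceptual sketch only; the field types are illustrative.
    struct SerializedExecutableSketch {
      // Global section: names of the VM's global functions.
      std::vector<std::string> globals;
      // Constant section: the constant pool, i.e. the model weights.
      std::vector<std::vector<uint8_t>> constants;
      // Primitive name section: symbols (e.g. "fused_...") used to look up
      // kernels in the accompanying .so library.
      std::vector<std::string> primitive_names;
      // Code section: the serialized VM functions, including their bytecode.
      std::vector<std::vector<uint8_t>> code;
    };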

A ``save`` function is implemented to store the executable to disk and
serialize it into the above format. A corresponding ``load_exec`` function is used to
load the serialized kernel binary and executable-related binary code, which can then be used to
instantiate a VM object. Please refer to the `test_vm_serialization.py`_ file for more
examples.

.. _test_vm_serialization.py: https://github.com/apache/incubator-tvm/blob/master/tests/python/relay/test_vm_serialization.py

Unresolved Questions
~~~~~~~~~~~~~~~~~~~~

How do we handle dynamic shapes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TODO

How can we modify the VM to support JIT compilation of certain code paths?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the code generation space there are still many tradeoffs to be analyzed, and the VM is designed
to be very flexible so that we can modify it for future experiments.

How do we support heterogeneous execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Heterogeneous execution should work out of the box, assuming we have annotated the appropriate device copies.
In order to do this properly, we need to run the device annotation and copying passes.