.. SPDX-License-Identifier: GPL-2.0

==========
Nested VMX
==========

Overview
---------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels: the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.


Running nested VMX
------------------

The nested VMX feature has been enabled by default since Linux kernel v4.20.
For older Linux kernels, it can be enabled by giving the "nested=1" option to
the kvm-intel module.
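
Whether nesting is currently enabled can be read back from the module
parameter. The following is a minimal sketch, assuming the parameter is
exposed at /sys/module/kvm_intel/parameters/nested and reads as "Y" (or "1"
on some kernels) when enabled. The helper names are hypothetical and are not
part of KVM.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed sysfs location of the kvm-intel "nested" parameter. */
#define NESTED_PARAM_PATH "/sys/module/kvm_intel/parameters/nested"

/*
 * Hypothetical helper: interpret the text read from the parameter file.
 * Boolean module parameters read back as "Y"/"N"; "1" is accepted too.
 */
bool nested_value_enabled(const char *value)
{
	return value != NULL && (value[0] == 'Y' || value[0] == '1');
}

/*
 * Returns 1 if nesting is enabled, 0 if it is disabled, and -1 if the
 * parameter cannot be read (for example, kvm-intel is not loaded).
 */
int nested_vmx_enabled(void)
{
	char buf[8] = "";
	FILE *f = fopen(NESTED_PARAM_PATH, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return nested_value_enabled(buf) ? 1 : 0;
}
```

A return value of -1 only means the file could not be read; it says nothing
about whether the hardware supports VMX.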


No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options:

     - ``-cpu host`` (the emulated CPU has all features of the real CPU)

     - ``-cpu qemu64,+vmx`` (adds just the vmx feature to a named CPU type)
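
From inside L1, whether the VMX feature was actually exposed can be verified
with CPUID: leaf 1 reports VMX support in bit 5 of ECX. The following is a
minimal sketch, assuming a GCC or Clang toolchain (for ``<cpuid.h>``); the
function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID leaf 1 reports VMX support in bit 5 of ECX. */
#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

/* Hypothetical helper: test the VMX bit in a CPUID.1:ECX value. */
bool ecx_has_vmx(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_VMX) != 0;
}

/* Returns true if the CPU this code runs on advertises VMX. */
bool cpu_advertises_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the requested leaf is unsupported. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx_has_vmx(ecx);
#else
	return false;	/* not an x86 build */
#endif
}
```

Run in a guest started with the default qemu64 CPU type, this check is
expected to report no VMX; with one of the options above it should report
VMX, provided nesting is enabled in L0.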


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure: it is struct vmcs12 from arch/x86/kvm/vmx.c.
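
To illustrate the two visible fields, the sketch below shows how a guest
hypervisor might prepare a VMCS region before making it current: it zeroes the
region and stores the revision identifier, which the Intel manual defines as
bits 30:0 of the IA32_VMX_BASIC MSR. This is a sketch under stated
assumptions, not KVM's code: the names vmcs_header, vmcs_region_init and
VMCS_REGION_SIZE are hypothetical, the region is assumed to be one 4 KiB page,
and the MSR value is passed in because reading it requires privileged code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VMCS_REGION_SIZE 4096u	/* assumption: one 4 KiB page */

/* The only two fields that software accesses directly. */
struct vmcs_header {
	uint32_t revision_id;
	uint32_t abort;
};

/* Bits 30:0 of IA32_VMX_BASIC hold the VMCS revision identifier. */
uint32_t vmcs_revision_from_basic(uint64_t vmx_basic)
{
	return (uint32_t)(vmx_basic & 0x7fffffffu);
}

/* Zero the region and fill in the header; everything else stays opaque. */
void vmcs_region_init(void *region, uint64_t vmx_basic)
{
	struct vmcs_header hdr = {
		.revision_id = vmcs_revision_from_basic(vmx_basic),
		.abort = 0,
	};

	assert(region != NULL);
	memset(region, 0, VMCS_REGION_SIZE);
	memcpy(region, &hdr, sizeof(hdr));
}
```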

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

::

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};
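
Because the layout of struct vmcs12 matters for live migration, it can help
to pin member offsets at compile time so that an accidental change fails the
build. The following is a minimal sketch of that idea, not the kernel's own
check: it copies only the first few members of the listing above into a
hypothetical struct vmcs12_prefix, uses fixed-width userspace types in place
of u32 and u64, and assumes a GCC or Clang toolchain for the packed attribute.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Illustrative copy of the leading members of struct vmcs12 only. */
struct __attribute__((packed)) vmcs12_prefix {
	uint32_t revision_id;
	uint32_t abort;
	uint32_t launch_state;
	uint32_t padding[7];
	uint64_t io_bitmap_a;
	uint64_t io_bitmap_b;
};

/* Three u32 fields plus seven u32 of padding come before io_bitmap_a. */
static_assert(offsetof(struct vmcs12_prefix, revision_id) == 0,
	      "revision_id must be the first field");
static_assert(offsetof(struct vmcs12_prefix, abort) == 4,
	      "abort must directly follow revision_id");
static_assert(offsetof(struct vmcs12_prefix, launch_state) == 8,
	      "launch_state offset changed");
static_assert(offsetof(struct vmcs12_prefix, io_bitmap_a) == 40,
	      "io_bitmap_a offset changed");
static_assert(offsetof(struct vmcs12_prefix, io_bitmap_b) == 48,
	      "io_bitmap_b offset changed");
```

Inserting a new member ahead of io_bitmap_a, instead of consuming part of the
padding array, would shift the later offsets and trip these assertions.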


Authors
-------

These patches were written by:
    - Abel Gordon, abelg <at> il.ibm.com
    - Nadav Har'El, nyh <at> il.ibm.com
    - Orit Wasserman, oritw <at> il.ibm.com
    - Ben-Ami Yassor, benami <at> il.ibm.com
    - Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
    - Anthony Liguori, aliguori <at> us.ibm.com
    - Mike Day, mdday <at> us.ibm.com
    - Michael Factor, factor <at> il.ibm.com
    - Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
    - Avi Kivity, avi <at> redhat.com
    - Gleb Natapov, gleb <at> redhat.com
    - Marcelo Tosatti, mtosatti <at> redhat.com
    - Kevin Tian, kevin.tian <at> intel.com
    - and others.