(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch

Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>

An *exhaustive* paper (2010) shows additional performance details
linked on the QEMU wiki above.

Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO

Introduction:
=============

RDMA helps make your migration more deterministic under heavy load because
of its significantly lower latency and higher throughput compared to TCP/IP.
This is because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take an
unpredictable amount of time to complete if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.

RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
over Converged Ethernet) and Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because it
uses the OpenFabrics OFED software stack, which abstracts out the
programming model irrespective of the underlying hardware.

Refer to openfabrics.org or your respective RDMA hardware vendor for
instructions on how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.

BEFORE RUNNING:
===============

Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.

Experimental: Next, decide if you want dynamic page registration,
which is the default, or whether to pin all memory up front instead.
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then enabling the rdma-pin-all capability will cause
all 8GB to be pinned and resident in memory. This capability mostly
affects the bulk-phase round of the migration and can be enabled for
extremely high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.

Note: for very large virtual machines (hundreds of GBs), pinning
*all* of the memory of your virtual machine in the kernel is very expensive
and may extend the initial bulk iteration time by many seconds,
thus extending the total migration time. However, this will not
affect the determinism or predictability of your migration; you will
still gain the benefits of pinning in advance with RDMA.

RUNNING:
========

First, set the migration speed to match your hardware's capabilities:

QEMU Monitor Command:
$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device

Next, on the destination machine, add the following to the QEMU command line:

qemu ..... -incoming rdma:host:port

Finally, perform the actual migration on the source machine:

QEMU Monitor Command:
$ migrate -d rdma:host:port

PERFORMANCE
===========

Here is a brief summary of total migration time and downtime using RDMA,
measured over a 40 Gbps infiniband link while performing a worst-case
stress test in an 8GB RAM virtual machine.

The stress test was run using the following commands:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM example with all 8GB of memory in
active use but the VM itself completely idle, using the same 40 Gbps
infiniband link:

1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.

RDMA Migration Protocol Description:
====================================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is now transmitted using a formal
protocol consisting of infiniband SEND messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests.
3. Receiver does listen()
4. Sender does connect()
5. Receiver does accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together they are transmitted
as a single SEND message).

Header:
    * Length               (of the data portion, uint32, network byte order)
    * Type                 (what command to perform, uint32, network byte order)
    * Repeat               (Number of commands in data portion, same type only)

The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol is compatible across multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol
check this field and register all requests found in the array of commands located
in the data portion and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.

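As a rough illustration of the header layout described above, the
following Python sketch packs and unpacks the three fields. It assumes
that 'Repeat' is also a uint32 in network byte order, which the text
above does not state explicitly. The function names are hypothetical
and are not part of QEMU.

```python
import struct

# Length, Type, Repeat: three unsigned 32-bit integers, network byte order.
# Assumption: the width of 'Repeat' is not given above; uint32 is used here.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit described above


def pack_control_message(msg_type, data=b"", repeat=1):
    """Build one SEND payload: header followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(data), msg_type, repeat) + data


def unpack_control_message(payload):
    """Split a SEND payload back into (type, repeat, data)."""
    length, msg_type, repeat = HEADER.unpack_from(payload)
    data = payload[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return msg_type, repeat, data
```

The header and data travel in one buffer here because, as noted above,
both portions are transmitted together as a single SEND message.
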
The 'type' field has 12 different command values:
     1. Unused
     2. Error                      (sent to the source during bad things)
     3. Ready                      (control-channel is available)
     4. QEMU File                  (for sending non-live device state)
     5. RAM Blocks request         (used right after connection setup)
     6. RAM Blocks result          (used right after connection setup)
     7. Compress page              (zap zero page and skip registration)
     8. Register request           (dynamic chunk registration)
     9. Register result            ('rkey' to be used by sender)
    10. Register finished          (registration for current iteration finished)
    11. Unregister request         (unpin previously registered memory)
    12. Unregister finished        (confirmation that unpin completed)

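The command list above could be written down as an enumeration, for
example as in the Python sketch below. The member names are hypothetical,
and the numeric values simply follow the numbering of the list; the
values actually placed on the wire are not stated in this document.
The request/reply pairing is one reading of the descriptions in the list.

```python
from enum import IntEnum


class ControlType(IntEnum):
    """Command types, numbered as in the list above (assumed values)."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


# Requests that are answered by a matching reply type, as suggested
# by the descriptions in the list.
REPLY_FOR = {
    ControlType.RAM_BLOCKS_REQUEST: ControlType.RAM_BLOCKS_RESULT,
    ControlType.REGISTER_REQUEST: ControlType.REGISTER_RESULT,
    ControlType.UNREGISTER_REQUEST: ControlType.UNREGISTER_FINISHED,
}
```
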
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.

After connection setup, messages 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.

After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
   (that we have not yet transmitted), let's post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands #9 back to the sender which
   hold the rkey needed to perform RDMA. Note that the virtual address
   corresponding to this rkey was already exchanged at the beginning
   of the connection (described below).)

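The READY handshake described by the two functions above can be modelled
with a small single-threaded Python simulation. This is only a toy model
and does not use ibverbs: the Endpoint class, its counters and the helper
names are all hypothetical, and "blocking" on the CQ event channel is
replaced by popping from an in-memory queue. The receive side is split
into two helpers so that the two sides can be interleaved without threads.

```python
from collections import deque


class Endpoint:
    """One side of the control channel (toy model, not real ibverbs)."""

    def __init__(self, name):
        self.name = name
        self.posted_recvs = 0   # RQ work requests currently posted
        self.inbox = deque()    # SEND messages that have arrived
        self.peer = None

    def post_recv(self):
        self.posted_recvs += 1

    def send(self, msg):
        # A SEND consumes one receive work request on the peer's RQ.
        if self.peer.posted_recvs == 0:
            raise RuntimeError(self.peer.name + " has no RQ work request posted")
        self.peer.posted_recvs -= 1
        self.peer.inbox.append(msg)

    def wait(self):
        # Stand-in for blocking on the CQ event channel.
        return self.inbox.popleft()


def connect(a, b):
    a.peer, b.peer = b, a
    for side in (a, b):         # both sides post two RQ work requests
        side.post_recv()
        side.post_recv()


def recv_announce(receiver):
    receiver.send("READY")      # exchange_recv step 1
    receiver.post_recv()        # exchange_recv step 2


def exchange_send(sender, command):
    if sender.wait() != "READY":    # exchange_send steps 1 and 3
        raise ValueError("expected READY")
    sender.post_recv()              # replace the request READY used up
    sender.send(command)            # exchange_send step 4


def recv_complete(receiver, expected):
    command = receiver.wait()       # exchange_recv steps 3 and 4
    if command != expected:         # exchange_recv step 5
        raise ValueError("unexpected command " + command)
    return command
```

Calling connect(), recv_announce(), exchange_send() and recv_complete()
in that order carries one command across, and each side ends with the
same number of posted receive work requests it started with.
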
All of the remaining command types (not including 'ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins. This information includes
   a description of each RAMBlock on the server side as well as the virtual addresses
   and lengths of each RAMBlock. This is used by the client to determine the
   start and stop locations of chunks and how to register them dynamically
   before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces also call these functions (described below)
   when transmitting non-live state, such as devices, or to send
   their own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
   using chunk registration (or not checked at all and unconditionally
   written if chunk registration is disabled). This is accomplished using
   the "Compress" command listed above. If the page *has* been registered
   then we check the entire chunk for zero. Only if the entire chunk is
   zero do we send a compress command to zap the page on the other side.

Versioning and Capabilities
===========================

Current version of the protocol is version #1.

The same version applies to both protocol traffic and capabilities
negotiation (i.e. there is only one version number that is referred to
by all communication).

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

Header:
    * Version (protocol version validated before send/recv occurs),
                                               uint32, network byte order
    * Flags   (bitwise OR of each capability),
                                               uint32, network byte order

There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.

This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.

Currently there is only one capability in Version #1: dynamic page registration

Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.

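The Flags negotiation described above amounts to a bitwise AND on the
destination side. The Python sketch below illustrates it under several
assumptions: the bit position chosen for the capability, the set of
accepted versions, and all function names are hypothetical and are not
taken from QEMU.

```python
import struct

# Version and Flags, both uint32 in network byte order, as listed above.
PRIVATE_DATA = struct.Struct("!II")
KNOWN_VERSIONS = {1}              # assumption: only version #1 is valid
CAP_DYNAMIC_REGISTRATION = 0x01   # assumption: hypothetical bit position


def source_request(requested_flags, version=1):
    """Private data sent by the primary-VM at connection setup."""
    return PRIVATE_DATA.pack(version, requested_flags)


def destination_reply(request, supported_flags):
    """Return zero bits for every capability the destination lacks."""
    version, flags = PRIVATE_DATA.unpack(request)
    if version not in KNOWN_VERSIONS:
        raise ValueError("invalid protocol version %d" % version)
    return PRIVATE_DATA.pack(version, flags & supported_flags)


def source_accept(reply, requested_flags):
    """Capabilities the primary-VM keeps enabled after negotiation."""
    _version, granted = PRIVATE_DATA.unpack(reply)
    return requested_flags & granted
```

The whole exchange is 8 bytes, which fits comfortably in the private
data area whose size limit is mentioned above.
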
QEMUFileRDMA Interface:
=======================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from the control-channel's SEND
messages in memory.

Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.

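The holding-area behaviour described above can be sketched in a few lines
of Python. The class name and the fetch_message callback are hypothetical;
the callback stands in for receiving the next "QEMU File" control-channel
message. This sketch returns fewer bytes than requested when the holding
area runs short, which is an assumption about how a short read is handled.

```python
class HoldingArea:
    """Buffers whole control messages and hands them out as a bytestream."""

    def __init__(self, fetch_message):
        # fetch_message: callable returning the bytes of the next
        # "QEMU File" message (hypothetical stand-in for the SEND path).
        self._fetch = fetch_message
        self._held = b""

    def get_buffer(self, size):
        if not self._held:
            # Holding area is empty: ask for a new message to refill it.
            self._held = self._fetch()
        # Hand out up to 'size' bytes and keep the rest for the next pass.
        out, self._held = self._held[:size], self._held[size:]
        return out
```
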
Migration of VM's ram:
======================

At the beginning of the migration, (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses and possibly includes pre-registered RDMA keys in case dynamic
page registration was disabled on the server-side, otherwise not.

Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

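To make the chunk arithmetic concrete, the Python sketch below maps an
address inside a RAMBlock to its chunk and computes the chunk's start and
stop locations. It assumes that chunks are counted from the start of each
RAMBlock and that the last chunk may be shorter than 1 Megabyte; neither
detail is spelled out in this document, and the function names are
hypothetical.

```python
CHUNK_SIZE = 1 << 20  # 1 Megabyte, the hard-coded chunk size


def chunk_index(block_start, addr):
    """Chunk number containing 'addr' (assumed relative to the block start)."""
    if addr < block_start:
        raise ValueError("address is below the start of the block")
    return (addr - block_start) // CHUNK_SIZE


def chunk_bounds(block_start, block_length, index):
    """Start and stop addresses of a chunk, clipped to the end of the block."""
    start = block_start + index * CHUNK_SIZE
    block_end = block_start + block_length
    if start >= block_end:
        raise ValueError("chunk index is past the end of the block")
    return start, min(start + CHUNK_SIZE, block_end)
```

For a block of 2.5 Megabytes, chunk 2 would then cover only the final
half Megabyte.
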
When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm and pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.

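The batching rule above can be expressed as a simple predicate over the
sequence of chunks. The Python sketch below marks which RDMA Writes would
request a completion. It assumes a batch size of exactly 64 and that the
final chunk of a partial batch is also signaled; both are assumptions,
since the text only says the batch size is "about 64". The function name
is hypothetical.

```python
BATCH_SIZE = 64  # assumed exact value; the text says "about 64 chunks"


def signaled_flags(num_chunks, batch=BATCH_SIZE):
    """Return one boolean per chunk: True if its write is signaled."""
    flags = []
    for i in range(num_chunks):
        last_in_batch = (i + 1) % batch == 0
        last_overall = i == num_chunks - 1   # assumption: flush the tail
        flags.append(last_in_batch or last_overall)
    return flags
```

With 130 chunks this signals only three writes (chunks 63, 127 and 129),
so the completion queue sees three events instead of 130.
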
Error-handling:
===============

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode
we use for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket is broken during a non-RDMA based migration.

TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
   are not compatible with infiniband memory pinning and will result in
   an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
   the use of KSM and ballooning while using RDMA.
3. Some form of balloon-device usage tracking would also
   help alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
   requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
   hints about application behavior.
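
Regarding item 1 above, one way to notice a restrictive locked-memory
limit before starting a migration is to compare it against the amount of
memory that would be pinned. The Python sketch below uses the standard
'resource' module on a Unix-like host; the function names are
hypothetical, and this check is not something QEMU is stated to perform.
Note that getrlimit() reports bytes, whereas 'ulimit -l' in common shells
reports kilobytes.

```python
import resource


def memlock_allows(bytes_to_pin, soft_limit):
    """True if a locked-memory limit (in bytes) permits pinning this much."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return bytes_to_pin <= soft_limit


def current_memlock_limit():
    """Soft RLIMIT_MEMLOCK of the calling process, in bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft
```

For the 8GB example used earlier, memlock_allows(8 << 30,
current_memlock_limit()) would indicate whether the limit seen by the
calling process is high enough to pin the whole virtual machine.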