Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault Tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network transfer
effort during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before a Primary write request is written to the Secondary disk, the
       original sector content is read from the Secondary disk and
       buffered in the Disk buffer, but it does not overwrite existing
       sector content (from either "Secondary Write Requests" or a
       previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite the existing sector content in the buffer.
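
The four steps above can be modelled with a small in-memory sketch. This
is only a toy illustration, not QEMU code: the class and method names are
hypothetical, sectors are dictionary entries, and the read path (Disk
buffer first, then Secondary disk) is an assumption inferred from the
workflow rather than something stated here.

```python
class ReplicatedSecondary:
    """Toy model of the Secondary side: disks are sector -> data dicts."""

    def __init__(self, secondary_disk):
        self.secondary_disk = dict(secondary_disk)
        self.disk_buffer = {}

    def primary_write(self, sector, data):
        # Step 2: save the original content before it is overwritten,
        # unless the buffer already holds content for this sector.
        if sector not in self.disk_buffer:
            self.disk_buffer[sector] = self.secondary_disk.get(sector)
        # Step 3: speculative write-through to the Secondary disk.
        self.secondary_disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes only touch the buffer and always
        # replace whatever is buffered for the sector.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees the buffer on top of the disk.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.secondary_disk.get(sector)

    def checkpoint(self):
        # At a checkpoint the buffered contents are dropped.
        self.disk_buffer.clear()


if __name__ == "__main__":
    s = ReplicatedSecondary({0: "a", 1: "b"})
    s.primary_write(0, "A")      # Secondary VM still sees the old "a"
    s.secondary_write(1, "x")    # Secondary VM's own modification
    s.primary_write(1, "B")      # does not clobber the buffered "x"
    print(s.secondary_read(0), s.secondary_read(1))
    s.checkpoint()               # both VMs now agree on the disk content
    print(s.secondary_read(0), s.secondary_read(1))
```

After the checkpoint, reads fall through to the Secondary disk, which by
then holds every forwarded Primary write.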

== Architecture ==
Block replication is built from basic building blocks that already
exist in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||                                                           virtio-blk
        /        \        ||                                                               ^
   Primary    2 filter                                                                     |
     disk         ^                                                                   7 Quorum
                  |                                                                    /
                3 NBD  ------->  3 NBD                                                /
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||         blockdev-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between the primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named "replication") controls block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and its driver should support bdrv_make_empty() and backing files.

6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication
starts, the primary disk and secondary disk should contain the same data.

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.
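
The backing chain on the secondary (items 4 to 6) can be illustrated with
a toy overlay model. This is a sketch under stated assumptions, not QEMU
code: Layer, nbd_write and make_empty are hypothetical Python stand-ins,
and nbd_write only imitates what the text attributes to the
blockdev-backup job with sync=none (copy the old content into hidden-disk
before the NBD write lands on the secondary disk).

```python
class Layer:
    """Toy overlay: a sparse sector map on top of an optional backing layer."""

    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing
        self.sectors = {}

    def read(self, sector):
        # Walk down the backing chain until some layer has the sector.
        layer = self
        while layer is not None:
            if sector in layer.sectors:
                return layer.sectors[sector]
            layer = layer.backing
        return None  # not allocated in any layer

    def make_empty(self):
        # Stand-in for what bdrv_make_empty() is needed for: drop the overlay.
        self.sectors.clear()


secondary = Layer("secondary-disk")
hidden = Layer("hidden-disk", backing=secondary)
active = Layer("active-disk", backing=hidden)


def nbd_write(sector, data):
    """Primary write arriving through the NBD server (toy version)."""
    # Imitates backup sync=none: keep the first original copy in hidden-disk.
    if sector not in hidden.sectors:
        hidden.sectors[sector] = secondary.sectors.get(sector)
    secondary.sectors[sector] = data


if __name__ == "__main__":
    secondary.sectors[0] = "old"
    nbd_write(0, "new")
    print(active.read(0))        # secondary VM still reads the old content
    active.sectors[0] = "guest"  # secondary VM write lands in active-disk
    print(active.read(0))
    active.make_empty()          # checkpoint: empty both overlays
    hidden.make_empty()
    print(active.read(0))        # now falls through to the secondary disk
```

The model shows why both overlays need backing-file support (reads fall
through) and an emptying operation (overlays are discarded at each
checkpoint).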

== Failure Handling ==
Seven kinds of internal error can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if the active commit failed, the Secondary's write operation
uses the "replication failover failed" state to decide which target to
write to.
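
The mapping from error case to handling can be written down as a lookup
table. The enum and function names below are hypothetical and exist only
to restate the list above in executable form; they do not correspond to
identifiers in QEMU.

```python
from enum import Enum


class ReplicationError(Enum):
    PRIMARY_DISK_IO = 1
    FORWARD_PRIMARY_WRITE = 2
    BACKUP = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY = 6
    FAILOVER = 7


class Handling(Enum):
    REPORT_TO_DISK_LAYER = "report to disk layer"
    REPORT_TO_FT_HA_MANAGER = "report to FT/HA manager"
    USE_FAILOVER_FAILED_STATE = "use failover-failed state for writes"


# One entry per numbered case in the list above.
HANDLING = {
    ReplicationError.PRIMARY_DISK_IO: Handling.REPORT_TO_DISK_LAYER,
    ReplicationError.FORWARD_PRIMARY_WRITE: Handling.REPORT_TO_FT_HA_MANAGER,
    ReplicationError.BACKUP: Handling.REPORT_TO_FT_HA_MANAGER,
    ReplicationError.SECONDARY_DISK_IO: Handling.REPORT_TO_FT_HA_MANAGER,
    ReplicationError.ACTIVE_DISK_IO: Handling.REPORT_TO_DISK_LAYER,
    ReplicationError.MAKE_EMPTY: Handling.REPORT_TO_FT_HA_MANAGER,
    ReplicationError.FAILOVER: Handling.USE_FAILOVER_FAILED_STATE,
}


def handling_for(error):
    """Return how the given error case is handled."""
    return HANDLING[error]


if __name__ == "__main__":
    for error in ReplicationError:
        print(error.value, error.name, "->", handling_for(error).value)
```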

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication. It is called in the migration/checkpoint
   thread. replication_start_all() must be called in the secondary QEMU
   before it is called in the primary QEMU. The caller must hold the
   I/O mutex lock if it is in the migration/checkpoint thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state has been transferred to
   the Secondary QEMU. The Disk buffer is dropped in this interface.
   The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
c. replication_get_error_all()
   This interface is called to check whether an error has occurred in
   replication. The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. It flushes the Disk buffer into the
   Secondary Disk and stops block replication. If this API is used for
   anything other than failover, such as shutting down the guest, the VM
   should be stopped before calling it. The caller must hold the I/O
   mutex lock if it is in the migration/checkpoint thread.
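
The ordering constraints on these four interfaces can be sketched with
Python stand-ins. Nothing below is the real C API: ReplicationNode only
records calls, the helper functions are hypothetical, locking is left
out, and two points are assumptions of this sketch rather than statements
from this document: that the error check runs before each checkpoint and
that both sides call the checkpoint interface.

```python
class ReplicationNode:
    """Stand-in for one QEMU instance; records calls instead of doing I/O."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace      # shared list of (node name, call) tuples
        self.error = None       # set to simulate a replication error
        self.started = False

    def replication_start_all(self):
        self.started = True
        self.trace.append((self.name, "start"))

    def replication_get_error_all(self):
        return self.error

    def replication_do_checkpoint_all(self):
        if not self.started:
            raise RuntimeError("replication not started on " + self.name)
        self.trace.append((self.name, "checkpoint"))

    def replication_stop_all(self):
        self.started = False
        self.trace.append((self.name, "stop"))


def start_replication(primary, secondary):
    # The secondary must be started before the primary.
    secondary.replication_start_all()
    primary.replication_start_all()


def take_checkpoint(primary, secondary):
    """Return None on success, or the first error found (no checkpoint done)."""
    for node in (primary, secondary):
        error = node.replication_get_error_all()
        if error is not None:
            return error  # left to the FT/HA manager to act on
    for node in (primary, secondary):
        node.replication_do_checkpoint_all()
    return None


if __name__ == "__main__":
    trace = []
    primary = ReplicationNode("primary", trace)
    secondary = ReplicationNode("secondary", trace)
    start_replication(primary, secondary)
    take_checkpoint(primary, secondary)
    secondary.replication_stop_all()  # e.g. failover on the secondary
    for entry in trace:
        print(entry)
```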

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following QMP commands in the primary QEMU:
    { "execute": "human-monitor-command",
      "arguments": {
          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
          "parent": "colo1",
          "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the secondary physical machine's hostname or IP address.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive, and you should ignore the
     leading whitespace.
  5. The QMP commands must be run after the QMP commands have been run
     in the secondary QEMU.
  6. After primary failover, children.1 (the replication driver) needs
     to be removed.
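
As an illustration, the two primary-side QMP commands above can be
generated from the secondary's address. The helper name and its
parameters are hypothetical, and the host and port in the usage part are
made-up example values; only the command structure is taken from the
text above. Sending the commands over a QMP socket is left out.

```python
import json


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Build the QMP commands that attach the NBD child to the quorum."""
    options = ",".join([
        "driver=replication",
        "mode=primary",
        "file.driver=nbd",
        f"file.host={host}",
        f"file.port={port}",
        f"file.export={export}",   # must match the export on the secondary
        f"node-name={node_name}",
    ])
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": f"drive_add -n buddy {options}"}},
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


if __name__ == "__main__":
    # Example address only; replace with the secondary host and NBD port.
    for command in primary_qmp_commands("192.0.2.10", 8889):
        print(json.dumps(command))
```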

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following QMP commands in the secondary QEMU:
    { "execute": "nbd-server-start",
      "arguments": {
          "addr": {
              "type": "inet",
              "data": {
                  "host": "xxx",
                  "port": "xxx"
              }
          }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
          "device": "colo1",
          "writable": true
      }
    }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on the primary
     and the secondary.
  3. The QMP commands nbd-server-start and nbd-server-add must be run
     before running the QMP command migrate on the primary QEMU.
  4. The active disk, the hidden disk and the NBD target should all have
     the same length.
  5. It is better to put the active disk and the hidden disk on a ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.
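
Because each -drive value is one comma-separated argument (note 6), it
can help to assemble the secondary's options programmatically. The sketch
below is illustrative only: the helper names are hypothetical, the
default "virtio" for the if=xxx placeholder is an assumption, and the
image file names are the example names used above.

```python
def join_drive(*options):
    """Join option fragments into the single argument that -drive expects."""
    return ",".join(options)


def secondary_drive_args(secondary_image="1.raw",
                         active="active_disk.qcow2",
                         hidden="hidden_disk.qcow2",
                         if_type="virtio"):
    """Return the three -drive options as an argv-style list."""
    return [
        "-drive", join_drive("if=none", "driver=raw",
                             f"file.filename={secondary_image}", "id=colo1"),
        "-drive", join_drive("if=none", "id=childs1", "driver=replication",
                             "mode=secondary", "top-id=top-disk1",
                             f"file.file.filename={active}",
                             "file.driver=qcow2",
                             f"file.backing.file.filename={hidden}",
                             "file.backing.driver=qcow2",
                             "file.backing.backing=colo1"),
        "-drive", join_drive(f"if={if_type}", "driver=quorum",
                             "read-pattern=fifo", "id=top-disk1",
                             "vote-threshold=1", "children.0=childs1"),
    ]


def secondary_qmp_commands(host, port, device="colo1"):
    """Build the NBD server commands; the port is passed as a string."""
    return [
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": host, "port": str(port)}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


if __name__ == "__main__":
    print(" ".join(secondary_drive_args()))
    for command in secondary_qmp_commands("0.0.0.0", 8889):
        print(command)
```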

After Failover:
Primary:
  The secondary host is down, so we should run the following QMP commands
  to remove the NBD child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
        "parent": "colo1",
        "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
        "command-line": "drive_del xxxx"
    }
  }
  Note: there is currently no QMP command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following QMP command:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk