Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO will do
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but becomes different as the VM
executes till the next checkpoint. To support disk contents checkpoint,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at next checkpoint time. To reduce the network transportation
effort during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to Secondary
       QEMU.
    2) Before a Primary write request is written to the Secondary disk,
       the original sector content is read from the Secondary disk and
       buffered in the Disk buffer. It does not overwrite sector content
       that already exists in the Disk buffer (which could come from either
       "Secondary Write Requests" or a previous COW of "Primary Write
       Requests").
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite the existing sector content in the buffer.
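
As a rough illustration of steps 1) to 4), the following Python sketch models
the Secondary disk and the Disk buffer as per-sector dictionaries. It is not
QEMU code: the names SecondaryNode, primary_write, secondary_read and
checkpoint are hypothetical, the read path (buffer first, then disk) is an
assumption based on the backing chain described in this document, and
dropping the buffer at a checkpoint follows the Background section.

```python
class SecondaryNode:
    """Toy model of the Secondary side (hypothetical, not QEMU code)."""

    def __init__(self, disk):
        self.disk = dict(disk)  # sector -> content of the Secondary disk
        self.buffer = {}        # sector -> content held in the Disk buffer

    def apply_primary_write(self, sector, data):
        # Step 2: save the original sector content in the buffer, but never
        # overwrite content that is already buffered for this sector.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk[sector]
        # Step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes go to the buffer and overwrite it.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees buffered content first.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk[sector]

    def checkpoint(self):
        # The buffered content is dropped at checkpoint time.
        self.buffer.clear()


def primary_write(primary_disk, secondary, sector, data):
    # Step 1: write locally, then copy and forward to the Secondary.
    primary_disk[sector] = data
    secondary.apply_primary_write(sector, data)


if __name__ == "__main__":
    primary = {0: "a", 1: "b"}
    secondary = SecondaryNode(primary)
    primary_write(primary, secondary, 0, "P")
    print(secondary.secondary_read(0))  # still the pre-write content
    secondary.checkpoint()
    print(secondary.secondary_read(0))  # now the Primary's content
```

In this model the Secondary VM keeps seeing its own view of the disk between
checkpoints, while the Secondary disk itself already tracks the Primary disk.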

== Architecture ==
We are going to implement block replication from many basic
blocks that are already in QEMU.

         virtio-blk       ||
             ^            ||                                     .----------
             |            ||                                     | Secondary
        1 Quorum          ||                                     '----------
         /      \         ||                                                    virtio-blk
        /        \        ||                                                         ^
   Primary    2 filter                                                               |
     disk         ^                                                              7 Quorum
                  |                                                                 /
                3 NBD  ------->  3 NBD                                             /
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |         backing         ^        backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||              blockdev-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (the name is replication) will control the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and the format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and the driver should support bdrv_make_empty() and backing files.

6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication,
the primary disk and secondary disk should contain the same data.

7) The secondary also has a quorum node, so after secondary failover it
can become the new primary and continue replication.


== Failure Handling ==
Seven internal errors can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer.
In cases 2, 3, 4 and 6, we just report block replication's error to the
FT/HA manager (which decides when to do a new checkpoint and when to do
failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation (which decides which target to
write to).

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error happened in
   replication. The caller must hold the I/O mutex lock if it is in
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication.
   The VM should be stopped before calling it if you use this API to
   shut down the guest, or for anything other than failover. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint thread.

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run qmp command in primary qemu:
    { "execute": "human-monitor-command",
      "arguments": {
          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
          "parent": "colo1",
          "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD Client for each primary disk.
  2. host is the secondary physical machine's hostname or IP.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive and you should ignore the
     leading whitespace.
  5. The qmp commands must be run after running the qmp commands in
     secondary qemu.
  6. After primary failover we need to remove children.1 (replication driver).
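
  The two primary-side qmp commands above can also be generated from a
  script. The following Python sketch only builds the JSON; it does not talk
  to QEMU. The helper name primary_replication_commands is hypothetical, and
  the host and port values used in the example are placeholders that must
  match the NBD server started on the secondary.

```python
import json


def primary_replication_commands(host, port, export="colo1",
                                 parent="colo1", node_name="nbd_client1"):
    """Return the two primary-side QMP commands as Python dicts.

    Hypothetical helper: it mirrors the commands listed in this document
    and fills in the host, port and export name of the secondary's NBD
    server.
    """
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        # Create the replication node on top of an NBD client.
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Attach that node as a new child of the quorum node.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


if __name__ == "__main__":
    # 192.0.2.10 and 8889 are placeholder values, not defaults of any tool.
    for cmd in primary_replication_commands("192.0.2.10", 8889):
        print(json.dumps(cmd))
```

  Keeping the export name and the quorum id in one place makes it harder to
  violate notes 1 and 3 when several disks are replicated.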

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run qmp command in secondary qemu:
    { "execute": "nbd-server-start",
      "arguments": {
          "addr": {
              "type": "inet",
              "data": {
                  "host": "xxx",
                  "port": "xxx"
              }
          }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
          "device": "colo1",
          "writable": true
      }
    }

  Note:
  1. The export name in secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same.
  3. The qmp commands nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on primary QEMU.
  4. Active disk, hidden disk and nbd target's length should be the
     same.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp command
  to remove the nbd child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
        "parent": "colo1",
        "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
        "command-line": "drive_del xxxx"
    }
  }
  Note: there is no qmp command to remove the blockdev now

Secondary:
  The primary host is down, so we should do the following thing:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk