QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which is available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by the memory
backend (i.e. memory-backend-file and memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory backend
   device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
   persistent writes. Linux guest drivers set the device to read-only when this
   bit is present. Set unarmed to on when the memdev has readonly=on.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.
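
The slot and size accounting described above can be sketched as a small
shell helper that prints the options for several vNVDIMM devices. This
is only an illustration under stated assumptions: the helper name
nvdimm_opts, the use of whole-GiB sizes, a single normal RAM device, and
the /tmp/*.img paths are all hypothetical and not part of QEMU. The
script only prints options; it does not start QEMU or create files.

```shell
#!/bin/sh
# Sketch: print QEMU options for one RAM device plus N vNVDIMM devices.
# Hypothetical helper; usage: nvdimm_opts RAM_GIB NVDIMM_GIB PATH...
nvdimm_opts() {
    ram_gb=$1; shift            # size of normal RAM in GiB
    nvdimm_gb=$1; shift         # size of each vNVDIMM in GiB
    count=$#                    # one backend file path per vNVDIMM

    # slots  >= number of RAM devices + number of vNVDIMM devices
    # maxmem >= size of RAM + total size of vNVDIMM devices
    slots=$((count + 1))
    maxmem=$((ram_gb + count * nvdimm_gb))

    printf '%s\n' "-machine pc,nvdimm=on"
    printf '%s\n' "-m ${ram_gb}G,slots=${slots},maxmem=${maxmem}G"

    # Emit one "-object"/"-device" pair per backend file.
    i=1
    for path in "$@"; do
        printf '%s\n' "-object memory-backend-file,id=mem${i},share=on,mem-path=${path},size=${nvdimm_gb}G,readonly=off"
        printf '%s\n' "-device nvdimm,id=nvdimm${i},memdev=mem${i},unarmed=off"
        i=$((i + 1))
    done
}

# Example: 4 GiB of RAM and two 2 GiB vNVDIMM devices.
nvdimm_opts 4 2 /tmp/nvdimm1.img /tmp/nvdimm2.img
```

The values computed here are the minimums stated above; larger values
for "slots" and "maxmem" are needed if devices will be hotplugged later.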

With the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially for the
   shrink case.

   QEMU v2.8.0 and later check the backend file size and the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid the data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.
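
Because a mismatch between the backend file size and the "size" option
either corrupts data (before v2.8.0) or aborts QEMU (v2.8.0 and later),
it can be useful to check both before starting the guest. The following
is a minimal sketch, assuming sizes are given in bytes; the helper names
check_backend_size and size_is_aligned are hypothetical and not part of
QEMU.

```shell
#!/bin/sh
# Sketch: pre-flight checks for a memory-backend-file backend.
# Hypothetical helpers; all sizes are in bytes.

# check_backend_size FILE SIZE
# Succeeds only if FILE exists and is exactly SIZE bytes long.
check_backend_size() {
    file=$1
    want=$2
    if [ ! -f "$file" ]; then
        echo "backend file not found: $file" >&2
        return 2
    fi
    # Arithmetic expansion strips any padding that wc may print.
    have=$(( $(wc -c < "$file") ))
    if [ "$have" -ne "$want" ]; then
        echo "size mismatch: $file is $have bytes, expected $want" >&2
        return 1
    fi
    return 0
}

# size_is_aligned SIZE ALIGN
# Succeeds if SIZE is a multiple of ALIGN (e.g. 4096 for 4KB).
size_is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}
```

For example, "check_backend_size nvdimm.img 4294967296 &&
size_is_aligned 4294967296 4096" succeeds only for a 4GB file named
nvdimm.img, which is an illustrative file name.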

Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, add the "label-size=$SZ" option
to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimum label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

Hotplug
-------

QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
devices. Similar to RAM hotplug, vNVDIMM hotplug is accomplished by
two monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option to
memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax.
The NUM of 'align=NUM' option
must be larger than or equal to the 'align' of device dax.
We can use one of the following commands to show the 'align' of device dax.

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of device dax, you need to install
the library 'libdaxctl'.

For example, a device dax that requires 2 MB alignment (/dev/dax0.0) can
be used as the backend of a vNVDIMM with the following QEMU command line
options:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e.
if either 'pmem' or 'share'
is not set, if the backend file does not support DAX or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied. Currently, no way is
provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, the guest Linux NVDIMM driver,
for example, marks such a vNVDIMM device as read-only.

Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First, an fsdax block device is created and partitioned, and
then mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

Then the new file in /mnt can be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence. Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.
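
Since only two values are accepted for "nvdimm-persistence", a wrapper
script can reject anything else before QEMU is started. The following is
a minimal sketch; the helper name persistence_machine_opt is
hypothetical, and the remaining machine options are copied from the
example above.

```shell
#!/bin/sh
# Sketch: build the -machine option for a given nvdimm-persistence value.
# Hypothetical helper; accepts only the two values listed above.
persistence_machine_opt() {
    case "$1" in
        mem-ctrl|cpu)
            printf '%s\n' "-machine pc,accel=kvm,nvdimm,nvdimm-persistence=$1"
            ;;
        *)
            echo "unsupported nvdimm-persistence value: $1" >&2
            return 1
            ;;
    esac
}

# Example: print the option for the "cpu" persistence domain.
persistence_machine_opt cpu
```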

If the vNVDIMM backend is in host persistent memory that can be accessed in
SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set
the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
is built with libpmem [2] support (configured with --enable-libpmem), QEMU
will take necessary operations to guarantee the persistence of its own writes
to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live migration).
If 'pmem' is 'on' while there is no libpmem support, QEMU will exit and report
a "lack of libpmem support" message to ensure the persistence is available.
For example, if we want to ensure the persistence for some backend file,
use the QEMU command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
	Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html