QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which has been available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file and memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory
   backend device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3, which indicates that the device is "unarmed" and
   cannot accept persistent writes. Linux guest drivers set the device to
   read-only when this bit is set. Set unarmed to on when the memdev has
   readonly=on.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

With the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.
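
As an illustration only, the options described above can be assembled
programmatically. The following Python sketch is not part of QEMU; the
function name vnvdimm_args and the id naming scheme (mem1/nvdimm1,
mem2/nvdimm2, ...) are assumptions of this example. It emits one
"-object"/"-device" pair per backend and passes all sizes through to
QEMU unchanged.

```python
def vnvdimm_args(ram_size, slots, maxmem, backends):
    """Return QEMU options for one -object/-device pair per backend.

    backends is a list of (mem_path, size) tuples. Sizes are strings in
    QEMU notation (e.g. "4G") and are not validated here. The caller is
    responsible for choosing slots and maxmem large enough.
    """
    args = ["-machine", "pc,nvdimm=on",
            "-m", f"{ram_size},slots={slots},maxmem={maxmem}"]
    # Hypothetical naming scheme: mem1/nvdimm1, mem2/nvdimm2, ...
    for i, (mem_path, size) in enumerate(backends, start=1):
        args += ["-object",
                 f"memory-backend-file,id=mem{i},share=on,"
                 f"mem-path={mem_path},size={size},readonly=off",
                 "-device",
                 f"nvdimm,id=nvdimm{i},memdev=mem{i},unarmed=off"]
    return args
```

The returned list can be appended to the rest of a QEMU command line,
for example when launching QEMU through subprocess.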

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially in the
   shrink case.

   QEMU v2.8.0 and later check the backend file size against the "size"
   option. If they do not match, QEMU will report an error and abort in
   order to avoid data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86.  However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.
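
The two notes above suggest checking a backend file before starting
QEMU. The following Python sketch is one possible pre-flight check, not
something QEMU provides; the function name check_backend_file is
hypothetical, and the default alignment of 4096 bytes is merely the x86
example value from note 2. It only makes sense for regular files, since
the size reported by stat(2) is not meaningful for character devices.

```python
import os


def check_backend_file(path, size, align=4096):
    """Return a list of problems; an empty list means the checks passed.

    size is the value intended for the "size" option, in bytes.
    align is the alignment the size is expected to satisfy, in bytes.
    """
    problems = []
    actual = os.stat(path).st_size
    if actual != size:
        # Old QEMU versions would truncate the file in this situation,
        # newer ones refuse to start.
        problems.append(f"file is {actual} bytes but size={size}")
    if size % align != 0:
        problems.append(f"size={size} is not a multiple of {align}")
    return problems
```

Passing align=2 * 1024 * 1024 models the stricter 2MB requirement that
note 2 describes for QEMU v2.7.0 on x86.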

Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, add the "label-size=$SZ" option
to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   metadata of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).
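
To see which part of an existing backend file would become inaccessible
to the guest, the byte range of the label area can be computed from the
two notes above. This Python sketch is an illustration under the
assumption that the label area occupies exactly the last label-size
bytes of the backend; the function name label_area and the check that
the label must be smaller than the backend are choices of this example.

```python
MIN_LABEL_SIZE = 128 * 1024  # minimal label size stated in note 1


def label_area(backend_size, label_size):
    """Return (start, end) byte offsets of the label area.

    Both arguments are in bytes. The range is half-open: bytes from
    start up to, but not including, end are assumed to hold labels.
    """
    if label_size < MIN_LABEL_SIZE:
        raise ValueError("label-size must be at least 128KB")
    if label_size >= backend_size:
        # Sanity check of this sketch: some space must remain for data.
        raise ValueError("label-size must be smaller than the backend")
    return backend_size - label_size, backend_size
```

For a 4GB backend with label-size=128K, the guest-visible data then ends
128KB before the end of the file.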

Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM
devices. Similarly to RAM hotplug, vNVDIMM hotplug is
accomplished by two monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. The same is required for the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices
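
The two inequalities above can be evaluated mechanically when planning
a guest. The following Python sketch applies them literally; the
function name min_slots_and_maxmem is hypothetical and all sizes are
assumed to be given in bytes.

```python
def min_slots_and_maxmem(ram_sizes, static_nvdimm_sizes,
                         hotplug_nvdimm_sizes):
    """Return the smallest (N, M) satisfying the two notes above.

    Each argument is a list of device sizes in bytes: RAM devices,
    vNVDIMM devices given on the command line, and vNVDIMM devices
    that are expected to be hotplugged later.
    """
    sizes = (list(ram_sizes) + list(static_nvdimm_sizes)
             + list(hotplug_nvdimm_sizes))
    # N counts devices, M sums their sizes.
    return len(sizes), sum(sizes)
```

For one 4GB RAM device, one 4GB vNVDIMM at startup and one 4GB vNVDIMM
to be hotplugged, this yields N = 3 and M = 12GB.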

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. For that case, QEMU v2.12.0 and later provide the 'align' option
of memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM of the 'align=NUM'
option must be larger than or equal to the 'align' of the device dax.
Either of the following commands shows the 'align' of the device dax:

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of the device dax, you need to install
the library 'libdaxctl'.

For example, if a device dax requires 2 MB alignment, the following
QEMU command line options use it (/dev/dax0.0) as the backend of a
vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1
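
The rule that NUM must be larger than or equal to the 'align' of the
device dax can be checked before starting QEMU. The Python sketch below
is an illustration only: parse_size is a simplified, hypothetical parser
that accepts plain byte counts and integer values with a K, M, G or T
suffix (treated as powers of 1024), and the device alignment is assumed
to have been read, in bytes, from the output of one of the commands
above.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as "2M" or "4096" to bytes (integers only)."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in UNITS:
        return int(text[:-1]) * UNITS[unit]
    return int(text)


def align_is_sufficient(align_option, dax_align_bytes):
    """True if align=<align_option> is >= the device dax alignment."""
    return parse_size(align_option) >= dax_align_bytes
```

With a device dax alignment of 2097152 bytes, "align=2M" passes this
check while "align=4K" does not.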

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0) or
B. DAX file (on a file system mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX, or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied. Currently, no way is
provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure.  This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.
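
The rules in this section can be summarized as a small decision helper.
The Python sketch below is a simplification for illustration, not QEMU
code: the backend names "dax-device" and "dax-file" and both function
names are made up for this example, and because the text lists no extra
conditions for a DAX device, the sketch simply treats that case as
guaranteed.

```python
def write_persistence_guaranteed(backend, pmem=False, share=False,
                                 map_sync=False):
    """Apply the rules of this section to one backend description.

    backend is "dax-device", "dax-file", or any other string for the
    remaining backend types. map_sync says whether the host kernel
    supports the MAP_SYNC flag of mmap(2).
    """
    if backend == "dax-device":
        return True  # assumption of this sketch, see the lead-in
    if backend == "dax-file":
        return bool(pmem and share and map_sync)
    return False


def suggested_unarmed(backend):
    """Return the suggested value of the 'unarmed' device option."""
    return "off" if backend in ("dax-device", "dax-file") else "on"
```

For a memory-backend-ram backend, for instance, the helper reports no
guarantee and suggests unarmed=on.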

Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs.  First an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

Then the new file in /mnt can be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence.  Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss.  This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed in
SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set
the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
is built with libpmem [2] support (configured with --enable-libpmem), QEMU
will take the necessary operations to guarantee the persistence of its own
writes to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live
migration). If 'pmem' is 'on' while there is no libpmem support, QEMU will
exit and report a "lack of libpmem support" message rather than run without
the persistence guarantee.
For example, to ensure persistence for some backend file, use the QEMU
command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html