1d7e2a8f | 13-Oct-2023 |
Hawkins Jiawei <yin31149@gmail.com> |
vdpa: Introduce cursors to vhost_vdpa_net_loadx()
This patch introduces two new arugments, `out_cursor` and `in_cursor`, to vhost_vdpa_net_loadx(). Addtionally, it includes a helper function vhost_v
vdpa: Introduce cursors to vhost_vdpa_net_loadx()
This patch introduces two new arugments, `out_cursor` and `in_cursor`, to vhost_vdpa_net_loadx(). Addtionally, it includes a helper function vhost_vdpa_net_load_cursor_reset() for resetting these cursors.
Furthermore, this patch refactors vhost_vdpa_net_load_cmd() so that vhost_vdpa_net_load_cmd() prepares buffers for the device using the cursors arguments, instead of directly accesses `s->cvq_cmd_out_buffer` and `s->status` fields.
By making these change, next patches in this series can refactor vhost_vdpa_net_load_cmd() directly to iterate through the control commands shadow buffers, allowing QEMU to send CVQ state load commands in parallel at device startup.
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Message-Id: <1c6516e233a14cc222f0884e148e4e1adceda78d.1697165821.git.yin31149@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
24e59cfe | 13-Oct-2023 |
Hawkins Jiawei <yin31149@gmail.com> |
vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
Considering that vhost_vdpa_net_load_rx_mode() is only called within vhost_vdpa_net_load_rx() now, this patch refactors vhost_vdpa_net_load_rx
vdpa: Check device ack in vhost_vdpa_net_load_rx_mode()
Considering that vhost_vdpa_net_load_rx_mode() is only called within vhost_vdpa_net_load_rx() now, this patch refactors vhost_vdpa_net_load_rx_mode() to include a check for the device's ack, simplifying the code and improving its maintainability.
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <68811d52f96ae12d68f0d67d996ac1642a623943.1697165821.git.yin31149@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
327dedb8 | 13-Oct-2023 |
Hawkins Jiawei <yin31149@gmail.com> |
vdpa: Avoid using vhost_vdpa_net_load_*() outside vhost_vdpa_net_load()
Next patches in this series will refactor vhost_vdpa_net_load_cmd() to iterate through the control commands shadow buffers, al
vdpa: Avoid using vhost_vdpa_net_load_*() outside vhost_vdpa_net_load()
Next patches in this series will refactor vhost_vdpa_net_load_cmd() to iterate through the control commands shadow buffers, allowing QEMU to send CVQ state load commands in parallel at device startup.
Considering that QEMU always forwards the CVQ command serialized outside of vhost_vdpa_net_load(), it is more elegant to send the CVQ commands directly without invoking vhost_vdpa_net_load_*() helpers.
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <254f0618efde7af7229ba4fdada667bb9d318991.1697165821.git.yin31149@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
73071f19 | 04-Oct-2023 |
Philippe Mathieu-Daudé <philmd@linaro.org> |
net/net: Clean up global variable shadowing
Fix:
net/net.c:1680:35: error: declaration shadows a variable in the global scope [-Werror,-Wshadow] bool netdev_is_modern(const char *optarg)
net/net: Clean up global variable shadowing
Fix:
net/net.c:1680:35: error: declaration shadows a variable in the global scope [-Werror,-Wshadow] bool netdev_is_modern(const char *optarg) ^ net/net.c:1714:38: error: declaration shadows a variable in the global scope [-Werror,-Wshadow] void netdev_parse_modern(const char *optarg) ^ net/net.c:1728:60: error: declaration shadows a variable in the global scope [-Werror,-Wshadow] void net_client_parse(QemuOptsList *opts_list, const char *optarg) ^ /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/getopt.h:77:14: note: previous declaration is here extern char *optarg; /* getopt(3) external variables */ ^
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20231004120019.93101-4-philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
show more ...
|
0a7a164b | 13-Sep-2023 |
Eugenio Pérez <eperezma@redhat.com> |
vdpa net: zero vhost_vdpa iova_tree pointer at cleanup
Not zeroing it causes a SIGSEGV if the live migration is cancelled, at net device restart.
This is caused because CVQ tries to reuse the iova_
vdpa net: zero vhost_vdpa iova_tree pointer at cleanup
Not zeroing it causes a SIGSEGV if the live migration is cancelled, at net device restart.
This is caused because CVQ tries to reuse the iova_tree that is present in the first vhost_vdpa device at the end of vhost_vdpa_net_cvq_start. As a consequence, it tries to access an iova_tree that has been already free.
Fixes: 00ef422e9fbf ("vdpa net: move iova tree creation from init to start") Reported-by: Yanhui Ma <yama@redhat.com> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20230913123408.2819185-1-eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
b0de17a2 | 29-Aug-2023 |
Hawkins Jiawei <yin31149@gmail.com> |
vhost: Add count argument to vhost_svq_poll()
Next patches in this series will no longer perform an immediate poll and check of the device's used buffers for each CVQ state load command. Instead, th
vhost: Add count argument to vhost_svq_poll()
Next patches in this series will no longer perform an immediate poll and check of the device's used buffers for each CVQ state load command. Instead, they will send CVQ state load commands in parallel by polling multiple pending buffers at once.
To achieve this, this patch refactoring vhost_svq_poll() to accept a new argument `num`, which allows vhost_svq_poll() to wait for the device to use multiple elements, rather than polling for a single element.
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <950b3bfcfc5d446168b9d6a249d554a013a691d4.1693287885.git.yin31149@gmail.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
f13f5f64 | 22-Aug-2023 |
Eugenio Pérez <eperezma@redhat.com> |
vdpa: remove net cvq migration blocker
Now that we have add migration blockers if the device does not support all the needed features, remove the general blocker applied to all net devices with CVQ.
vdpa: remove net cvq migration blocker
Now that we have add migration blockers if the device does not support all the needed features, remove the general blocker applied to all net devices with CVQ.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230822085330.3978829-6-eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
6c482547 | 22-Aug-2023 |
Eugenio Pérez <eperezma@redhat.com> |
vdpa: move vhost_vdpa_set_vring_ready to the caller
Doing that way allows CVQ to be enabled before the dataplane vqs, restoring the state as MQ or MAC addresses properly in the case of a migration.
vdpa: move vhost_vdpa_set_vring_ready to the caller
Doing that way allows CVQ to be enabled before the dataplane vqs, restoring the state as MQ or MAC addresses properly in the case of a migration.
The patch does it by defining a ->load NetClientInfo callback also for dataplane. Ideally, this should be done by an independent patch, but the function is already static so it would only add an empty vhost_vdpa_net_data_load stub.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Message-Id: <20230822085330.3978829-5-eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
b40eba9c | 22-Aug-2023 |
Eugenio Pérez <eperezma@redhat.com> |
vdpa: use first queue SVQ state for CVQ default
Previous to this patch the only way CVQ would be shadowed is if it does support to isolate CVQ group or if all vqs were shadowed from the beginning.
vdpa: use first queue SVQ state for CVQ default
Previous to this patch the only way CVQ would be shadowed is if it does support to isolate CVQ group or if all vqs were shadowed from the beginning. The second condition was checked at the beginning, and no more configuration was done.
After this series we need to check if data queues are shadowed because they are in the middle of the migration. As checking if they are shadowed already covers the previous case, let's just mimic it.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-Id: <20230822085330.3978829-2-eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
show more ...
|
6d7a53e9 | 24-Aug-2023 |
Peter Maydell <peter.maydell@linaro.org> |
net/tap: Avoid variable-length array
Use a heap allocation instead of a variable length array in tap_receive_iov().
The codebase has very few VLAs, and if we can get rid of them all we can make the
net/tap: Avoid variable-length array
Use a heap allocation instead of a variable length array in tap_receive_iov().
The codebase has very few VLAs, and if we can get rid of them all we can make the compiler error on new additions. This is a defensive measure against security bugs where an on-stack dynamic allocation isn't correctly size-checked (e.g. CVE-2021-3527).
Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Francisco Iglesias <frasse.iglesias@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
show more ...
|
cb039ef3 | 13-Sep-2023 |
Ilya Maximets <i.maximets@ovn.org> |
net: add initial support for AF_XDP network backend
AF_XDP is a network socket family that allows communication directly with the network device driver in the kernel, bypassing most or all of the ke
net: add initial support for AF_XDP network backend
AF_XDP is a network socket family that allows communication directly with the network device driver in the kernel, bypassing most or all of the kernel networking stack. In the essence, the technology is pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native and works with any network interfaces without driver modifications. Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't require access to character devices or unix sockets. Only access to the network interface itself is necessary.
This patch implements a network backend that communicates with the kernel by creating an AF_XDP socket. A chunk of userspace memory is shared between QEMU and the host kernel. 4 ring buffers (Tx, Rx, Fill and Completion) are placed in that memory along with a pool of memory buffers for the packet data. Data transmission is done by allocating one of the buffers, copying packet data into it and placing the pointer into Tx ring. After transmission, device will return the buffer via Completion ring. On Rx, device will take a buffer form a pre-populated Fill ring, write the packet data into it and place the buffer into Rx ring.
AF_XDP network backend takes on the communication with the host kernel and the network interface and forwards packets to/from the peer device in QEMU.
Usage example:
-device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
XDP program bridges the socket with a network interface. It can be attached to the interface in 2 different modes:
1. skb - this mode should work for any interface and doesn't require driver support. With a caveat of lower performance.
2. native - this does require support from the driver and allows to bypass skb allocation in the kernel and potentially use zero-copy while getting packets in/out userspace.
By default, QEMU will try to use native mode and fall back to skb. Mode can be forced via 'mode' option. To force 'copy' even in native mode, use 'force-copy=on' option. This might be useful if there is some issue with the driver.
Option 'queues=N' allows to specify how many device queues should be open. Note that all the queues that are not open are still functional and can receive traffic, but it will not be delivered to QEMU. So, the number of device queues should generally match the QEMU configuration, unless the device is shared with something else and the traffic re-direction to appropriate queues is correctly configured on a device level (e.g. with ethtool -N). 'start-queue=M' option can be used to specify from which queue id QEMU should start configuring 'N' queues. It might also be necessary to use this option with certain NICs, e.g. MLX5 NICs. See the docs for examples.
In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN or CAP_BPF capabilities in order to load default XSK/XDP programs to the network interface and configure BPF maps. It is possible, however, to run with no capabilities. For that to work, an external process with enough capabilities will need to pre-load default XSK program, create AF_XDP sockets and pass their file descriptors to QEMU process on startup via 'sock-fds' option. Network backend will need to be configured with 'inhibit=on' to avoid loading of the program. QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK.
There are few performance challenges with the current network backends.
First is that they do not support IO threads. This means that data path is handled by the main thread in QEMU and may slow down other work or may be slowed down by some other work. This also means that taking advantage of multi-queue is generally not possible today.
Another thing is that data path is going through the device emulation code, which is not really optimized for performance. The fastest "frontend" device is virtio-net. But it's not optimized for heavy traffic either, because it expects such use-cases to be handled via some implementation of vhost (user, kernel, vdpa). In practice, we have virtio notifications and rcu lock/unlock on a per-packet basis and not very efficient accesses to the guest memory. Communication channels between backend and frontend devices do not allow passing more than one packet at a time as well.
Some of these challenges can be avoided in the future by adding better batching into device emulation or by implementing vhost-af-xdp variant.
There are also a few kernel limitations. AF_XDP sockets do not support any kinds of checksum or segmentation offloading. Buffers are limited to a page size (4K), i.e. MTU is limited. Multi-buffer support implementation for AF_XDP is in progress, but not ready yet. Also, transmission in all non-zero-copy modes is synchronous, i.e. done in a syscall. That doesn't allow high packet rates on virtual interfaces.
However, keeping in mind all of these challenges, current implementation of the AF_XDP backend shows a decent performance while running on top of a physical NIC with zero-copy support.
Test setup:
2 VMs running on 2 physical hosts connected via ConnectX6-Dx card. Network backend is configured to open the NIC directly in native mode. The driver supports zero-copy. NIC is configured to use 1 queue.
Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd for PPS testing.
iperf3 result: TCP stream : 19.1 Gbps
dpdk-testpmd (single queue, single CPU core, 64 B packets) results: Tx only : 3.4 Mpps Rx only : 2.0 Mpps L2 FWD Loopback : 1.5 Mpps
In skb mode the same setup shows much lower performance, similar to the setup where pair of physical NICs is replaced with veth pair:
iperf3 result: TCP stream : 9 Gbps
dpdk-testpmd (single queue, single CPU core, 64 B packets) results: Tx only : 1.2 Mpps Rx only : 1.0 Mpps L2 FWD Loopback : 0.7 Mpps
Results in skb mode or over the veth are close to results of a tap backend with vhost=on and disabled segmentation offloading bridged with a NIC.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> (docker/lcitool) Signed-off-by: Jason Wang <jasowang@redhat.com>
show more ...
|