==================================
vfio-ccw: the basic infrastructure
==================================

Introduction
------------

Here we describe the vfio support for I/O subchannel devices for
Linux/s390. The motivation for vfio-ccw is to pass subchannels through
to a virtual machine, and vfio is the means.

Unlike other hardware architectures, s390 has defined a unified
I/O access method, called Channel I/O. It has its own access
patterns:

- Channel programs run asynchronously on a separate (co)processor.
- The channel subsystem will access any memory designated by the caller
  in the channel program directly, i.e. there is no iommu involved.

Thus when we introduce vfio support for these devices, we realize it
with a mediated device (mdev) implementation. The vfio mdev will be
added to an iommu group, so that it can be managed by the
vfio framework. We also add read/write callbacks for special vfio I/O
regions to pass the channel programs from the mdev to its parent device
(the real I/O subchannel device) to do further address translation and
to perform I/O instructions.

This document does not intend to explain the s390 I/O architecture in
every detail.
More information and references can be found here:

- A good starting point for Channel I/O in general:
  https://en.wikipedia.org/wiki/Channel_I/O
- s390 architecture:
  s390 Principles of Operation manual (IBM Form. No. SA22-7832)
- The existing QEMU code which implements a simple emulated channel
  subsystem could also be a good reference. It makes it easier to follow
  the flow.
  qemu/hw/s390x/css.c

For vfio mediated device framework:

- Documentation/driver-api/vfio-mediated-device.rst

Motivation of vfio-ccw
----------------------

Typically, a guest virtualized via QEMU/KVM on s390 only sees
paravirtualized virtio devices via the "Virtio Over Channel I/O
(virtio-ccw)" transport. This makes virtio devices discoverable via
standard operating system algorithms for handling channel devices.

However, this is not enough. On s390, the majority of devices use the
standard Channel I/O based mechanism, and we also need to be able to
pass them through to a QEMU virtual machine.
This includes devices that don't have a virtio counterpart (e.g. tape
drives) or that have specific characteristics which guests want to
exploit.

For passing a device to a guest, we want to use the same interface as
everybody else, namely vfio. We implement this vfio support for channel
devices via the vfio mediated device framework and the subchannel device
driver "vfio_ccw".

Access patterns of CCW devices
------------------------------

The s390 architecture implements a so-called channel subsystem, which
provides a unified view of the devices physically attached to the
systems. Though the s390 hardware platform knows about a huge variety of
different peripheral attachments like disk devices (aka. DASDs), tapes,
communication controllers, etc., they can all be accessed by a well
defined access method and they present I/O completion in a unified
way: I/O interruptions.

All I/O requires the use of channel command words (CCWs). A CCW is an
instruction to a specialized I/O channel processor. A channel program is
a sequence of CCWs which are executed by the I/O channel subsystem. To
issue a channel program to the channel subsystem, it is required to
build an operation request block (ORB), which can be used to point out
the format of the CCW and other control information to the system. The
operating system signals the I/O channel subsystem to begin executing
the channel program with a SSCH (start sub-channel) instruction.
The central processor is then free to proceed with non-I/O instructions
until interrupted. The I/O completion result is received by the
interrupt handler in the form of an interrupt response block (IRB).

Back to vfio-ccw, in short:

- ORBs and channel programs are built in the guest kernel (with guest
  physical addresses).
- ORBs and channel programs are passed to the host kernel.
- The host kernel translates the guest physical addresses to real
  addresses and starts the I/O by issuing a privileged Channel I/O
  instruction (e.g. SSCH).
- Channel programs run asynchronously on a separate processor.
- I/O completion will be signaled to the host with I/O interruptions,
  and it will be copied as an IRB to user space to pass it back to the
  guest.

Physical vfio ccw device and its child mdev
-------------------------------------------

As mentioned above, we realize vfio-ccw with a mdev implementation.

Channel I/O does not have IOMMU hardware support, so the physical
vfio-ccw device does not have an IOMMU level translation or isolation.

Subchannel I/O instructions are all privileged instructions.
When handling the I/O instruction interception, vfio-ccw polices and
translates the channel program in software before it gets sent to
hardware.

Within this implementation, we have two drivers for two types of
devices:

- The vfio_ccw driver for the physical subchannel device.
  This is an I/O subchannel driver for the real subchannel device. It
  realizes a group of callbacks and registers to the mdev framework as a
  parent (physical) device. As a consequence, mdev provides vfio_ccw a
  generic interface (sysfs) to create mdev devices. A vfio mdev could be
  created by vfio_ccw then and added to the mediated bus. It is the vfio
  device that is added to an IOMMU group and a vfio group.
  vfio_ccw also provides an I/O region to accept channel program
  requests from user space and store I/O interrupt results for user
  space to retrieve. To notify user space of an I/O completion, it
  offers an interface to set up an eventfd for asynchronous signaling.

- The vfio_mdev driver for the mediated vfio ccw device.
  This is provided by the mdev framework. It is a vfio device driver for
  the mdev that is created by vfio_ccw.
  It realizes a group of vfio device driver callbacks, adds itself to a
  vfio group, and registers itself to the mdev framework as a mdev
  driver.
  It uses a vfio iommu backend that uses the existing map and unmap
  ioctls, but rather than programming them into an IOMMU for a device,
  it simply stores the translations for use by later requests. This
  means that a device programmed in a VM with guest physical addresses
  can have the vfio kernel convert that address to process virtual
  address, pin the page and program the hardware with the host physical
  address in one step.
  For a mdev, the vfio iommu backend will not pin the pages during the
  VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a
  database of the iova<->vaddr mappings in this operation. The
  vfio_pin_pages and vfio_unpin_pages interfaces are exported from the
  vfio iommu backend so that physical devices can pin and unpin pages on
  demand.
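
The "store now, translate and pin on demand" idea described above can be
modelled in a few lines of plain C. This is only an illustrative user-space
sketch, not the kernel's data structures: ``struct dma_mapping``,
``struct mapping_db``, ``db_map()`` and ``db_translate()`` are hypothetical
names, and no page is actually pinned.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one iova<->vaddr record; not the kernel's struct. */
struct dma_mapping {
    uint64_t iova;   /* guest physical (I/O virtual) start address */
    uint64_t vaddr;  /* process virtual start address in user space */
    uint64_t size;   /* length of the mapping in bytes */
};

#define MAX_MAPPINGS 16

struct mapping_db {
    struct dma_mapping entries[MAX_MAPPINGS];
    size_t count;
};

/* Record a mapping only; nothing is pinned at map time. */
static int db_map(struct mapping_db *db, uint64_t iova, uint64_t vaddr,
                  uint64_t size)
{
    if (db->count == MAX_MAPPINGS || size == 0)
        return -1;
    db->entries[db->count++] = (struct dma_mapping){ iova, vaddr, size };
    return 0;
}

/*
 * Look up an iova on demand. A real implementation would pin the page
 * behind the resulting address at this point; this sketch only translates.
 */
static int db_translate(const struct mapping_db *db, uint64_t iova,
                        uint64_t *vaddr)
{
    for (size_t i = 0; i < db->count; i++) {
        const struct dma_mapping *m = &db->entries[i];

        if (iova >= m->iova && iova - m->iova < m->size) {
            *vaddr = m->vaddr + (iova - m->iova);
            return 0;
        }
    }
    return -1; /* no stored mapping covers this iova */
}
```

Deferring the lookup means only the pages a channel program actually
references need to be pinned, rather than the whole mapped range.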

Below is a high level block diagram::

 +-------------+
 |             |
 | +---------+ | mdev_register_driver() +--------------+
 | |  Mdev   | +<-----------------------+              |
 | |  bus    | |                        | vfio_mdev.ko |
 | | driver  | +----------------------->+              |<-> VFIO user
 | +---------+ |    probe()/remove()    +--------------+    APIs
 |             |
 |  MDEV CORE  |
 |   MODULE    |
 |   mdev.ko   |
 | +---------+ | mdev_register_parent() +--------------+
 | |Physical | +<-----------------------+              |
 | | device  | |                        |  vfio_ccw.ko |<-> subchannel
 | |interface| +----------------------->+              |     device
 | +---------+ |       callback         +--------------+
 +-------------+

The process of how these work together:

1. vfio_ccw.ko drives the physical I/O subchannel, and registers the
   physical device (with callbacks) to the mdev framework.
   When vfio_ccw probes the subchannel device, it registers the device
   pointer and callbacks to the mdev framework. Mdev related file nodes
   under the device node in sysfs would be created for the subchannel
   device, namely 'mdev_create', 'mdev_destroy' and
   'mdev_supported_types'.
2. Create a mediated vfio ccw device.
   Using the 'mdev_create' sysfs file, we need to manually create one (and
   only one for our case) mediated device.
3. vfio_mdev.ko drives the mediated ccw device.
   vfio_mdev is also the vfio device driver. It will probe the mdev and
   add it to an iommu_group and a vfio_group. Then we could pass through
   the mdev to a guest.


VFIO-CCW Regions
----------------

The vfio-ccw driver exposes MMIO regions to accept requests from and return
results to userspace.

vfio-ccw I/O region
-------------------

An I/O region is used to accept channel program requests from user
space and store I/O interrupt results for user space to retrieve. The
definition of the region is::

  struct ccw_io_region {
  #define ORB_AREA_SIZE 12
          __u8    orb_area[ORB_AREA_SIZE];
  #define SCSW_AREA_SIZE 12
          __u8    scsw_area[SCSW_AREA_SIZE];
  #define IRB_AREA_SIZE 96
          __u8    irb_area[IRB_AREA_SIZE];
          __u32   ret_code;
  } __packed;

This region is always available.
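
As a rough illustration of the layout this definition implies, the sketch
below mirrors the region in user-space C. It is an assumption-laden mirror,
not the kernel header: ``struct ccw_io_region_sketch`` is a hypothetical
name, ``uint8_t``/``uint32_t`` stand in for ``__u8``/``__u32``, and the
GCC/Clang ``packed`` attribute stands in for ``__packed``.

```c
#include <stddef.h>
#include <stdint.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/* User-space mirror of the region layout (hypothetical name). */
struct ccw_io_region_sketch {
    uint8_t  orb_area[ORB_AREA_SIZE];   /* bytes 0..11   */
    uint8_t  scsw_area[SCSW_AREA_SIZE]; /* bytes 12..23  */
    uint8_t  irb_area[IRB_AREA_SIZE];   /* bytes 24..119 */
    uint32_t ret_code;                  /* bytes 120..123 */
} __attribute__((packed));

/* Compile-time checks of the offsets implied by the three area sizes. */
_Static_assert(offsetof(struct ccw_io_region_sketch, scsw_area) == 12,
               "scsw_area follows the 12-byte orb_area");
_Static_assert(offsetof(struct ccw_io_region_sketch, irb_area) == 24,
               "irb_area follows orb_area and scsw_area");
_Static_assert(offsetof(struct ccw_io_region_sketch, ret_code) == 120,
               "ret_code follows the 96-byte irb_area");
_Static_assert(sizeof(struct ccw_io_region_sketch) == 124,
               "packed region is 12 + 12 + 96 + 4 bytes");
```

Because the structure is packed, a user-space program can read or write the
whole region with a single 124-byte buffer.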

While starting an I/O request, orb_area should be filled with the
guest ORB, and scsw_area should be filled with the SCSW of the Virtual
Subchannel.

irb_area stores the I/O result.

ret_code stores a return code for each access of the region. The following
values may occur:

``0``
  The operation was successful.

``-EOPNOTSUPP``
  The ORB specified transport mode or the
  SCSW specified a function other than the start function.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests, or an internal error occurred.

``-EBUSY``
  The subchannel was status pending or busy, or a request is already active.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EACCES``
  The channel path(s) used for the I/O were found to be not operational.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  The orb specified a chain longer than 255 ccws, or an internal error
  occurred.
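
A user-space consumer of the I/O region might turn ret_code into a
human-readable reason along the following lines. This is a minimal sketch
based only on the list above; ``io_ret_code_str()`` and
``io_should_retry()`` are hypothetical helper names, and how a caller maps
these values to guest-visible condition codes is deliberately left out.

```c
#include <errno.h>
#include <stdint.h>

/* Describe an I/O region ret_code, following the documented values. */
static const char *io_ret_code_str(int32_t ret_code)
{
    switch (ret_code) {
    case 0:
        return "success";
    case -EOPNOTSUPP:
        return "transport mode ORB or non-start function in SCSW";
    case -EIO:
        return "device not ready for requests, or internal error";
    case -EBUSY:
        return "subchannel status pending or busy, or request active";
    case -EAGAIN:
        return "request being processed, retry";
    case -EACCES:
        return "channel path(s) not operational";
    case -ENODEV:
        return "device not operational";
    case -EINVAL:
        return "chain longer than 255 ccws, or internal error";
    default:
        return "undocumented return code";
    }
}

/* Only -EAGAIN is documented as "the caller should retry". */
static int io_should_retry(int32_t ret_code)
{
    return ret_code == -EAGAIN;
}
```

ret_code is declared as ``__u32`` in the region, so the sketch assumes the
caller reinterprets it as a signed 32-bit value before comparing it with
negative errno constants.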


vfio-ccw cmd region
-------------------

The vfio-ccw cmd region is used to accept asynchronous instructions
from userspace::

  #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
  #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)
  struct ccw_cmd_region {
         __u32 command;
         __u32 ret_code;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD.

Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region.

command specifies the command to be issued; ret_code stores a return code
for each access of the region. The following values may occur:

``0``
  The operation was successful.

``-ENODEV``
  The device was found to be not operational.

``-EINVAL``
  A command other than halt or clear was specified.

``-EIO``
  A request was issued while the device was not in a state ready to accept
  requests.

``-EAGAIN``
  A request was being processed, and the caller should retry.

``-EBUSY``
  The subchannel was status pending or busy while processing a halt request.

vfio-ccw schib region
---------------------

The vfio-ccw schib region is used to return Subchannel-Information
Block (SCHIB) data to userspace::

  struct ccw_schib_region {
  #define SCHIB_AREA_SIZE 52
         __u8 schib_area[SCHIB_AREA_SIZE];
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB.

Reading this region triggers a STORE SUBCHANNEL to be issued to the
associated hardware.

vfio-ccw crw region
-------------------

The vfio-ccw crw region is used to return Channel Report Word (CRW)
data to userspace::

  struct ccw_crw_region {
         __u32 crw;
         __u32 pad;
  } __packed;

This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW.

Reading this region returns a CRW if one that is relevant for this
subchannel (e.g. one reporting changes in channel path state) is
pending, or all zeroes if not.
If multiple CRWs are pending (including
possibly chained CRWs), reading this region again will return the next
one, until no more CRWs are pending and zeroes are returned. This is
similar to how STORE CHANNEL REPORT WORD works.

vfio-ccw operation details
--------------------------

vfio-ccw follows what vfio-pci did on the s390 platform and uses
vfio-iommu-type1 as the vfio iommu backend.

* CCW translation APIs
  A group of APIs (starting with `cp_`) to do CCW translation. The CCWs
  passed in by a user space program are organized with their guest
  physical memory addresses. These APIs will copy the CCWs into kernel
  space, and assemble a runnable kernel channel program by updating the
  guest physical addresses with their corresponding host physical addresses.
  Note that we have to use IDALs even for direct-access CCWs, as the
  referenced memory can be located anywhere, including above 2G.

* vfio_ccw device driver
  This driver utilizes the CCW translation APIs and introduces
  vfio_ccw, which is the driver for the I/O subchannel devices you want
  to pass through.
  vfio_ccw implements the following vfio ioctls::

    VFIO_DEVICE_GET_INFO
    VFIO_DEVICE_GET_IRQ_INFO
    VFIO_DEVICE_GET_REGION_INFO
    VFIO_DEVICE_RESET
    VFIO_DEVICE_SET_IRQS

  This provides an I/O region, so that the user space program can pass a
  channel program to the kernel, to do further CCW translation before
  issuing them to a real device.
  This also provides the VFIO_DEVICE_SET_IRQS ioctl to set up an event
  notifier to notify the user space program of the I/O completion in an
  asynchronous way.

The use of vfio-ccw is not limited to QEMU, but QEMU is definitely a
good example to help understand how these patches work. Here is a little
bit more detail on how an I/O request triggered by the QEMU guest will be
handled (without error handling).

Explanation:

- Q1-Q7: QEMU side process.
- K1-K6: Kernel side process.

Q1.
    Get I/O region info during initialization.

Q2.
    Set up the event notifier and handler to handle I/O completion.

... ...

Q3.
    Intercept a ssch instruction.
Q4.
    Write the guest channel program and ORB to the I/O region.

    K1.
        Copy from guest to kernel.
    K2.
        Translate the guest channel program to a host kernel space
        channel program, which becomes runnable for a real device.
    K3.
        With the necessary information contained in the orb passed in
        by QEMU, issue the ccwchain to the device.
    K4.
        Return the ssch CC code.
Q5.
    Return the CC code to the guest.

... ...

    K5.
        The interrupt handler gets the I/O result and writes the result
        to the I/O region.
    K6.
        Signal QEMU to retrieve the result.

Q6.
    Get the signal; the event handler reads out the result from the I/O
    region.
Q7.
    Update the irb for the guest.

Limitations
-----------

The current vfio-ccw implementation focuses on supporting basic commands
needed to implement block device functionality (read/write) of DASD/ECKD
devices only. Some commands may need special handling in the future, for
example, anything related to path grouping.

DASD is a kind of storage device, while ECKD is a data recording format.
More information on DASD and ECKD can be found here:
https://en.wikipedia.org/wiki/Direct-access_storage_device
https://en.wikipedia.org/wiki/Count_key_data

Together with the corresponding work in QEMU, we can bring the passed
through DASD/ECKD device online in a guest now and use it as a block
device.

The current code allows the guest to start channel programs via
START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL,
and STORE SUBCHANNEL.

Currently all channel programs are prefetched, regardless of the
p-bit setting in the ORB. As a result, self modifying channel
programs are not supported. For this reason, IPL has to be handled as
a special case by a userspace/guest program; this has been implemented
in QEMU's s390-ccw bios as of QEMU 4.1.

vfio-ccw supports classic (command mode) channel I/O only. Transport
mode (HPF) is not supported.

QDIO subchannels are currently not supported. Classic devices other than
DASD/ECKD might work, but have not been tested.

Reference
---------

1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832)
2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204)
3. https://en.wikipedia.org/wiki/Channel_I/O
4. Documentation/arch/s390/cds.rst
5. Documentation/driver-api/vfio.rst
6. Documentation/driver-api/vfio-mediated-device.rst