.\" $OpenBSD: sosplice.9,v 1.10 2019/07/04 17:42:17 bluhm Exp $
.\"
.\" Copyright (c) 2011-2013 Alexander Bluhm <bluhm@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: July 4 2019 $
.Dt SOSPLICE 9
.Os
.Sh NAME
.Nm sosplice ,
.Nm somove
.Nd splice two sockets for zero-copy data transfer
.Sh SYNOPSIS
.Ft int
.Fn sosplice "struct socket *so" "int fd" "off_t max" "struct timeval *tv"
.Ft int
.Fn somove "struct socket *so" "int wait"
.Sh DESCRIPTION
The function
.Fn sosplice
splices together a source and a drain socket.
The source socket is passed as the
.Fa so
argument;
the file descriptor of the drain is passed in
.Fa fd .
If
.Fa fd
is negative, an existing splice is dissolved.
If
.Fa max
is positive, at most that many bytes are transferred.
If
.Fa tv
is not
.Dv NULL ,
a
.Xr timeout 9
is scheduled to dissolve the splice if no data can be
transferred for the specified period of time.
Socket splicing can be invoked from userland via the
.Xr setsockopt 2
system call at the
.Dv SOL_SOCKET
level with the socket option
.Dv SO_SPLICE .
.Pp
Before the two sockets are connected, several checks are performed.
See the
.Sx ERRORS
section for possible failures.
The connection between the two sockets is implemented by setting the
following fields in the
.Vt struct sosplice Va *so_sp
member of
.Vt struct socket :
.Pp
.Bl -dash -compact -offset indent
.It
.Vt struct socket Va *ssp_socket
links from the source to the drain socket.
.It
.Vt struct socket Va *ssp_soback
links back from the drain to the source socket.
.It
.Vt off_t Va ssp_len
counts the number of bytes spliced so far from this socket.
.It
.Vt off_t Va ssp_max
specifies the maximum number of bytes to splice from this socket, if
non-zero.
.It
.Vt struct timeval Va ssp_idletv
specifies the maximum idle time, if non-zero.
.It
.Vt struct timeout Va ssp_idleto
provides storage for the kernel timeout if an idle time is used.
.El
.Pp
After connecting both sockets,
.Fn sosplice
calls
.Fn somove
to transfer the mbufs already in the source's receive buffer to the
drain's send buffer.
Finally, the socket buffer flag
.Dv SB_SPLICE
is set on both socket buffers to indicate that the protocol layer
has to call
.Fn somove
whenever data or space is available.
.Pp
The function
.Fn somove
transfers data from the source's receive buffer to the drain's send
buffer.
It must be called at
.Xr splsoftnet 9 ,
and
.Fa so
must be a spliced source socket.
It may be necessary to split an mbuf to handle out-of-band data
inline or when the maximum splice length has been reached.
If
.Fa wait
is
.Dv M_WAIT ,
splitting mbufs will always succeed.
For
.Dv M_DONTWAIT ,
the out-of-band property might get lost or a short splice might
happen.
In the latter case, fewer than the given maximum number of bytes are
transferred, and userland has to cope with this.
Note that a short splice cannot happen if
.Fn somove
was called by
.Fn sosplice .
Thus, a second
.Xr setsockopt 2
call with the same maximum after a short splice will always
succeed.
.Pp
Before transferring data,
.Fn somove
checks both sockets for errors and verifies that the drain socket is
connected.
If the drain cannot send any more data, an
.Er EPIPE
error is set on the source socket.
The amount of data to move is limited by the optional maximum splice
length and by the free space in the drain's send socket buffer.
Up to this amount of data is taken out of the source's receive
socket buffer.
To avoid splicing loops created by userland, the number of times
an mbuf may be moved between sockets is limited to 128.
.Pp
For atomic protocols, either one complete packet is taken out, or
nothing is taken at all if one of the following applies:
.Bl -dash -compact -offset indent
.It
the packet is larger than the drain's send buffer size, in which
case splicing is aborted with an
.Er EMSGSIZE
error;
.It
the packet does not fit into the drain's current send buffer space,
in which case it is left in the source's receive buffer for later
processing;
.It
the maximum splice length falls within the packet, in which case
splicing is dissolved as for a short splice.
.El
All address or control mbufs associated with the taken packet are
dropped.
.Pp
For non-atomic protocols, an mbuf may be split when the maximum
splice length is reached.
Otherwise, an mbuf is either moved completely to the send buffer or
left in the receive buffer for later processing.
If
.Dv SO_OOBINLINE
is set, out-of-band data is moved as such, although this might not
be reliable.
The data is sent out to the drain socket via the protocol function.
If that fails and the drain socket cannot send any more data, an
.Er EPIPE
error is set on the source socket.
.Pp
For packet-oriented protocols,
.Fn somove
iterates over the next packet queue.
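.Pp
The transfer rules above can be condensed into a small userland model
of a single
.Fn somove
step.
This is a hedged sketch, not kernel code: the names
.Vt struct model_state ,
.Fn model_somove_step
and
.Dv MODEL_MAX_MOVES
are hypothetical, byte counts stand in for mbuf chains, the order of
the atomic checks is an assumption, and the
.Er ELOOP
error for the loop limit is a guess, since this page does not name
that error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of one somove() step.  Nothing here is kernel
 * API; plain byte counts replace mbuf chains.
 */
#define MODEL_MAX_MOVES	128	/* loop limit stated in the text */

enum model_action {
	MODEL_WAIT,		/* leave data in the source buffer for later */
	MODEL_MOVE,		/* move len bytes to the drain */
	MODEL_DISSOLVE		/* unsplice; error is set on the source */
};

struct model_result {
	enum model_action action;
	long long len;		/* bytes to move if action == MODEL_MOVE */
	int error;		/* errno for the source socket, 0 if none */
};

struct model_state {
	long long splice_len;	/* bytes spliced so far (cf. ssp_len) */
	long long splice_max;	/* splice limit, 0 = unlimited (cf. ssp_max) */
	long long drain_space;	/* free space in the drain's send buffer */
	long long drain_size;	/* total size of the drain's send buffer */
	long long avail;	/* queued bytes, or next packet size if atomic */
	bool atomic;		/* atomic protocol such as UDP */
	bool drain_cantsend;	/* drain can no longer send */
	int moves;		/* times this data was already moved */
};

static struct model_result
model_somove_step(const struct model_state *s)
{
	struct model_result r = { MODEL_WAIT, 0, 0 };
	bool limited;
	long long len;

	assert(s != NULL && s->splice_max >= 0 && s->splice_len >= 0);
	limited = s->splice_max > 0;

	if (s->drain_cantsend) {		/* drain is shut down */
		r.action = MODEL_DISSOLVE;
		r.error = EPIPE;
		return r;
	}
	if (s->moves >= MODEL_MAX_MOVES) {	/* likely splicing loop */
		r.action = MODEL_DISSOLVE;
		r.error = ELOOP;		/* assumed error value */
		return r;
	}
	if (limited && s->splice_len >= s->splice_max) {
		r.action = MODEL_DISSOLVE;	/* full splice completed */
		r.error = EFBIG;
		return r;
	}

	/* Upper bound: drain space, capped by what the maximum allows. */
	len = s->drain_space;
	if (limited && s->splice_max - s->splice_len < len)
		len = s->splice_max - s->splice_len;

	if (s->atomic) {
		if (s->avail > s->drain_size) {
			r.action = MODEL_DISSOLVE;	/* can never fit */
			r.error = EMSGSIZE;
		} else if (limited &&
		    s->avail > s->splice_max - s->splice_len) {
			r.action = MODEL_DISSOLVE;	/* short splice */
		} else if (s->avail > 0 && s->avail <= len) {
			r.action = MODEL_MOVE;		/* whole packet */
			r.len = s->avail;
		}
		return r;		/* otherwise wait for space */
	}

	/* Non-atomic: move what is queued, up to the bound. */
	if (s->avail < len)
		len = s->avail;
	if (len > 0) {
		r.action = MODEL_MOVE;
		r.len = len;
	}
	return r;
}
```

.Pp
For example, with a 1000-byte maximum of which 990 bytes have already
been spliced, a non-atomic source with 50 queued bytes and 100 bytes of
drain space moves only 10 bytes, while an atomic source whose next
packet is 100 bytes long dissolves the splice as a short splice.
Keeping the decision in one side-effect-free function makes each rule
from the text easy to check in isolation.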
.Pp
If a maximum splice length was specified and at least this amount
of data has been moved from the source to the drain socket, splicing
is dissolved.
In this case, an
.Er EFBIG
error is set on the source socket to indicate that the maximum amount
of data has been transferred.
Userland can check for this error to distinguish a full splice from
a short splice, or to react immediately to the completed splice.
If an idle timeout was specified and no data has been transferred
for that period of time, the handler
.Fn soidle
dissolves the splice and sets an
.Er ETIMEDOUT
error on the source socket.
.Pp
The function
.Fn sounsplice
is called to dissolve the socket splicing if:
.Bl -dash -compact -offset indent
.It
the source socket cannot receive any more data and its receive buffer
is empty;
.It
the drain socket cannot send any more data;
.It
the maximum splice length has been reached;
.It
an error has occurred; or
.It
the idle timeout has fired.
.El
.Pp
If the socket buffer flag
.Dv SB_SPLICE
is set, the functions
.Fn sorwakeup
and
.Fn sowwakeup
call
.Fn somove
to trigger the transfer when new data or buffer space is available.
While socket splicing is active, any
.Xr read 2
from the source socket will block.
Neither read nor write wakeups are delivered to the file
descriptors.
After the splice is dissolved, a read event or a socket error is
signaled to userland on the source socket.
If space is available, a write event is signaled on the drain
socket.
.Sh RETURN VALUES
.Fn sosplice
returns 0 on success; otherwise, it returns an error number.
.Fn somove
returns 0 if socket splicing has finished and 1 if it continues.
.Sh ERRORS
.Fn sosplice
will succeed unless:
.Bl -tag -width Er
.It Bq Er EBADF
The given file descriptor
.Fa fd
is not an active descriptor.
.It Bq Er EBUSY
The source or the drain socket is already spliced.
.It Bq Er EINVAL
The given maximum value
.Fa max
is negative.
.It Bq Er ENOTCONN
The source socket requires a connection and is neither connected
nor in the process of connecting to a peer.
.It Bq Er ENOTCONN
The drain socket is neither connected nor in the process of connecting
to a peer.
.It Bq Er ENOTSOCK
The given file descriptor
.Fa fd
is not a socket.
.It Bq Er EOPNOTSUPP
The source or the drain socket is a listen socket.
.It Bq Er EPROTONOSUPPORT
The source socket's protocol layer does not have the
.Dv PR_SPLICE
flag set.
Only TCP and UDP socket splicing is supported.
.It Bq Er EPROTONOSUPPORT
The drain socket's protocol does not have the same
.Fa pr_usrreq
function as the source.
.It Bq Er EWOULDBLOCK
The source socket is non-blocking and the receive buffer is already
locked.
.El
.Sh SEE ALSO
.Xr setsockopt 2 ,
.Xr options 4 ,
.Xr timeout 9
.Sh HISTORY
Socket splicing for TCP first appeared in
.Ox 4.9 ;
support for UDP was added in
.Ox 5.3 .
.Sh AUTHORS
.An -nosplit
The idea for socket splicing originally came from
.An Markus Friedl Aq Mt markus@openbsd.org ,
and
.An Alexander Bluhm Aq Mt bluhm@openbsd.org
implemented it.
.An Mike Belopuhov Aq Mt mikeb@openbsd.org
added the timeout feature.