.\"	$OpenBSD: sosplice.9,v 1.7 2013/07/17 20:21:55 schwarze Exp $
.\"
.\" Copyright (c) 2011-2013 Alexander Bluhm <bluhm@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: July 17 2013 $
.Dt SOSPLICE 9
.Os
.Sh NAME
.Nm sosplice ,
.Nm somove
.Nd splice two sockets for zero-copy data transfer
.Sh SYNOPSIS
.Ft int
.Fn sosplice "struct socket *so" "int fd" "off_t max" "struct timeval *tv"
.Ft int
.Fn somove "struct socket *so" "int wait"
.Sh DESCRIPTION
The function
.Fn sosplice
is used to splice together a source and a drain socket.
The source socket is passed as the
.Fa so
argument;
the file descriptor of the drain is passed in
.Fa fd .
If
.Fa fd
is negative, an existing splicing gets dissolved.
If
.Fa max
is positive, at most that many bytes will get transferred.
If
.Fa tv
is not NULL, a
.Xr timeout 9
is scheduled to dissolve splicing in the case when no data can be
transferred for the specified period of time.
Socket splicing can be invoked from userland via the
.Xr setsockopt 2
system-call at the
.Dv SOL_SOCKET
level with the socket option
.Dv SO_SPLICE .
.Pp
Before connecting both sockets, several checks are executed.
See the
.Sx ERRORS
section for possible failures.
The connection between both sockets is implemented by setting these
additional fields in
.Vt struct socket :
.Pp
.Bl -dash -compact -offset indent
.It
.Vt struct socket Fa *so_splice
links from the source to the drain socket.
.It
.Vt struct socket Fa *so_spliceback
links back from the drain to the source socket.
.It
.Vt off_t Fa so_splicelen
counts the number of bytes spliced so far from this socket.
.It
.Vt off_t Fa so_splicemax
specifies the maximum number of bytes to splice from this socket if
non-zero.
.It
.Vt struct timeval Fa so_idletv
specifies the maximum idle time if non-zero.
.It
.Vt struct timeout Fa so_idleto
provides storage for the kernel timeout if idle time is used.
.El
.Pp
After connecting both sockets,
.Fn sosplice
calls
.Fn somove
to transfer the mbufs already in the source receive buffer to the
drain send buffer.
Finally the socket buffer flag
.Dv SB_SPLICE
is set on both socket buffers, to indicate that the protocol layer
has to call
.Fn somove
whenever data or space is available.
.Pp
The function
.Fn somove
transfers data from the source's receive buffer to the drain's send
buffer.
It must be called at
.Xr splsoftnet 9
and
.Fa so
must be a spliced source socket.
It may be necessary to split an mbuf to handle out-of-band data
inline or when the maximum splice length has been reached.
If
.Fa wait
is
.Dv M_WAIT ,
splitting mbufs will always succeed.
For
.Dv M_DONTWAIT
the out-of-band property might get lost or a short splice might
happen.
In the latter case, fewer than the given maximum number of bytes are
transferred and userland has to cope with this.
Note that a short splice cannot happen if
.Fn somove
was called by
.Fn sosplice .
So a second
.Xr setsockopt 2
after a short splice pointing to the same maximum will always
succeed.
.Pp
Before transferring data,
.Fn somove
checks both sockets for errors and that the drain socket is connected.
If the drain cannot send anymore, an
.Er EPIPE
error is set on the source socket.
The data length to move is limited by the optional maximum splice
length and the space in the drain's send socket buffer.
Up to this amount of data is taken out of the source's receive
socket buffer.
.Pp
For atomic protocols, either one complete packet is taken out, or
nothing is taken at all if:
the packet is bigger than the drain's send buffer size, in which
case the splicing gets aborted with an
.Er EMSGSIZE
error;
the packet does not fit into the drain's current send buffer space,
in which case it is left in the source's receive buffer for later
processing;
or the maximum splice length is located within a packet, in which
case splicing gets dissolved like a short splice.
All address or control mbufs associated with the taken packet are
dropped.
.Pp
If the maximum splice length has been reached, an mbuf may get
split for non-atomic protocols.
Otherwise an mbuf is either moved completely to the send buffer or
left in the receive buffer for later processing.
If
.Dv SO_OOBINLINE
is set, out-of-band data will get moved as such
although this might not be reliable.
The data is sent out to the drain socket via the protocol function.
If that fails and the drain socket cannot send anymore, an
.Er EPIPE
error is set on the source socket.
.Pp
For packet-oriented protocols
.Fn somove
iterates over the next packet queue.
.Pp
If a maximum splice length was specified and at least this amount
of data has been transferred to the drain socket, splicing gets
dissolved.
In this case, an
.Er EFBIG
error is set on the source socket if the maximum amount of data has
been transferred.
Userland can process this error to distinguish the full splice from
a short splice or to react to the completed maximum splice immediately.
If an idle timeout was specified and no data has been transferred
for that period of time, the handler
.Fn soidle
dissolves splicing and sets an
.Er ETIMEDOUT
error on the source socket.
.Pp
The function
.Fn sounsplice
is called to dissolve the socket splicing if the source socket
cannot receive anymore and its receive buffer is empty; or if the
drain socket cannot send anymore; or if the maximum has been reached;
or if an error occurred; or if the idle timeout has fired.
.Pp
If the socket buffer flag
.Dv SB_SPLICE
is set, the functions
.Fn sorwakeup
and
.Fn sowwakeup
will call
.Fn somove
to trigger the transfer when new data or buffer space is available.
While socket splicing is active, any
.Xr read 2
from the source socket will block and the wakeup will not be delivered
to the file descriptor.
A read event or a socket error is signaled to userland after
dissolving.
.Sh RETURN VALUES
.Fn sosplice
returns 0 on success and otherwise the error number.
.Fn somove
returns 0 if socket splicing has been finished and 1 if it continues.
.Sh ERRORS
.Fn sosplice
will succeed unless:
.Bl -tag -width Er
.It Bq Er EBADF
The given file descriptor
.Fa fd
is not an active descriptor.
.It Bq Er EBUSY
The source or the drain socket is already spliced.
.It Bq Er EINVAL
The given maximum value
.Fa max
is negative.
.It Bq Er ENOTCONN
The source socket requires a connection and is neither connected
nor in the process of connecting to a peer.
.It Bq Er ENOTCONN
The drain socket is neither connected nor in the process of connecting
to a peer.
.It Bq Er ENOTSOCK
The given file descriptor
.Fa fd
is not a socket.
.It Bq Er EOPNOTSUPP
The source or the drain socket is a listen socket.
.It Bq Er EPROTONOSUPPORT
The source socket's protocol layer does not have the
.Dv PR_SPLICE
flag set.
Only TCP and UDP socket splicing is supported.
.It Bq Er EPROTONOSUPPORT
The drain socket's protocol does not have the same
.Fa pr_usrreq
function as the source.
.It Bq Er EWOULDBLOCK
The source socket is non-blocking and the receive buffer is already
locked.
.El
.Sh SEE ALSO
.Xr setsockopt 2 ,
.Xr options 4 ,
.Xr timeout 9
.Sh HISTORY
Socket splicing for TCP first appeared in
.Ox 4.9 ;
support for UDP was added in
.Ox 5.3 .
.Sh AUTHORS
.An -nosplit
The idea for socket splicing originally came from
.An Markus Friedl Aq Mt markus@openbsd.org ,
and
.An Alexander Bluhm Aq Mt bluhm@openbsd.org
implemented it.
.An Mike Belopuhov Aq Mt mikeb@openbsd.org
added the timeout feature.
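.Sh EXAMPLES
The length limit that
.Fn somove
applies before moving data can be illustrated with a small userland
model.
This is only a sketch under stated assumptions, not the kernel
implementation: the function names
.Fn model_splice_len
and
.Fn model_splice_done
and all of their parameters are hypothetical, and the model covers
non-atomic protocols only, where an mbuf may be split at the maximum.

```c
#include <stdint.h>

/*
 * Hypothetical userland model, not the kernel implementation.
 * avail: bytes waiting in the source's receive buffer
 * space: free bytes in the drain's send buffer
 * max:   maximum splice length, 0 means no maximum
 * done:  bytes spliced so far
 * Returns the number of bytes a single pass may move.
 */
int64_t
model_splice_len(int64_t avail, int64_t space, int64_t max, int64_t done)
{
	int64_t len = avail;

	/* Never move more than fits into the drain's send buffer. */
	if (len > space)
		len = space;
	/* Never move past the optional maximum splice length. */
	if (max > 0 && len > max - done)
		len = max - done;
	return (len < 0 ? 0 : len);
}

/* Returns 1 once a non-zero maximum has been reached, otherwise 0. */
int
model_splice_done(int64_t max, int64_t done)
{
	return (max > 0 && done >= max);
}
```

With 500 bytes available, 800 bytes of drain space, a maximum of 1000
and 900 bytes already spliced, the model allows 100 bytes to move.
After that pass the maximum is reached, which is the situation in which
the text above says splicing gets dissolved with
.Er EFBIG
set on the source socket.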