.\"	$OpenBSD: sosplice.9,v 1.10 2019/07/04 17:42:17 bluhm Exp $
.\"
.\" Copyright (c) 2011-2013 Alexander Bluhm <bluhm@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: July 4 2019 $
.Dt SOSPLICE 9
.Os
.Sh NAME
.Nm sosplice ,
.Nm somove
.Nd splice two sockets for zero-copy data transfer
.Sh SYNOPSIS
.Ft int
.Fn sosplice "struct socket *so" "int fd" "off_t max" "struct timeval *tv"
.Ft int
.Fn somove "struct socket *so" "int wait"
.Sh DESCRIPTION
The function
.Fn sosplice
is used to splice together a source and a drain socket.
The source socket is passed as the
.Fa so
argument;
the file descriptor of the drain is passed in
.Fa fd .
If
.Fa fd
is negative, an existing splicing gets dissolved.
If
.Fa max
is positive, at most that many bytes will get transferred.
If
.Fa tv
is not NULL, a
.Xr timeout 9
is scheduled to dissolve splicing in the case when no data can be
transferred for the specified period of time.
Socket splicing can be invoked from userland via the
.Xr setsockopt 2
system call at the
.Dv SOL_SOCKET
level with the socket option
.Dv SO_SPLICE .
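.Pp
As a minimal userland sketch of this call, the following helper passes
the drain descriptor as a plain int option value, assuming that form is
accepted as described in
.Xr setsockopt 2 .
The name
.Fn splice_to
is hypothetical, and the fallback branch for systems that lack
.Dv SO_SPLICE
is an assumption made only to keep the example portable.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: splice the source socket src to the drain
 * socket drain at the SOL_SOCKET level.  No maximum length and no
 * idle timeout are requested.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
splice_to(int src, int drain)
{
#ifdef SO_SPLICE
	return setsockopt(src, SOL_SOCKET, SO_SPLICE, &drain,
	    sizeof(drain));
#else
	/* Assumption: without SO_SPLICE there is nothing to call. */
	(void)src;
	(void)drain;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```
.Ed
Calling
.Fn splice_to
with a negative
.Fa drain
requests that an existing splicing of
.Fa src
be dissolved.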
.Pp
Before connecting both sockets, several checks are executed.
See the
.Sx ERRORS
section for possible failures.
The connection between both sockets is implemented by setting these
additional fields in the
.Vt struct sosplice Va *so_sp
field in
.Vt struct socket :
.Pp
.Bl -dash -compact -offset indent
.It
.Vt struct socket Va *ssp_socket
links from the source to the drain socket.
.It
.Vt struct socket Va *ssp_soback
links back from the drain to the source socket.
.It
.Vt off_t Va ssp_len
counts the number of bytes spliced so far from this socket.
.It
.Vt off_t Va ssp_max
specifies the maximum number of bytes to splice from this socket if
non-zero.
.It
.Vt struct timeval Va ssp_idletv
specifies the maximum idle time if non-zero.
.It
.Vt struct timeout Va ssp_idleto
provides storage for the kernel timeout if idle time is used.
.El
.Pp
After connecting both sockets,
.Fn sosplice
calls
.Fn somove
to transfer the mbufs already in the source receive buffer to the
drain send buffer.
Finally the socket buffer flag
.Dv SB_SPLICE
is set on both socket buffers, to indicate that the protocol layer
has to call
.Fn somove
whenever data or space is available.
.Pp
The function
.Fn somove
transfers data from the source's receive buffer to the drain's send
buffer.
It must be called at
.Xr splsoftnet 9
and
.Fa so
must be a spliced source socket.
It may be necessary to split an mbuf to handle out-of-band data
inline or when the maximum splice length has been reached.
If
.Fa wait
is
.Dv M_WAIT ,
splitting mbufs will always succeed.
For
.Dv M_DONTWAIT
the out-of-band property might get lost or a short splice might
happen.
In the latter case, fewer than the given maximum number of bytes are
transferred and userland has to cope with this.
Note that a short splice cannot happen if
.Fn somove
was called by
.Fn sosplice .
Therefore a second
.Xr setsockopt 2
call after a short splice, specifying the same maximum, will always
succeed.
.Pp
Before transferring data,
.Fn somove
checks both sockets for errors and that the drain socket is connected.
If the drain cannot send anymore, an
.Er EPIPE
error is set on the source socket.
The data length to move is limited by the optional maximum splice
length and the space in the drain's send socket buffer.
Up to this amount of data is taken out of the source's receive
socket buffer.
To avoid splicing loops created by userland, the number of times
an mbuf may be moved between sockets is limited to 128.
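.Pp
The length limit described above can be modelled outside the kernel
as follows.
This is an illustrative sketch and not the kernel code: the function
.Fn splice_movelen
and all of its parameters are hypothetical names chosen for this example.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <assert.h>

/*
 * Hypothetical model of the length computation.
 *   avail  bytes queued in the source's receive buffer
 *   space  free bytes in the drain's send buffer
 *   max    maximum splice length, 0 meaning unlimited
 *   done   bytes spliced so far
 * Returns the number of bytes that may be moved now.
 */
off_t
splice_movelen(off_t avail, off_t space, off_t max, off_t done)
{
	off_t len = avail;

	assert(avail >= 0 && space >= 0 && max >= 0 && done >= 0);

	/* Limit by what is left of the optional maximum. */
	if (max != 0) {
		off_t left = (done < max) ? max - done : 0;

		if (len > left)
			len = left;
	}
	/* Limit by the space in the drain's send buffer. */
	if (len > space)
		len = space;
	return len;
}
```
.Ed
For example, with 1000 bytes queued, a maximum of 1500 bytes of which
1200 are already spliced, and ample send buffer space, the model
yields 300 bytes.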
.Pp
For atomic protocols, either one complete packet is taken out, or
nothing is taken at all if:
the packet is bigger than the drain's send buffer size, in which
case the splicing gets aborted with an
.Er EMSGSIZE
error;
the packet does not fit into the drain's current send buffer space,
in which case it is left in the source's receive buffer for later
processing;
or the maximum splice length is located within a packet, in which
case splicing gets dissolved like a short splice.
All address or control mbufs associated with the taken packet are
dropped.
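.Pp
The three cases for atomic protocols can be summarised in a small
decision function.
This is a sketch under stated assumptions: the enum, the function
.Fn atomic_decide
and its parameters are hypothetical, and the order in which the
conditions are tested simply follows the order of the text.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <assert.h>

enum atomic_action {
	ATOMIC_MOVE,		/* take the complete packet */
	ATOMIC_EMSGSIZE,	/* abort splicing with EMSGSIZE */
	ATOMIC_LATER,		/* leave packet in receive buffer */
	ATOMIC_SHORT		/* dissolve like a short splice */
};

/*
 * Hypothetical model of the all-or-nothing rule.
 *   pktlen    length of the next packet in the source
 *   sndsize   total size of the drain's send buffer
 *   sndspace  space currently free in the drain's send buffer
 *   has_max   non-zero if a maximum splice length was given
 *   maxleft   bytes left until that maximum is reached
 */
enum atomic_action
atomic_decide(size_t pktlen, size_t sndsize, size_t sndspace,
    int has_max, size_t maxleft)
{
	assert(sndspace <= sndsize);

	if (pktlen > sndsize)
		return ATOMIC_EMSGSIZE;	/* can never fit */
	if (pktlen > sndspace)
		return ATOMIC_LATER;	/* may fit once space frees up */
	if (has_max && maxleft < pktlen)
		return ATOMIC_SHORT;	/* maximum lies within packet */
	return ATOMIC_MOVE;
}
```
.Ed
In this model a packet is never split: every outcome other than
ATOMIC_MOVE leaves the packet untouched in the source's receive buffer.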
.Pp
If the maximum splice length has been reached, an mbuf may get
split for non-atomic protocols.
Otherwise an mbuf is either moved completely to the send buffer or
left in the receive buffer for later processing.
If
.Dv SO_OOBINLINE
is set, out-of-band data will get moved as such
although this might not be reliable.
The data is sent out to the drain socket via the protocol function.
If that fails and the drain socket cannot send anymore, an
.Er EPIPE
error is set on the source socket.
.Pp
For packet-oriented protocols
.Fn somove
iterates over the next packet queue.
.Pp
If a maximum splice length was specified and at least this amount
of data has been moved to the drain socket, splicing gets
dissolved.
In this case, an
.Er EFBIG
error is set on the source socket if the maximum amount of data has
been transferred.
Userland can process this error to distinguish the full splice from
a short splice or to react to the completed maximum splice immediately.
If an idle timeout was specified and no data has been transferred
for that period of time, the handler
.Fn soidle
dissolves splicing and sets an
.Er ETIMEDOUT
error on the source socket.
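.Pp
From the userland side, the errors named above can be told apart with
a small classifier.
The sketch assumes that the application has already fetched the pending
error of the source socket, for example with the
.Dv SO_ERROR
socket option; the enum and the function
.Fn splice_classify
are hypothetical names.
.Bd -literal -offset indent
```c
#include <errno.h>
#include <assert.h>

enum splice_end {
	SPLICE_FULL,		/* EFBIG: maximum length transferred */
	SPLICE_IDLE,		/* ETIMEDOUT: idle timeout fired */
	SPLICE_NO_ERROR,	/* no error, e.g. a short splice */
	SPLICE_FAILED		/* any other socket error */
};

/* Map the pending error of the source socket to an outcome. */
enum splice_end
splice_classify(int so_error)
{
	assert(so_error >= 0);

	switch (so_error) {
	case EFBIG:
		return SPLICE_FULL;
	case ETIMEDOUT:
		return SPLICE_IDLE;
	case 0:
		return SPLICE_NO_ERROR;
	default:
		return SPLICE_FAILED;
	}
}
```
.Ed
An application that receives SPLICE_NO_ERROR after requesting a maximum
would typically compare the number of bytes moved against that maximum
and splice again for the remainder.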
.Pp
The function
.Fn sounsplice
is called to dissolve the socket splicing if the source socket
cannot receive anymore and its receive buffer is empty; or if the
drain socket cannot send anymore; or if the maximum has been reached;
or if an error occurred; or if the idle timeout has fired.
.Pp
If the socket buffer flag
.Dv SB_SPLICE
is set, the functions
.Fn sorwakeup
and
.Fn sowwakeup
will call
.Fn somove
to trigger the transfer when new data or buffer space is available.
While socket splicing is active, any
.Xr read 2
from the source socket will block.
Neither read nor write wakeups will be delivered to the file
descriptors.
After dissolving, a read event or a socket error is signaled to
userland on the source socket.
If space is available, a write event will be signaled on the drain
socket.
.Sh RETURN VALUES
.Fn sosplice
returns 0 on success and otherwise the error number.
.Fn somove
returns 0 if socket splicing has been finished and 1 if it continues.
.Sh ERRORS
.Fn sosplice
will succeed unless:
.Bl -tag -width Er
.It Bq Er EBADF
The given file descriptor
.Fa fd
is not an active descriptor.
.It Bq Er EBUSY
The source or the drain socket is already spliced.
.It Bq Er EINVAL
The given maximum value
.Fa max
is negative.
.It Bq Er ENOTCONN
The source socket requires a connection and is neither connected
nor in the process of connecting to a peer.
.It Bq Er ENOTCONN
The drain socket is neither connected nor in the process of connecting
to a peer.
.It Bq Er ENOTSOCK
The given file descriptor
.Fa fd
is not a socket.
.It Bq Er EOPNOTSUPP
The source or the drain socket is a listen socket.
.It Bq Er EPROTONOSUPPORT
The source socket's protocol layer does not have the
.Dv PR_SPLICE
flag set.
Only TCP and UDP socket splicing is supported.
.It Bq Er EPROTONOSUPPORT
The drain socket's protocol does not have the same
.Fa pr_usrreq
function as the source.
.It Bq Er EWOULDBLOCK
The source socket is non-blocking and the receive buffer is already
locked.
.El
.Sh SEE ALSO
.Xr setsockopt 2 ,
.Xr options 4 ,
.Xr timeout 9
.Sh HISTORY
Socket splicing for TCP first appeared in
.Ox 4.9 ;
support for UDP was added in
.Ox 5.3 .
.Sh AUTHORS
.An -nosplit
The idea for socket splicing originally came from
.An Markus Friedl Aq Mt markus@openbsd.org ,
and
.An Alexander Bluhm Aq Mt bluhm@openbsd.org
implemented it.
.An Mike Belopuhov Aq Mt mikeb@openbsd.org
added the timeout feature.
278