.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD: src/share/man/man4/tcp.4,v 1.11.2.14 2002/12/29 16:35:38 schweikh Exp $
.\" $DragonFly: src/share/man/man4/tcp.4,v 1.2 2003/06/17 04:36:59 dillon Exp $
.\"
.Dd February 14, 1995
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.  It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.  TCP uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq port addresses .
Thus, each address is composed
of an Internet address specifying the host and network, with
a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the tcp protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.  By default
.Tn TCP
sockets are created active; to create a
passive socket the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.  Only
passive sockets may use the
.Xr accept 2
call to accept incoming connections.  Only active sockets may
use the
.Xr connect 2
call to initiate connections.
.Tn TCP
also supports a more datagram-like mode, called Transaction
.Tn TCP ,
which is described in
.Xr ttcp 4 .
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.  This
technique, termed
.Dq wildcard addressing ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.  The
.Tn TCP
port may still be specified
at this time; if the port is not specified the system will assign one.
Once a connection has been established the socket's address is
fixed by the peer entity's location.  The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.  Normally
this address corresponds to the peer entity's network.
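.Pp
The wildcard binding described above might be set up as in the following
minimal sketch.
The helper names open_wildcard_listener and local_port are hypothetical
and are used here only for illustration; error handling is reduced to
returning a failure value.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to INADDR_ANY.  A port of 0 asks
 * the system to assign one.  Returns the descriptor, or -1 on error.
 * open_wildcard_listener is a hypothetical helper name.
 */
int
open_wildcard_listener(in_port_t port, int backlog)
{
	struct sockaddr_in sin;
	int s;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(port);			/* 0: system picks */

	/* bind(2) first, then listen(2) makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, backlog) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}

/* Return the local port of a bound socket in host byte order, 0 on error. */
in_port_t
local_port(int s)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) == -1)
		return (0);
	return (ntohs(sin.sin_port));
}
```

A caller that passes a port of 0 can use local_port to learn which port
the system assigned before advertising the service.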
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width TCP_NODELAYx
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender\- and receiver-TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.  The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.  The boolean
option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the sender-TCP
will set the
.Dq push
bit and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
The
.Dv TCP_NOPUSH
option is provided to allow servers to easily make use of Transaction
TCP (see
.Xr ttcp 4 ) .
When the option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.Aq Pa netinet/tcp.h .
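.Pp
As an illustration of the option level and option names above, the
following sketch sets and reads back TCP_NODELAY and queries TCP_MAXSEG.
The helper names set_nodelay, get_nodelay and get_maxseg are
hypothetical; only the setsockopt(2) and getsockopt(2) calls and the
option constants come from the interfaces described here.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Set or clear TCP_NODELAY on socket s.  Returns 0 on success and
 * -1 on error, with errno set by setsockopt(2).
 * set_nodelay is a hypothetical helper name.
 */
int
set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Read TCP_NODELAY back: 1 if set, 0 if clear, -1 on error. */
int
get_nodelay(int s)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
		return (-1);
	return (val != 0);
}

/* Query the maximum segment size currently in effect; -1 on error. */
int
get_maxseg(int s)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == -1)
		return (-1);
	return (mss);
}
```

The value reported for TCP_MAXSEG before a connection is established is
a system default and may differ from the value negotiated later.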
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Sh MIB VARIABLES
The
.Nm
protocol implements a number of variables in the
.Li net.inet
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width TCPCTL_DO_RFC1644
.It Dv TCPCTL_DO_RFC1323
.Pq tcp.rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default true).
.It Dv TCPCTL_DO_RFC1644
.Pq tcp.rfc1644
Implement Transaction
.Tn TCP ,
as described in RFC 1644.
.It Dv TCPCTL_MSSDFLT
.Pq tcp.mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq tcp.sendspace
Maximum TCP send window.
.It Dv TCPCTL_RECVSPACE
.Pq tcp.recvspace
Maximum TCP receive window.
.It tcp.log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
The value of 1 limits the logging to SYN (connection establishment)
packets only.
That of 2 results in any TCP packets to closed ports being logged.
Any value unlisted above disables the logging
(default is 0, i.e., the logging is disabled).
.It tcp.slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase on a non-local network.
.It tcp.local_slowstart_flightsize
The number of packets allowed to be in-flight during the
.Tn TCP
slow-start phase to local machines in the same subnet.
.It tcp.msl
The Maximum Segment Lifetime for a packet.
.It tcp.keepinit
Timeout for new, non-established TCP connections.
.It tcp.keepidle
Amount of time the connection should be idle before keepalive
probes (if enabled) are sent.
.It tcp.keepintvl
The interval between keepalive probes sent to remote machines.
After
.Dv TCPTV_KEEPCNT
(default 8) probes are sent, with no response, the connection is dropped.
.It tcp.always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It tcp.icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It tcp.do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It tcp.blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It tcp.delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It tcp.delacktime
Maximum amount of time before a delayed ACK is sent.
.It tcp.newreno
Enable TCP NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It tcp.path_mtu_discovery
Enable Path MTU Discovery.
.It tcp.tcbhashsize
Size of the
.Tn TCP
control-block hashtable
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It tcp.pcbcount
Number of active protocol control blocks
(read-only).
.It tcp.syncookies
Determines whether or not SYN cookies should be generated for
outbound SYN-ACK packets.  SYN cookies are a great help during
SYN flood attacks, and are enabled by default.
.It tcp.isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It tcp.rexmit_{min,slop}
Adjust the retransmit timer calculation for TCP.  The slop is
typically added to the raw calculation to take into account
occasional variances that the SRTT (smoothed round trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.  While a number of TCP RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.  For this reason we suggest changing the slop to 200ms and
setting the minimum to something out of the way, like 20ms,
which gives you an effective minimum of 200ms (similar to Linux).
.It tcp.inflight_enable
Enable
.Tn TCP
bandwidth delay product limiting.  An attempt will be made to calculate
the bandwidth delay product for each individual TCP connection and limit
the amount of inflight data being transmitted to avoid building up
unnecessary packets in the network.  This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large TCP windows.  In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.  Note that bandwidth delay product limiting only affects
the transmit side of a TCP connection.
.It tcp.inflight_debug
Enable debugging for the bandwidth delay product algorithm.  This may
default to on (1) so if you enable the algorithm you should probably also
disable debugging by setting this variable to 0.
.It tcp.inflight_min
This puts a lower bound on the bandwidth delay product window, in bytes.
A value of 1024 is typically used for debugging.  Values of 6000-16000 are
more typical in a production installation.  Setting this value too low may
result in slow ramp-up times for bursty connections.  Setting this value
too high effectively disables the algorithm.
.It tcp.inflight_max
This puts an upper bound on the bandwidth delay product window, in bytes.
This value should not generally be modified but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit to smooth data flow over a
network while still being able to specify huge internal TCP buffers.
.It tcp.inflight_stab
The bandwidth delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.  This parameter determines the
extra window in units of one tenth of a maximal packet.  The default value
of 20 represents 2 maximal packets.  Reducing this value is not recommended
but you may come across a situation with very slow links where the ping time
reduction of the default inflight code is not sufficient.  If this case
occurs you should first try reducing tcp.inflight_min and, if that does not
work, reduce both tcp.inflight_min and tcp.inflight_stab, trying values of
15, 10, or 5 for the latter.  Never use a value less than 5.  Reducing
tcp.inflight_stab can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.El
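.Pp
The arithmetic behind the retransmit timer and tcp.inflight_stab
descriptions above can be sketched as follows.
This is a simplified model inferred from the wording of those entries
(slop added to the raw value, result clamped from below by the minimum;
stab counted in tenths of a maximal packet), not code taken from the
kernel.
The function names and the 1460-byte segment size used in the note
below are assumptions made for illustration.

```c
/*
 * Simplified model of the retransmit timeout, in milliseconds:
 * the slop is added to the raw (SRTT-derived) value and the result
 * is clamped from below by the minimum.  This is an assumption drawn
 * from the description of the rexmit variables, not kernel code.
 */
unsigned int
rexmit_timeout_ms(unsigned int raw_ms, unsigned int slop_ms,
    unsigned int min_ms)
{
	unsigned int t = raw_ms + slop_ms;

	return (t < min_ms ? min_ms : t);
}

/*
 * Extra window, in bytes, implied by tcp.inflight_stab, which is
 * expressed in tenths of a maximal packet.  maxseg is the size of a
 * maximal packet in bytes and is supplied by the caller.
 */
unsigned int
inflight_stab_bytes(unsigned int stab, unsigned int maxseg)
{
	return (stab * maxseg / 10);
}
```

Under this model a slop of 200 and a minimum of 20 never produce a
timeout below 200ms, because the slop is always added, which matches the
effective minimum quoted above.
With an assumed maximal packet of 1460 bytes, the default stab value of
20 corresponds to 2920 bytes, that is, 2 maximal packets.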
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists.
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
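.Pp
As one example of how these errors surface, the sketch below binds a
socket to the loopback address and reports the errno value of a failed
call to the caller.
The helper name bind_loopback is hypothetical, and the choice of the
loopback address is only for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Bind a new TCP socket to the loopback address and the given port
 * (host byte order; 0 lets the system choose).  On success the
 * descriptor is returned and *err is 0; on failure -1 is returned
 * and *err holds the errno value from the failing call.
 * bind_loopback is a hypothetical helper name.
 */
int
bind_loopback(in_port_t port, int *err)
{
	struct sockaddr_in sin;
	int s;

	*err = 0;
	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s == -1) {
		*err = errno;
		return (-1);
	}

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		*err = errno;	/* save before close(2) can change it */
		close(s);
		return (-1);
	}
	return (s);
}
```

Calling bind_loopback a second time with a port that an earlier call
still holds is expected to fail with EADDRINUSE, since neither socket
sets any address reuse option.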
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr ttcp 4
.Rs
.%A V. Jacobson
.%A R. Braden
.%A D. Borman
.%T "TCP Extensions for High Performance"
.%O RFC 1323
.Re
.Rs
.%A R. Braden
.%T "T/TCP \- TCP Extensions for Transactions"
.%O RFC 1644
.Re
.Sh HISTORY
The
.Nm
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .