.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD: src/share/man/man4/tcp.4,v 1.11.2.14 2002/12/29 16:35:38 schweikh Exp $
.\" $DragonFly: src/share/man/man4/tcp.4,v 1.9 2008/10/17 11:30:24 swildner Exp $
.\"
.Dd February 14, 1995
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.  It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq port addresses .
Thus, each address is composed
of an Internet address specifying the host and network, with
a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.  By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.  Only
passive sockets may use the
.Xr accept 2
call to accept incoming connections.  Only active sockets may
use the
.Xr connect 2
call to initiate connections.
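.Pp
The following is a minimal sketch, not a definitive implementation, of the
passive and active roles described above.
The helper names
.Fn passive_open
and
.Fn active_open
are hypothetical and exist only for illustration; the loopback address,
the system-assigned port and the backlog of 5 are assumptions.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a passive (listening) TCP socket bound to 127.0.0.1 on a
 * system-assigned port.  Returns the descriptor, or -1 on error.
 * The chosen port is stored in *port in host byte order.
 */
int
passive_open(unsigned short *port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;		/* let the system pick a port */
	/* bind(2) first, then listen(2) turns the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(s, 5) < 0 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) < 0) {
		close(s);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (s);
}

/*
 * Create an active TCP socket and connect it to 127.0.0.1:port.
 * Returns the connected descriptor, or -1 on error.
 */
int
active_open(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);
	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		close(s);
		return (-1);
	}
	return (s);
}
```

A caller would typically invoke
.Fn passive_open
once, hand the returned port to a client that calls
.Fn active_open ,
and then call
.Xr accept 2
on the passive descriptor to obtain the server side of the connection.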
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.  This
technique, termed
.Dq wildcard addressing ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.  The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
Once a connection has been established, the socket's address is
fixed by the peer entity's location.  The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.  Normally
this address corresponds to the peer entity's network.
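.Pp
Wildcard addressing can be sketched as follows.
This is an illustrative example under stated assumptions rather than a
reference implementation: the helper names
.Fn listen_wildcard
and
.Fn local_address
are hypothetical, and the backlog of 5 is arbitrary.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Listen on all local networks by binding INADDR_ANY.  A port of 0
 * asks the system to assign one.  Returns the descriptor or -1; the
 * port actually in use is stored in *assigned (host byte order).
 */
int
listen_wildcard(unsigned short port, unsigned short *assigned)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* underspecified address */
	sin.sin_port = htons(port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(s, 5) < 0 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) < 0) {
		close(s);
		return (-1);
	}
	*assigned = ntohs(sin.sin_port);
	return (s);
}

/*
 * Format the local IPv4 address of socket s into buf.
 * Returns 0 on success, -1 on error.
 */
int
local_address(int s, char *buf, size_t buflen)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(s, (struct sockaddr *)&sin, &len) < 0)
		return (-1);
	if (inet_ntop(AF_INET, &sin.sin_addr, buf, (socklen_t)buflen) == NULL)
		return (-1);
	return (0);
}
```

Calling
.Fn local_address
on the listening socket reports the wildcard address, while calling it on a
descriptor returned by
.Xr accept 2
reports the specific interface address that the connection was fixed to.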
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width TCP_NODELAYx
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender\- and receiver-TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.  The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.  The boolean
option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the sender-TCP
will set the
.Dq push
bit and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When the
.Dv TCP_NOPUSH
option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_SIGNATURE_ENABLE
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a DragonFlyBSD router deployment is to enable
such routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4 (AF_INET) sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcpsignature_compute: SADB lookup failed for %d.%d.%d.%d" .
.It Dv TCP_KEEPINIT
If a
.Tn TCP
connection cannot be established within a period of time,
.Tn TCP
will time out the connection attempt.
The
.Dv TCP_KEEPINIT
option specifies the number of milliseconds to wait
before the connection attempt times out.
The default value for
.Dv TCP_KEEPINIT
is tcp.keepinit milliseconds.
For accepted sockets, the
.Dv TCP_KEEPINIT
option value is inherited from the listening socket.
.It Dv TCP_KEEPIDLE
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPIDLE
option specifies the number of milliseconds before
.Tn TCP
will send the initial keepalive probe.
The default value for
.Dv TCP_KEEPIDLE
is tcp.keepidle milliseconds.
For accepted sockets, the
.Dv TCP_KEEPIDLE
option value is inherited from the listening socket.
.It Dv TCP_KEEPINTVL
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPINTVL
option specifies the number of milliseconds to wait
before retransmitting a keepalive probe.
The default value for
.Dv TCP_KEEPINTVL
is tcp.keepintvl milliseconds.
For accepted sockets, the
.Dv TCP_KEEPINTVL
option value is inherited from the listening socket.
.It Dv TCP_KEEPCNT
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPCNT
option specifies the maximum number of keepalive
probes to be sent before dropping the connection.
The default value for
.Dv TCP_KEEPCNT
is the value of tcp.keepcnt (a probe count, not a time).
For accepted sockets, the
.Dv TCP_KEEPCNT
option value is inherited from the listening socket.
.El
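.Pp
As a hedged illustration of how the options above might be applied, the
sketch below toggles
.Dv TCP_NODELAY
and configures the keepalive options on one socket.
The helper names
.Fn tcp_set_nodelay ,
.Fn tcp_get_nodelay
and
.Fn tcp_set_keepalive
are hypothetical.
The timer arguments are passed to the kernel unchanged; this page specifies
milliseconds, and treating the same numbers as portable to other systems
would be an unverified assumption.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Enable (on != 0) or disable TCP_NODELAY.  Returns 0 or -1. */
int
tcp_set_nodelay(int s, int on)
{
	return (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/* Read TCP_NODELAY back: 1 if set, 0 if clear, -1 on error. */
int
tcp_get_nodelay(int s)
{
	int on = 0;
	socklen_t len = sizeof(on);

	if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
		return (-1);
	return (on != 0);
}

/*
 * Enable SO_KEEPALIVE and set the per-socket keepalive parameters:
 * idle time before the first probe, interval between probes, and
 * the number of probes sent before the connection is dropped.
 * Returns 0 on success, -1 on the first failing call.
 */
int
tcp_set_keepalive(int s, int idle, int intvl, int cnt)
{
	int on = 1;

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
		return (-1);
	return (0);
}
```

Because accepted sockets inherit the keepalive values from the listening
socket, a server could call
.Fn tcp_set_keepalive
once on the passive socket instead of on every accepted connection.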
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
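.Pp
The two ways of obtaining the option level can be sketched as below.
The helper names
.Fn tcp_option_level
and
.Fn tcp_get_maxseg
are hypothetical, and falling back to
.Dv IPPROTO_TCP
when the protocol database lookup fails is an assumption of this example,
not documented behavior.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Return the option level to use with setsockopt(2)/getsockopt(2)
 * for TCP options.  The protocol database is consulted first; if the
 * lookup fails, the IPPROTO_TCP constant is used instead.
 */
int
tcp_option_level(void)
{
	struct protoent *pe = getprotobyname("tcp");

	return (pe != NULL ? pe->p_proto : IPPROTO_TCP);
}

/*
 * Query TCP_MAXSEG using the level computed above.
 * Returns the maximum segment size, or -1 on error.
 */
int
tcp_get_maxseg(int s)
{
	int mss = 0;
	socklen_t len = sizeof(mss);

	if (getsockopt(s, tcp_option_level(), TCP_MAXSEG, &mss, &len) < 0)
		return (-1);
	return (mss);
}
```

In practice most programs simply pass
.Dv IPPROTO_TCP
and skip the database lookup.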
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Sh MIB VARIABLES
The
.Nm
protocol implements a number of variables in the
.Li net.inet
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width TCPCTL_DO_RFC1644
.It Dv TCPCTL_DO_RFC1323
.Pq tcp.rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default true).
.It Dv TCPCTL_MSSDFLT
.Pq tcp.mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq tcp.sendspace
Maximum TCP send window.
.It Dv TCPCTL_RECVSPACE
.Pq tcp.recvspace
Maximum TCP receive window.
.It tcp.log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to SYN (connection establishment)
packets only.
A value of 2 results in any TCP packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It tcp.msl
The Maximum Segment Lifetime for a packet.
.It tcp.keepinit
Timeout for new, non-established TCP connections.
.It tcp.keepidle
Amount of time the connection should be idle before keepalive
probes (if enabled) are sent.
.It tcp.keepintvl
The interval between keepalive probes sent to remote machines.
After
tcp.keepcnt
(default 8) probes are sent with no response, the connection is dropped.
.It tcp.keepcnt
The maximum number of keepalive probes to be sent
before dropping the connection.
.It tcp.always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It tcp.icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It tcp.do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It tcp.blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It tcp.delayed_ack
Delay ACK to try to piggyback it onto a data packet.
.It tcp.delacktime
Maximum amount of time before a delayed ACK is sent.
.It tcp.newreno
Enable TCP NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It tcp.path_mtu_discovery
Enables Path MTU Discovery.  PMTU Discovery is helpful for avoiding
IP fragmentation when transferring lots of data to the same client.
For web servers, where most of the connections are short and to
different clients, PMTU Discovery actually hurts performance due
to unnecessary retransmissions.  Turn this on only if most of your
TCP connections are long transfers or are repeatedly to the same
set of clients.
.It tcp.tcbhashsize
Size of the
.Tn TCP
control-block hashtable
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It tcp.pcbcount
Number of active protocol control blocks
(read-only).
.It tcp.syncookies
Determines whether or not SYN cookies should be generated for
outbound SYN-ACK packets.  SYN cookies are a great help during
SYN flood attacks, and are enabled by default.
.It tcp.isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It tcp.rexmit_{min,slop}
Adjust the retransmit timer calculation for TCP.  The slop is
typically added to the raw calculation to take into account
occasional variances that the SRTT (smoothed round trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.  While a number of TCP RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.  For this reason we suggest changing the slop to 200ms and
setting the minimum to something out of the way, like 20ms,
which gives you an effective minimum of 200ms (similar to Linux).
.It tcp.inflight_enable
Enable
.Tn TCP
bandwidth delay product limiting.  An attempt will be made to calculate
the bandwidth delay product for each individual TCP connection and limit
the amount of inflight data being transmitted to avoid building up
unnecessary packets in the network.  This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large TCP windows.  In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.  Note that bandwidth delay product limiting only affects
the transmit side of a TCP connection.
.It tcp.inflight_debug
Enable debugging for the bandwidth delay product algorithm.  This may
default to on (1), so if you enable the algorithm you should probably also
disable debugging by setting this variable to 0.
.It tcp.inflight_min
This puts a lower bound on the bandwidth delay product window, in bytes.
A value of 1024 is typically used for debugging.  6000-16000 is more typical
in a production installation.  Setting this value too low may result in
slow ramp-up times for bursty connections.  Setting this value too high
effectively disables the algorithm.
.It tcp.inflight_max
This puts an upper bound on the bandwidth delay product window, in bytes.
This value should not generally be modified but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit to smooth data flow over a
network while still being able to specify huge internal TCP buffers.
.It tcp.inflight_stab
This value stabilizes the bwnd (write window) calculation at high speeds
by increasing the bandwidth calculation in 1/10% increments.  The default
value of 50 represents a +5% increase.  In addition, bwnd is further increased
by a fixed 2*maxseg bytes to stabilize the algorithm at low speeds.
Changing the stab value is not recommended, but you may come across
situations where tuning is beneficial.
However, our recommendation for tuning is to stick with only adjusting
tcp.inflight_min.
Reducing tcp.inflight_stab too much can lead to upwards of a 20%
underutilization of the link and prevent the algorithm from properly adapting
to changing situations.  Increasing tcp.inflight_stab too much can lead to
an excessive packet buffering situation.
.El
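.Pp
A program might read one of these variables roughly as follows.
This is a sketch under stated assumptions: the helper name
.Fn tcp_mib_get_int
is hypothetical, the variable is assumed to be integer-valued, and the
full name is assumed to be the short name shown above prefixed with
.Li net.inet. ,
as in
.Va net.inet.tcp.tcbhashsize .
The preprocessor guard is only there so that the example still compiles on
systems lacking
.Xr sysctlbyname 3 .

```c
#include <sys/types.h>
#include <errno.h>
#include <stddef.h>
#if defined(__DragonFly__) || defined(__FreeBSD__)
#include <sys/sysctl.h>
#endif

/*
 * Read an integer-valued TCP MIB variable by its full name, for
 * example "net.inet.tcp.keepidle".  Returns 0 and fills *value on
 * success, or -1 with errno set on failure.  On systems where this
 * example does not use sysctlbyname(3), it fails with ENOSYS.
 */
int
tcp_mib_get_int(const char *name, int *value)
{
#if defined(__DragonFly__) || defined(__FreeBSD__)
	size_t len = sizeof(*value);

	if (sysctlbyname(name, value, &len, NULL, 0) < 0)
		return (-1);
	if (len != sizeof(*value)) {
		/* The variable is not a plain int after all. */
		errno = EINVAL;
		return (-1);
	}
	return (0);
#else
	(void)name;
	(void)value;
	errno = ENOSYS;
	return (-1);
#endif
}
```

Read-only variables such as tcp.pcbcount can be queried the same way;
changing a writable variable would additionally require passing a new value
and appropriate privileges, which this sketch does not attempt.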
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
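.Pp
One of the errors above can be provoked and reported as in the following
sketch.
The helper names
.Fn bind_tcp_port
and
.Fn tcp_error_hint
and the hint strings are hypothetical, introduced only for illustration;
binding the same port twice without any address-reuse option is assumed to
yield
.Er EADDRINUSE .

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Bind a new TCP socket to INADDR_ANY and *port (0 lets the system
 * choose).  On success, store the descriptor in *sp, update *port to
 * the port in use and return 0.  On failure, set *sp to -1 and return
 * the errno value of the failing call.
 */
int
bind_tcp_port(unsigned short *port, int *sp)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s, err;

	*sp = -1;
	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return (errno);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	sin.sin_port = htons(*port);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) < 0) {
		err = errno;	/* save before close(2) can change it */
		close(s);
		return (err);
	}
	*port = ntohs(sin.sin_port);
	*sp = s;
	return (0);
}

/* Map an errno value to a short hint based on the list above. */
const char *
tcp_error_hint(int err)
{
	switch (err) {
	case EISCONN:		return ("socket is already connected");
	case ENOBUFS:		return ("out of memory for internal structure");
	case ETIMEDOUT:		return ("dropped after excessive retransmissions");
	case ECONNRESET:	return ("peer forced the connection closed");
	case ECONNREFUSED:	return ("peer refused the connection");
	case EADDRINUSE:	return ("port already allocated");
	case EADDRNOTAVAIL:	return ("no interface has that address");
	case EAFNOSUPPORT:	return ("multicast address not allowed");
	default:		return ("error not listed in this section");
	}
}
```

Calling
.Fn bind_tcp_port
twice with the same non-zero port, while the first descriptor is still open,
is expected to make the second call return
.Er EADDRINUSE .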
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr setkey 8
.Rs
.%A V. Jacobson
.%A R. Braden
.%A D. Borman
.%T "TCP Extensions for High Performance"
.%O RFC 1323
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Nm
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .