.\" Copyright (c) 1983, 1991, 1993
.\"	The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"	This product includes software developed by the University of
.\"	California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     From: @(#)tcp.4	8.1 (Berkeley) 6/5/93
.\" $FreeBSD: src/share/man/man4/tcp.4,v 1.11.2.14 2002/12/29 16:35:38 schweikh Exp $
.\" $DragonFly: src/share/man/man4/tcp.4,v 1.9 2008/10/17 11:30:24 swildner Exp $
.\"
.Dd February 14, 1995
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.  It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq port addresses .
Thus, each address is composed
of an Internet address specifying the host and network, with
a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets utilizing the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.  By default
.Tn TCP
sockets are created active; to create a
passive socket the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.  Only
passive sockets may use the
.Xr accept 2
call to accept incoming connections.  Only active sockets may
use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.  This
technique, termed
.Dq wildcard addressing ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.  The
.Tn TCP
port may still be specified
at this time; if the port is not specified the system will assign one.
Once a connection has been established the socket's address is
fixed by the peer entity's location.  The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.  Normally
this address corresponds to the peer entity's network.
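.Pp
The wildcard binding described above can be sketched as follows.
This is a minimal illustration, not a reference implementation: the helper
name open_wildcard_listener and the backlog of 16 are assumptions made for
the example.
Passing a port of 0 lets the system assign one, which is then read back with
.Xr getsockname 2 .

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to the wildcard address.
 * A port of 0 asks the system to assign one; the port actually
 * bound is stored in *bound_port (host byte order).
 * Returns the listening descriptor, or -1 on error.
 * (Hypothetical helper; the backlog of 16 is an arbitrary choice.)
 */
int
open_wildcard_listener(unsigned short port, unsigned short *bound_port)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* all networks */
	sin.sin_port = htons(port);

	/* bind, then listen to make the socket passive, then read the port */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    listen(fd, 16) < 0 ||
	    getsockname(fd, (struct sockaddr *)&sin, &len) < 0) {
		close(fd);
		return (-1);
	}
	*bound_port = ntohs(sin.sin_port);
	return (fd);
}
```

The returned descriptor may then be passed to
.Xr accept 2 ;
it cannot be used with
.Xr connect 2 .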
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width TCP_NODELAYx
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender\- and receiver-TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.  The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.  The boolean
option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the sender-TCP
will set the
.Dq push
bit and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When the
.Dv TCP_NOPUSH
option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_SIGNATURE_ENABLE
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a DragonFlyBSD router deployment is to enable
DragonFlyBSD-based routers to interwork with Cisco equipment at peering
points.
Support for this feature conforms to RFC 2385.
Only IPv4 (AF_INET) sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcpsignature_compute: SADB lookup failed for %d.%d.%d.%d" .
.It Dv TCP_KEEPINIT
If a
.Tn TCP
connection cannot be established within a period of time,
.Tn TCP
will time out the connection attempt.
The
.Dv TCP_KEEPINIT
option specifies the number of milliseconds to wait
before the connection attempt times out.
The default value for
.Dv TCP_KEEPINIT
is tcp.keepinit milliseconds.
For accepted sockets, the
.Dv TCP_KEEPINIT
option value is inherited from the listening socket.
.It Dv TCP_KEEPIDLE
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPIDLE
option specifies the number of milliseconds before
.Tn TCP
will send the initial keepalive probe.
The default value for
.Dv TCP_KEEPIDLE
is tcp.keepidle milliseconds.
For accepted sockets,
the
.Dv TCP_KEEPIDLE
option value is inherited from the listening socket.
.It Dv TCP_KEEPINTVL
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPINTVL
option specifies the number of milliseconds to wait
before retransmitting a keepalive probe.
The default value for
.Dv TCP_KEEPINTVL
is tcp.keepintvl milliseconds.
For accepted sockets,
the
.Dv TCP_KEEPINTVL
option value is inherited from the listening socket.
.It Dv TCP_KEEPCNT
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPCNT
option specifies the maximum number of keepalive
probes to be sent before dropping the connection.
The default value for
.Dv TCP_KEEPCNT
is the value of tcp.keepcnt.
For accepted sockets,
the
.Dv TCP_KEEPCNT
option value is inherited from the listening socket.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
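.Pp
As a hedged illustration of the calls described above, the sketch below sets
.Dv TCP_NODELAY
and the keepalive options at the
.Dv IPPROTO_TCP
level.
The helper names set_nodelay and set_keepalive are assumptions made for this
example, and the timer values are passed through unchanged, so their unit is
whatever the running system defines (milliseconds according to this page).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Disable the small-packet gathering described under TCP_NODELAY. */
int
set_nodelay(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)));
}

/*
 * Enable keepalive probing and override the per-socket timers.
 * idle and intvl are handed straight to TCP_KEEPIDLE and
 * TCP_KEEPINTVL; count is the TCP_KEEPCNT probe limit.
 * Returns 0 on success, -1 on the first failing call.
 * (Hypothetical helper written for this example.)
 */
int
set_keepalive(int fd, int idle, int intvl, int count)
{
	int on = 1;

	/* SO_KEEPALIVE lives at the socket level, not the TCP level */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
		return (-1);
	return (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)));
}
```

Because accepted sockets inherit the keepalive option values from the
listening socket, a server can apply set_keepalive once to the listening
descriptor rather than to every accepted connection.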
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Sh MIB VARIABLES
The
.Nm
protocol implements a number of variables in the
.Li net.inet
branch of the
.Xr sysctl 3
MIB.
.Bl -tag -width TCPCTL_DO_RFC1644
.It Dv TCPCTL_DO_RFC1323
.Pq tcp.rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default true).
.It Dv TCPCTL_MSSDFLT
.Pq tcp.mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq tcp.sendspace
Maximum TCP send window.
.It Dv TCPCTL_RECVSPACE
.Pq tcp.recvspace
Maximum TCP receive window.
.It tcp.log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to SYN (connection establishment)
packets only.
A value of 2 results in any TCP packets to closed ports being logged.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It tcp.msl
The Maximum Segment Lifetime for a packet.
.It tcp.keepinit
Timeout for new, non-established TCP connections.
.It tcp.keepidle
Amount of time the connection should be idle before keepalive
probes (if enabled) are sent.
.It tcp.keepintvl
The interval between keepalive probes sent to remote machines.
After
tcp.keepcnt
(default 8) probes are sent with no response, the connection is dropped.
.It tcp.keepcnt
The maximum number of keepalive probes to be sent
before dropping the connection.
.It tcp.always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It tcp.icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It tcp.do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It tcp.blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It tcp.delayed_ack
Delay ACK to try and piggyback it onto a data packet.
.It tcp.delacktime
Maximum amount of time before a delayed ACK is sent.
.It tcp.newreno
Enable the TCP NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It tcp.path_mtu_discovery
Enables Path MTU Discovery.  PMTU Discovery is helpful for avoiding
IP fragmentation when transferring lots of data to the same client.
For web servers, where most of the connections are short and to
different clients, PMTU Discovery actually hurts performance due
to unnecessary retransmissions.  Turn this on only if most of your
TCP connections are long transfers or are repeatedly to the same
set of clients.
.It tcp.tcbhashsize
Size of the
.Tn TCP
control-block hashtable
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It tcp.pcbcount
Number of active protocol control blocks
(read-only).
.It tcp.syncookies
Determines whether or not SYN cookies should be generated for
outbound SYN-ACK packets.  SYN cookies are a great help during
SYN flood attacks, and are enabled by default.
.It tcp.isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It tcp.rexmit_{min,slop}
Adjust the retransmit timer calculation for TCP.  The slop is
typically added to the raw calculation to take into account
occasional variances that the SRTT (smoothed round trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.  While a number of TCP RFCs suggest a 1
second minimum, these RFCs tend to focus on streaming behavior
and fail to deal with the fact that a 1 second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.  For this reason we suggest changing the slop to 200ms and
setting the minimum to something out of the way, like 20ms,
which gives you an effective minimum of 200ms (similar to Linux).
.It tcp.inflight_enable
Enable
.Tn TCP
bandwidth delay product limiting.  An attempt will be made to calculate
the bandwidth delay product for each individual TCP connection and limit
the amount of inflight data being transmitted to avoid building up
unnecessary packets in the network.  This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large TCP windows.  In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.  Note that bandwidth delay product limiting only affects
the transmit side of a TCP connection.
.It tcp.inflight_debug
Enable debugging for the bandwidth delay product algorithm.  This may
default to on (1), so if you enable the algorithm you should probably also
disable debugging by setting this variable to 0.
.It tcp.inflight_min
This puts a lower bound on the bandwidth delay product window, in bytes.
A value of 1024 is typically used for debugging.  6000-16000 is more typical
in a production installation.  Setting this value too low may result in
slow ramp-up times for bursty connections.  Setting this value too high
effectively disables the algorithm.
.It tcp.inflight_max
This puts an upper bound on the bandwidth delay product window, in bytes.
This value should not generally be modified but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit to smooth data flow over a
network while still being able to specify huge internal TCP buffers.
.It tcp.inflight_stab
The bandwidth delay product algorithm requires a slightly larger window
than it otherwise calculates for stability.  This parameter determines the
extra window in units of one tenth of a maximal packet.  The default value
of 20 represents 2 maximal packets.  Reducing this value is not recommended
but you may
come across a situation with very slow links where the ping time
reduction of the default inflight code is not sufficient.  If this case
occurs you should first try reducing tcp.inflight_min and, if that does not
work, reduce both tcp.inflight_min and tcp.inflight_stab, trying values of
15, 10, or 5 for the latter.  Never use a value less than 5.  Reducing
tcp.inflight_stab can lead to upwards of a 20% underutilization of the link
as well as reducing the algorithm's ability to adapt to changing
situations and should only be done as a last resort.
.El
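.Pp
The arithmetic behind the retransmit and tcp.inflight_stab descriptions
above can be modelled roughly as follows.
This is an illustrative sketch only: the function names, the parameter
maxseg_bytes (standing in for the size of one maximal packet) and the exact
order of operations are assumptions, and the kernel's real computation may
differ in detail.

```c
/*
 * Illustrative model of the retransmit timer adjustment: add the
 * slop to the raw (SRTT-derived) timeout, then enforce the absolute
 * minimum.  All values are in milliseconds.
 * (Hypothetical function, not the kernel's code.)
 */
int
rexmit_timeout_ms(int raw_ms, int slop_ms, int min_ms)
{
	int t = raw_ms + slop_ms;

	return (t < min_ms ? min_ms : t);
}

/*
 * Extra stability window implied by tcp.inflight_stab, which is
 * expressed in tenths of a maximal packet.  maxseg_bytes is an
 * assumed stand-in for the size of one maximal packet.
 */
long
inflight_stab_bytes(int stab, long maxseg_bytes)
{
	return ((long)stab * maxseg_bytes / 10);
}
```

Under this model, a slop of 200 and a minimum of 20 yield a timeout of at
least 200ms for any non-negative raw value, which is the effective minimum
mentioned above, and a tcp.inflight_stab of 20 corresponds to two maximal
packets.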
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
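.Pp
A program may want to turn the error codes listed above into short
diagnostics.
The sketch below is one possible way to do so; the function name
tcp_error_hint and the message strings are assumptions made for this
example and simply paraphrase the list above.

```c
#include <errno.h>

/*
 * Map an errno value from a failed TCP socket operation to a short
 * description based on the ERRORS list.
 * (Hypothetical helper; the strings are illustrative.)
 */
const char *
tcp_error_hint(int err)
{
	switch (err) {
	case EISCONN:
		return ("socket is already connected");
	case ENOBUFS:
		return ("system out of memory for an internal structure");
	case ETIMEDOUT:
		return ("connection dropped after excessive retransmissions");
	case ECONNRESET:
		return ("remote peer forced the connection closed");
	case ECONNREFUSED:
		return ("remote peer refused the connection");
	case EADDRINUSE:
		return ("port is already allocated");
	case EADDRNOTAVAIL:
		return ("no network interface has that address");
	case EAFNOSUPPORT:
		return ("multicast address used with bind or connect");
	default:
		return ("error not listed for tcp");
	}
}
```

A caller would typically pass the value of errno immediately after a failed
.Xr connect 2
or
.Xr bind 2
call.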
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr setkey 8
.Rs
.%A V. Jacobson
.%A R. Braden
.%A D. Borman
.%T "TCP Extensions for High Performance"
.%O RFC 1323
.Re
.Rs
.%A "A. Heffernan"
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O "RFC 2385"
.Re
.Sh HISTORY
The
.Nm
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .
504.Bx 4.4 .
505