.\" Copyright (c) 1983, 1991, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
.\" $FreeBSD: src/share/man/man4/tcp.4,v 1.11.2.14 2002/12/29 16:35:38 schweikh Exp $
.\" $DragonFly: src/share/man/man4/tcp.4,v 1.9 2008/10/17 11:30:24 swildner Exp $
.\"
.Dd February 14, 1995
.Dt TCP 4
.Os
.Sh NAME
.Nm tcp
.Nd Internet Transmission Control Protocol
.Sh SYNOPSIS
.In sys/types.h
.In sys/socket.h
.In netinet/in.h
.Ft int
.Fn socket AF_INET SOCK_STREAM 0
.Sh DESCRIPTION
The
.Tn TCP
protocol provides reliable, flow-controlled, two-way
transmission of data.
It is a byte-stream protocol used to
support the
.Dv SOCK_STREAM
abstraction.
.Tn TCP
uses the standard
Internet address format and, in addition, provides a per-host
collection of
.Dq port addresses .
Thus, each address is composed
of an Internet address specifying the host and network, with
a specific
.Tn TCP
port on the host identifying the peer entity.
.Pp
Sockets using the
.Tn TCP
protocol are either
.Dq active
or
.Dq passive .
Active sockets initiate connections to passive
sockets.
By default,
.Tn TCP
sockets are created active; to create a
passive socket, the
.Xr listen 2
system call must be used
after binding the socket with the
.Xr bind 2
system call.
Only
passive sockets may use the
.Xr accept 2
call to accept incoming connections.
Only active sockets may
use the
.Xr connect 2
call to initiate connections.
.Pp
Passive sockets may
.Dq underspecify
their location to match
incoming connection requests from multiple networks.
This
technique, termed
.Dq wildcard addressing ,
allows a single
server to provide service to clients on multiple networks.
To create a socket which listens on all networks, the Internet
address
.Dv INADDR_ANY
must be bound.
The
.Tn TCP
port may still be specified
at this time; if the port is not specified, the system will assign one.
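.Pp
The following sketch illustrates the active and passive roles described above.
It binds a passive socket to the wildcard address
.Dv INADDR_ANY
with port 0, so that the system assigns the port, and then connects an
active socket to it over the loopback address.
The helper names
.Fn open_wildcard_listener
and
.Fn connect_loopback
are hypothetical, chosen for illustration only, and the backlog of 5 passed to
.Xr listen 2
is an arbitrary choice.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Create a passive TCP socket bound to INADDR_ANY (all local
 * addresses) and to a port picked by the system (port 0).
 * On success, return the listening descriptor and store the
 * assigned port, in host byte order, in *port_out.
 * Return -1 on failure.
 */
int
open_wildcard_listener(unsigned short *port_out)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_ANY);	/* wildcard address */
	sin.sin_port = htons(0);			/* system assigns a port */

	/* bind(2) first; listen(2) then makes the socket passive. */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1 ||
	    listen(s, 5) == -1 ||
	    getsockname(s, (struct sockaddr *)&sin, &len) == -1) {
		close(s);
		return (-1);
	}
	*port_out = ntohs(sin.sin_port);
	return (s);
}

/*
 * Create an active TCP socket and connect(2) it to the given port
 * on the loopback address.  Return the connected descriptor, or -1.
 */
int
connect_loopback(unsigned short port)
{
	struct sockaddr_in sin;
	int s;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = htons(port);

	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
		close(s);
		return (-1);
	}
	return (s);
}
```
.Ed
.Pp
Calling
.Fn getsockname
on the listening socket after
.Xr bind 2
is what reveals the system-assigned port.
A server would then call
.Xr accept 2
on the listening descriptor to obtain a socket for each connection
initiated by an active peer such as the one created by
.Fn connect_loopback .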
.Pp
Once a connection has been established, the socket's address is
fixed by the peer entity's location.
The address assigned to the
socket is the address associated with the network interface
through which packets are being transmitted and received.
Normally,
this address corresponds to the peer entity's network.
.Pp
.Tn TCP
supports a number of socket options which can be set with
.Xr setsockopt 2
and tested with
.Xr getsockopt 2 :
.Bl -tag -width TCP_NODELAYx
.It Dv TCP_NODELAY
Under most circumstances,
.Tn TCP
sends data when it is presented;
when outstanding data has not yet been acknowledged, it gathers
small amounts of output to be sent in a single packet once
an acknowledgement is received.
For a small number of clients, such as window systems
that send a stream of mouse events which receive no replies,
this packetization may cause significant delays.
The boolean option
.Dv TCP_NODELAY
defeats this algorithm.
.It Dv TCP_MAXSEG
By default, a sender\- and receiver-TCP
will negotiate among themselves to determine the maximum segment size
to be used for each connection.
The
.Dv TCP_MAXSEG
option allows the user to determine the result of this negotiation,
and to reduce it if desired.
.It Dv TCP_NOOPT
.Tn TCP
usually sends a number of options in each packet, corresponding to
various
.Tn TCP
extensions which are provided in this implementation.
The boolean
option
.Dv TCP_NOOPT
is provided to disable
.Tn TCP
option use on a per-connection basis.
.It Dv TCP_NOPUSH
By convention, the sender-TCP
will set the
.Dq push
bit and begin transmission immediately (if permitted) at the end of
every user call to
.Xr write 2
or
.Xr writev 2 .
When the
.Dv TCP_NOPUSH
option is set to a non-zero value,
.Tn TCP
will delay sending any data at all until either the socket is closed,
or the internal send buffer is filled.
.It Dv TCP_SIGNATURE_ENABLE
This option enables the use of MD5 digests (also known as TCP-MD5)
on writes to the specified socket.
In the current release, only outgoing traffic is digested;
digests on incoming traffic are not verified.
The current default behavior for the system is to respond to a system
advertising this option with TCP-MD5; this may change.
.Pp
One common use for this in a DragonFly router deployment is to enable
DragonFly-based routers to interwork with Cisco equipment at peering points.
Support for this feature conforms to RFC 2385.
Only IPv4 (AF_INET) sessions are supported.
.Pp
In order for this option to function correctly, it is necessary for the
administrator to add a tcp-md5 key entry to the system's security
associations database (SADB) using the
.Xr setkey 8
utility.
This entry must have an SPI of 0x1000 and can therefore only be specified
on a per-host basis at this time.
.Pp
If an SADB entry cannot be found for the destination, the outgoing traffic
will have an invalid digest option prepended, and the following error message
will be visible on the system console:
.Em "tcpsignature_compute: SADB lookup failed for %d.%d.%d.%d" .
.It Dv TCP_KEEPINIT
If a
.Tn TCP
connection cannot be established within a period of time,
.Tn TCP
will time out the connection attempt.
The
.Dv TCP_KEEPINIT
option specifies the number of milliseconds to wait
before the connection attempt times out.
The default value for
.Dv TCP_KEEPINIT
is the value of the tcp.keepinit MIB variable.
For accepted sockets, the
.Dv TCP_KEEPINIT
option value is inherited from the listening socket.
.It Dv TCP_KEEPIDLE
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPIDLE
option specifies the number of milliseconds before
.Tn TCP
will send the initial keepalive probe.
The default value for
.Dv TCP_KEEPIDLE
is the value of the tcp.keepidle MIB variable.
For accepted sockets,
the
.Dv TCP_KEEPIDLE
option value is inherited from the listening socket.
.It Dv TCP_KEEPINTVL
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPINTVL
option specifies the number of milliseconds to wait
before retransmitting a keepalive probe.
The default value for
.Dv TCP_KEEPINTVL
is the value of the tcp.keepintvl MIB variable.
For accepted sockets,
the
.Dv TCP_KEEPINTVL
option value is inherited from the listening socket.
.It Dv TCP_KEEPCNT
When the
.Dv SO_KEEPALIVE
option is enabled,
.Tn TCP
sends a keepalive probe to the remote system of a connection
that has been idle for a period of time.
The
.Dv TCP_KEEPCNT
option specifies the maximum number of keepalive
probes to be sent before dropping the connection.
The default value for
.Dv TCP_KEEPCNT
is the value of the tcp.keepcnt MIB variable (a probe count, not a time).
For accepted sockets,
the
.Dv TCP_KEEPCNT
option value is inherited from the listening socket.
.El
.Pp
The option level for the
.Xr setsockopt 2
call is the protocol number for
.Tn TCP ,
available from
.Xr getprotobyname 3 ,
or
.Dv IPPROTO_TCP .
All options are declared in
.In netinet/tcp.h .
.Pp
Options at the
.Tn IP
transport level may be used with
.Tn TCP ;
see
.Xr ip 4 .
Incoming connection requests that are source-routed are noted,
and the reverse source route is used in responding.
.Sh MIB VARIABLES
The
.Nm
protocol implements a number of variables in the
.Li net.inet
branch of the
.Xr sysctl 3
MIB.
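.Pp
Because the per-socket keepalive options default to the values of the
tcp.keepidle, tcp.keepintvl and tcp.keepcnt variables, a program that needs
different timing for a single connection can override them with
.Xr setsockopt 2
instead of changing the system-wide settings.
The sketch below does this, together with enabling
.Dv TCP_NODELAY ,
using the
.Dv IPPROTO_TCP
option level for the
.Tn TCP
options and
.Dv SOL_SOCKET
for
.Dv SO_KEEPALIVE .
The helper names
.Fn tune_tcp_socket
and
.Fn get_tcp_int_option
are hypothetical, and the timing values are arbitrary examples expressed
in the units documented above; other systems may interpret these options
in different units.
.Bd -literal -offset indent
```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Enable TCP_NODELAY and keepalives with per-connection timing on
 * an existing TCP socket.  Return 0 on success, -1 on failure.
 * The timing values are illustrative, not recommendations.
 */
int
tune_tcp_socket(int s)
{
	int on = 1;
	int idle = 30000;	/* TCP_KEEPIDLE: idle ms before first probe */
	int intvl = 5000;	/* TCP_KEEPINTVL: ms between probes */
	int cnt = 4;		/* TCP_KEEPCNT: unanswered probes before drop */

	/* SO_KEEPALIVE is a socket-level option ... */
	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);

	/* ... while the TCP options use the IPPROTO_TCP level. */
	if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl,
	    sizeof(intvl)) == -1)
		return (-1);
	if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1)
		return (-1);
	return (0);
}

/* Read back an integer-valued TCP option with getsockopt(2). */
int
get_tcp_int_option(int s, int name, int *value)
{
	socklen_t len = sizeof(*value);

	return (getsockopt(s, IPPROTO_TCP, name, value, &len));
}
```
.Ed
.Pp
A caller would typically apply
.Fn tune_tcp_socket
to a descriptor returned by
.Xr socket 2
or
.Xr accept 2 ,
and could use
.Fn get_tcp_int_option
with, for example,
.Dv TCP_KEEPCNT
to confirm the setting.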
.Bl -tag -width TCPCTL_DO_RFC1644
.It Dv TCPCTL_DO_RFC1323
.Pq tcp.rfc1323
Implement the window scaling and timestamp options of RFC 1323
(default true).
.It Dv TCPCTL_MSSDFLT
.Pq tcp.mssdflt
The default value used for the maximum segment size
.Pq Dq MSS
when no advice to the contrary is received from MSS negotiation.
.It Dv TCPCTL_SENDSPACE
.Pq tcp.sendspace
Maximum TCP send window.
.It Dv TCPCTL_RECVSPACE
.Pq tcp.recvspace
Maximum TCP receive window.
.It tcp.log_in_vain
Log any connection attempts to ports where there is not a socket
accepting connections.
A value of 1 limits the logging to SYN (connection establishment)
packets only.
A value of 2 logs all TCP packets sent to closed ports.
Any other value disables the logging
(default is 0, i.e., the logging is disabled).
.It tcp.msl
The Maximum Segment Lifetime for a packet.
.It tcp.keepinit
Timeout for new, non-established TCP connections.
.It tcp.keepidle
Amount of time the connection should be idle before keepalive
probes (if enabled) are sent.
.It tcp.keepintvl
The interval between keepalive probes sent to remote machines.
After
tcp.keepcnt
(default 8) probes are sent with no response, the connection is dropped.
.It tcp.keepcnt
The maximum number of keepalive probes to be sent
before dropping the connection.
.It tcp.always_keepalive
Assume that
.Dv SO_KEEPALIVE
is set on all
.Tn TCP
connections; the kernel will
periodically send a packet to the remote host to verify the connection
is still up.
.It tcp.icmp_may_rst
Certain
.Tn ICMP
unreachable messages may abort connections in
.Tn SYN-SENT
state.
.It tcp.do_tcpdrain
Flush packets in the
.Tn TCP
reassembly queue if the system is low on mbufs.
.It tcp.blackhole
If enabled, disable sending of RST when a connection is attempted
to a port where there is not a socket accepting connections.
See
.Xr blackhole 4 .
.It tcp.delayed_ack
Delay ACKs to try to piggyback them onto data packets.
.It tcp.delacktime
Maximum amount of time before a delayed ACK is sent.
.It tcp.newreno
Enable the TCP NewReno Fast Recovery algorithm,
as described in RFC 2582.
.It tcp.path_mtu_discovery
Enables Path MTU Discovery.
PMTU Discovery is helpful for avoiding
IP fragmentation when transferring lots of data to the same client.
For web servers, where most of the connections are short and to
different clients, PMTU Discovery actually hurts performance due
to unnecessary retransmissions.
Turn this on only if most of your
TCP connections are long transfers or are repeatedly to the same
set of clients.
.It tcp.tcbhashsize
Size of the
.Tn TCP
control-block hashtable
(read-only).
This may be tuned using the kernel option
.Dv TCBHASHSIZE
or by setting
.Va net.inet.tcp.tcbhashsize
in the
.Xr loader 8 .
.It tcp.pcbcount
Number of active protocol control blocks
(read-only).
.It tcp.syncookies
Determines whether or not SYN cookies should be generated for
outbound SYN-ACK packets.
SYN cookies are a great help during
SYN flood attacks, and are enabled by default.
.It tcp.isn_reseed_interval
The interval (in seconds) specifying how often the secret data used in
RFC 1948 initial sequence number calculations should be reseeded.
By default, this variable is set to zero, indicating that
no reseeding will occur.
Reseeding should not be necessary, and will break
.Dv TIME_WAIT
recycling for a few minutes.
.It tcp.rexmit_{min,slop}
Adjust the retransmit timer calculation for TCP.
The slop is
typically added to the raw calculation to take into account
occasional variances that the SRTT (smoothed round trip time)
is unable to accommodate, while the minimum specifies an
absolute minimum.
While a number of TCP RFCs suggest a 1-second
minimum, these RFCs tend to focus on streaming behavior
and fail to deal with the fact that a 1-second minimum has severe
detrimental effects over lossy interactive connections, such
as an 802.11b wireless link, and over very fast but lossy
connections for those cases not covered by the fast retransmit
code.
For this reason, we suggest changing the slop to 200ms and
setting the minimum to something out of the way, like 20ms,
which gives you an effective minimum of 200ms (similar to Linux).
.It tcp.inflight_enable
Enable
.Tn TCP
bandwidth delay product limiting.
An attempt will be made to calculate
the bandwidth delay product for each individual TCP connection and limit
the amount of inflight data being transmitted to avoid building up
unnecessary packets in the network.
This option is recommended if you
are serving a lot of data over connections with high bandwidth-delay
products, such as modems, GigE links, and fast long-haul WANs, and/or
you have configured your machine to accommodate large TCP windows.
In such
situations, without this option, you may experience high interactive
latencies or packet loss due to the overloading of intermediate routers
and switches.
Note that bandwidth delay product limiting only affects
the transmit side of a TCP connection.
.It tcp.inflight_debug
Enable debugging for the bandwidth delay product algorithm.
This may
default to on (1), so if you enable the algorithm you should probably also
disable debugging by setting this variable to 0.
.It tcp.inflight_min
This puts a lower bound on the bandwidth delay product window, in bytes.
A value of 1024 is typically used for debugging.
Values of 6000-16000 are more typical
in a production installation.
Setting this value too low may result in
slow ramp-up times for bursty connections.
Setting this value too high
effectively disables the algorithm.
.It tcp.inflight_max
This puts an upper bound on the bandwidth delay product window, in bytes.
This value should not generally be modified but may be used to set a
global per-connection limit on queued data, potentially allowing you to
intentionally set a less than optimum limit to smooth data flow over a
network while still being able to specify huge internal TCP buffers.
.It tcp.inflight_stab
This value stabilizes the bwnd (write window) calculation at high speeds
by increasing the bandwidth calculation in 1/10% increments.
The default
value of 50 represents a +5% increase.
In addition, bwnd is further increased
by a fixed 2*maxseg bytes to stabilize the algorithm at low speeds.
Changing the stab value is not recommended, but you may come across
situations where tuning is beneficial.
However, our recommendation for tuning is to stick with only adjusting
tcp.inflight_min.
Reducing tcp.inflight_stab too much can lead to upwards of a 20%
underutilization of the link and prevent the algorithm from properly adapting
to changing situations.
Increasing tcp.inflight_stab too much can lead to
an excessive packet buffering situation.
.El
.Sh ERRORS
A socket operation may fail with one of the following errors returned:
.Bl -tag -width Er
.It Bq Er EISCONN
when trying to establish a connection on a socket which
already has one;
.It Bq Er ENOBUFS
when the system runs out of memory for
an internal data structure;
.It Bq Er ETIMEDOUT
when a connection was dropped
due to excessive retransmissions;
.It Bq Er ECONNRESET
when the remote peer
forces the connection to be closed;
.It Bq Er ECONNREFUSED
when the remote
peer actively refuses connection establishment (usually because
no process is listening to the port);
.It Bq Er EADDRINUSE
when an attempt
is made to create a socket with a port which has already been
allocated;
.It Bq Er EADDRNOTAVAIL
when an attempt is made to create a
socket with a network address for which no network interface
exists;
.It Bq Er EAFNOSUPPORT
when an attempt is made to bind or connect a socket to a multicast
address.
.El
.Sh SEE ALSO
.Xr getsockopt 2 ,
.Xr socket 2 ,
.Xr sysctl 3 ,
.Xr blackhole 4 ,
.Xr inet 4 ,
.Xr intro 4 ,
.Xr ip 4 ,
.Xr setkey 8
.Rs
.%A V. Jacobson
.%A R. Braden
.%A D. Borman
.%T "TCP Extensions for High Performance"
.%O RFC 1323
.Re
.Rs
.%A A. Heffernan
.%T "Protection of BGP Sessions via the TCP MD5 Signature Option"
.%O RFC 2385
.Re
.Sh HISTORY
The
.Nm
protocol appeared in
.Bx 4.2 .
The RFC 1323 extensions for window scaling and timestamps were added
in
.Bx 4.4 .