History log of /dragonfly/sys/netinet/tcp_output.c (Results 1 – 25 of 90)
Revision Date Author Comments
# b272101a 30-Oct-2023 Aaron LI <aly@aaronly.me>

Various minor whitespace cleanups

Accumulated along the way.


# 410f8572 22-Dec-2023 Aaron LI <aly@aaronly.me>

kernel: Replace the deprecated m_copy() with m_copym()


# 8a93af2a 08-Jul-2023 Matthew Dillon <dillon@apollo.backplane.com>

network - Remove host-order translations of ipv4 ip_off and ip_len

* Do not translate ip_off and ip_len to host order and then back again
  in the network stack. The fields are now left in network order.


Revision tags: v6.4.0, v6.4.0rc1, v6.5.0, v6.2.2, v6.2.1, v6.2.0, v6.3.0, v6.0.1, v6.0.0, v6.0.0rc1, v6.1.0, v5.8.3, v5.8.2, v5.8.1, v5.8.0, v5.9.0, v5.8.0rc1, v5.6.3
# c443c74f 22-Oct-2019 zrj <rimvydas.jasinskas@gmail.com>

<net/if_var.h>: Remove last explicit dependency on <sys/malloc.h>.

These kernel sources pass M_NOWAIT flag to m_copym() and friends.
Mark that it was for M_NOWAIT visibility.


# febebf83 20-Oct-2019 zrj <rimvydas.jasinskas@gmail.com>

kernel: Minor whitespace cleanup in few sources (part 2).

Separated from next.


Revision tags: v5.6.2, v5.6.1, v5.6.0, v5.6.0rc1, v5.7.0, v5.4.3, v5.4.2, v5.4.1, v5.4.0, v5.5.0, v5.4.0rc1, v5.2.2, v5.2.1, v5.2.0, v5.3.0, v5.2.0rc
# bff82488 20-Mar-2018 Aaron LI <aly@aaronly.me>

<net/if.h>: Do not include <net/if_var.h> for _KERNEL

* Clean up an ancient leftover: do not include <net/if_var.h> from <net/if.h>
  for kernel stuff.

* Adjust various files to include the necessary <net/if_var.h> header.

NOTE:
I have also tested removing the inclusion of <net/if.h> from <net/if_var.h>,
therefore added <net/if.h> inclusion for those files that need it but only
included <net/if_var.h>. For some files, the header inclusion orderings were
also adjusted.


# 755d70b8 21-Apr-2018 Sascha Wildner <saw@online.de>

Remove IPsec and related code from the system.

It was unmaintained ever since we inherited it from FreeBSD 4.8.

In fact, we had two implementations from that time: IPSEC and FAST_IPSEC.
FAST_IPSEC is the implementation to which FreeBSD has moved since, but
it didn't even build in DragonFly.

Fixes for dports have been committed to DeltaPorts.

Requested-by: dillon
Dports-testing-and-fixing: zrj


Revision tags: v5.0.2, v5.0.1, v5.0.0, v5.0.0rc2, v5.1.0, v5.0.0rc1, v4.8.1, v4.8.0, v4.6.2, v4.9.0, v4.8.0rc
# 76a9ffca 21-Dec-2016 Sepherosa Ziehau <sephe@dragonflybsd.org>

ip: Set mbuf hash for output IP packets.

This paves the way to implement Flow-Queue-Codel.


Revision tags: v4.6.1, v4.6.0, v4.6.0rc2, v4.6.0rc, v4.7.0
# 1bdd592f 30-May-2016 Sepherosa Ziehau <sephe@dragonflybsd.org>

tcp: Don't prematurely drop receiving-only connections.

If the connection was persistent and receiving-only, several (12)
sporadic device insufficient-buffer events would cause the connection
to be dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, the retransmission timer is
started. No one will stop this retransmission timer for a receiving-
only connection, so the retransmission timer is guaranteed to expire
and t_rxtshift is guaranteed to be increased. And t_rxtshift will not
be reset to 0, since no RTT measurement will be done for a receiving-
only connection. If this receiving-only connection lived long enough,
and it suffered 12 sporadic device insufficient-buffer events, i.e.
t_rxtshift >= 12, it would be dropped prematurely by the
retransmission timer.

We now assert that for data segments, SYNs or FINs either the rexmit
or persist timer was wired upon ENOBUFS. And we don't set the rexmit
timer for other cases, i.e. ENOBUFS upon ACKs.

And we no longer penalize the send window upon ENOBUFS.

Obtained-from: FreeBSD r300981
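The timer rule described in this commit message can be sketched as follows. This is an illustrative C fragment under assumed names, not the actual tcp_output() code: on ENOBUFS, only a segment carrying data, SYN or FIN should leave a (rexmit or persist) timer armed; a pure ACK must not, or a receiving-only connection accumulates t_rxtshift until it is dropped.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (names are ours): decide whether an ENOBUFS
 * return from the output path should arm the retransmission timer.
 */
static bool
should_arm_rexmt_on_enobufs(long payload_len, bool syn, bool fin,
    bool persist_timer_armed)
{
	/* Data, SYN or FIN: either rexmit or persist must be wired. */
	if (payload_len > 0 || syn || fin)
		return !persist_timer_armed;
	/* Pure ACK: leave timers alone; the peer will re-ACK as needed. */
	return false;
}
```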


Revision tags: v4.4.3, v4.4.2, v4.4.1, v4.4.0, v4.5.0, v4.4.0rc, v4.2.4, v4.3.1
# 01a777f0 17-Jul-2015 Matthew Dillon <dillon@apollo.backplane.com>

kernel - MFC 160de052b2 from FreeBSD (persist timer)

Avoid a situation where we do not set the persist timer after a zero-window
condition. If you send a 0-length packet, but there is data in the socket
buffer, and neither the rexmt nor the persist timer is already set, then
activate the persist timer.

Author: hiren <hiren@FreeBSD.org>
Taken-from: FreeBSD
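The condition this commit adds reduces to a single predicate. A minimal sketch with invented names, not the MFC'd code itself:

```c
#include <stdbool.h>

/*
 * Illustrative predicate: after sending a zero-length segment, if data
 * remains in the socket buffer and neither the rexmit nor the persist
 * timer is running, the persist timer must be started so the zero
 * window is eventually probed.
 */
static bool
must_start_persist(long sent_len, long sb_cc, bool rexmt_armed,
    bool persist_armed)
{
	return sent_len == 0 && sb_cc > 0 && !rexmt_armed && !persist_armed;
}
```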


Revision tags: v4.2.3, v4.2.1, v4.2.0, v4.0.6, v4.3.0, v4.2.0rc, v4.0.5, v4.0.4
# b5523eac 19-Feb-2015 Sascha Wildner <saw@online.de>

kernel: Move us to using M_NOWAIT and M_WAITOK for mbuf functions.

The main reason is that our having to use the MB_WAIT and MB_DONTWAIT
flags was a recurring issue when porting drivers from FreeBSD because
it tended to get forgotten and the code would compile anyway with the
wrong constants. And since MB_WAIT and MB_DONTWAIT ended up as ocflags
for an objcache_get() or objcache_reclaimlist() call (which use M_WAITOK
and M_NOWAIT), it was just one big converting back and forth with some
sanitization in between.

This commit allows M_* again for the mbuf functions and keeps the
sanitizing as it was before: when M_WAITOK is among the passed flags,
objcache functions will be called with M_WAITOK and when it is absent,
they will be called with M_NOWAIT. All other flags are scrubbed by the
MB_OCFLAG() macro, which does the same as the former MBTOM().

Approved-by: dillon
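The scrubbing described above is essentially a one-line macro. An illustrative reimplementation (the real macro is MB_OCFLAG() in the DragonFly tree; the flag values below are placeholders, not the kernel's):

```c
/* Placeholder flag values for illustration only. */
#define M_NOWAIT	0x0001
#define M_WAITOK	0x0002

/*
 * If M_WAITOK is among the caller's flags, call the objcache layer
 * with M_WAITOK; otherwise with M_NOWAIT. All other bits are dropped.
 */
#define MB_OCFLAG(flags)	(((flags) & M_WAITOK) ? M_WAITOK : M_NOWAIT)
```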


Revision tags: v4.0.3, v4.0.2
# b92efbf5 25-Dec-2014 Sepherosa Ziehau <sephe@dragonflybsd.org>

tcp: Enable path mtu discovery by default

This also eases the adoption of RFC 6864.


# 727ccde8 18-Dec-2014 Sepherosa Ziehau <sephe@dragonflybsd.org>

inet/inet6: Remove the v4-mapped address support

This greatly simplifies the code (even the IPv4 code) and avoids all kinds
of possible port theft.

INPCB:
- Nuke IN6P_IPV6_V6ONLY, which is always on after this commit.
- Change inp_vflag into inp_af (AF_INET or AF_INET6), since the socket
  is either IPv6 or IPv4, but never both. Set inpcb.inp_af in
  in_pcballoc() instead of in every pru_attach method. Add INP_ISIPV4()
  and INP_ISIPV6() macros to check the inpcb family (socket family and
  inpcb.inp_af are the same).
- Nuke the convoluted code in in_pcbbind() and in6_pcbbind() which was
  used to allow wildcard binding to accept IPv4 connections on IPv6
  wildcard-bound sockets.
- Nuke the code in in_pcblookup_pkthash() to match an IPv4 faddr with an
  IPv6 wildcard-bound socket.
- Nuke in6_mapped_{peeraddr,sockaddr,savefaddr}(); use in6_{setpeeraddr,
  setsockaddr,savefaddr}() directly.
- Nuke the v4-mapped address conversion functions.
- Don't allow binding to a v4-mapped address in in6_pcbbind().
- Don't allow connecting to a v4-mapped address in in6_pcbconnect().

TCP:
- Nuke the code in tcp_output() which takes care of the IP header TTL
  setting for v4-mapped IPv6 sockets.
- Don't allow binding to a v4-mapped address (through in6_pcbbind()).
- Don't allow connecting to a v4-mapped address and nuke the related code
  (PRUC_NAMALLOC etc.).
- Nuke the code (PRUC_FALLBACK etc.) to fall back to an IPv4 connection
  if the IPv6 connection fails, which is wrong.
- Nuke the code for v4-mapped IPv6 sockets in tcp6_soport().

UDP:
- Nuke the code for v4-mapped IPv6 sockets in udp_input() and udp_append().
- Don't allow binding to a v4-mapped address (through in6_pcbbind()).
- Don't allow connecting to a v4-mapped address.
- Don't allow sending datagrams to a v4-mapped address and nuke the related
  code in udp6_output().
- Nuke the code for v4-mapped IPv6 sockets in udp6_disconnect().

RIP:
- Don't allow sending packets to a v4-mapped address.
- Don't allow binding to a v4-mapped address.
- Don't allow connecting to a v4-mapped address.

Misc fixup:
- Don't force the rip pru_attach method to return 0. If in_pcballoc()
  fails, just return the error code.
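The family-check macros this commit introduces amount to a comparison on the new inp_af field. A sketch with a stand-in struct (the real macros operate on struct inpcb):

```c
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Stand-in for struct inpcb, carrying only the field we illustrate. */
struct inpcb_stub {
	int	inp_af;	/* AF_INET or AF_INET6, set once in in_pcballoc() */
};

#define INP_ISIPV4(inp)	((inp)->inp_af == AF_INET)
#define INP_ISIPV6(inp)	((inp)->inp_af == AF_INET6)
```

Because inp_af is set exactly once at allocation time, the socket is either IPv4 or IPv6 for its whole life, which is what makes removing the v4-mapped paths safe.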


Revision tags: v4.0.1, v4.0.0, v4.0.0rc3, v4.0.0rc2, v4.0.0rc, v4.1.0, v3.8.2
# b0c17823 16-Jul-2014 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Add feature to allow sendbuf_auto to decrease the buffer size

* sysctl net.inet.tcp.sendbuf_auto (defaults to 1) is now able to
  decrease the tcp buffer size as well as increase it.

* Inflight bwnd data is used to determine how much to decrease the
  buffer. Inflight is enabled by default. If you disable it
  (net.inet.tcp.inflight_enable=0), sendbuf_auto will not be able
  to adjust buffer sizes down.

* Set net.inet.tcp.sendbuf_min (default 32768) to set the floor for
  any downward adjustment.

* Set net.inet.tcp.sendbuf_auto=2 to disable the decrease feature.
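The floor behaviour described above can be sketched as a clamp. This is an invented helper under the commit's stated defaults, not the kernel's code: the downward target comes from the inflight bandwidth estimate, floored at net.inet.tcp.sendbuf_min.

```c
/* Default floor per the commit message (net.inet.tcp.sendbuf_min). */
#define TCP_SENDBUF_MIN_DEFAULT	32768L

/*
 * Illustrative clamp: never shrink the send buffer below sendbuf_min,
 * otherwise track the inflight bwnd estimate.
 */
static long
sendbuf_shrink_target(long inflight_bwnd, long sendbuf_min)
{
	return inflight_bwnd > sendbuf_min ? inflight_bwnd : sendbuf_min;
}
```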


Revision tags: v3.8.1, v3.6.3, v3.8.0, v3.8.0rc2, v3.9.0, v3.8.0rc, v3.6.2, v3.6.1, v3.6.0, v3.7.1, v3.6.0rc, v3.7.0
# cec73927 05-Sep-2013 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Change time_second to time_uptime for all expiration calculations

* Vet the entire kernel and change use cases for expiration calculations
  using time_second to use time_uptime instead.

* Protects these expiration calculations from step changes in the wall
  time, particularly needed for route table entries.

* Probably requires further variable type adjustments, but the use of
  time_uptime instead of time_second is highly unlikely to ever overrun
  any demotions to int still present.
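The point of the change can be shown with a tiny expiry check. Names are ours, not the kernel's: with a monotonic clock (time_uptime) the deadline only depends on elapsed time, so a step change in the wall clock (time_second) cannot fire or stall the expiry early.

```c
/*
 * Illustrative expiry check over a monotonic time base: an entry set at
 * set_at expires once ttl seconds of uptime have elapsed, regardless of
 * any wall-clock step in between.
 */
static int
uptime_expired(long now_uptime, long set_at_uptime, long ttl)
{
	return now_uptime - set_at_uptime >= ttl;
}
```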


Revision tags: v3.4.3
# 4cc8caef 08-Jun-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

altq: Implement two level "rough" priority queue for plain sub-queue

The "rough" part comes from two sources:
- Hardware queue could be deep, normally 512 or more even for GigE
- Round robin on the transmission queues is used by all of the multiple
  transmission queue capable hardware supported by DragonFly as of this
  commit.
These two sources affect the packet priority set by DragonFly.

DragonFly's "rough" priority queue has only two levels, i.e. high priority
and normal priority, which should be enough. Each queue has its own
header. The normal priority queue will be dequeued only when there are no
packets in the high priority queue. During enqueue, if the sub-queue is
full and the high priority queue length is less than half of the sub-
queue length (both packet count and byte count), drop-head will be
applied on the normal priority queue.

The M_PRIO mbuf flag is added to mark that the mbuf is destined for the
high priority queue. Currently TCP uses it to prioritize SYN, SYN|ACK,
and pure ACK w/o FIN and RST. This behaviour can be turned off via
net.inet.tcp.prio_synack, which is on by default.

The performance improvement!

The test environment:
All three boxes are using Intel i7-2600 w/ HT enabled

              +-----+
              |     |
   +->- emx1  |  B  | TCP_MAERTS
   |          |     |
+-----+       +-----+
|     |
|  A  | bnx0 ---+
|     |
+-----+       +-----+
   |          |     |
   +-<- emx1  |  C  | TCP_STREAM/TCP_RR
              |     |
              +-----+

A's kernel has this commit compiled. bnx0 has all four transmission
queues enabled. For bnx0, the hardware's transmission queue round-robin
is on TSO segment boundary.

Some baseline measurements:
B<--A TCP_MAERTS (raw stats) (128 clients): 984 Mbps
    (tcp_stream -H A -l 15 -i 128 -r)
C-->A TCP_STREAM (128 clients): 942 Mbps (tcp_stream -H A -l 15 -i 128)
C-->A TCP_CC (768 clients): 221199 conns/s (tcp_cc -H A -l 15 -i 768)

To effectively measure TCP_CC, the prefix route's MSL is changed to
10ms: route change 10.1.0.0/24 -msl 10

All stats gathered in the following measurements are below the baseline
measurements (well, they should be).

C-->A TCP_CC improvement, while the B<--A TCP_MAERTS test is running:
                        TCP_MAERTS(raw)   TCP_CC
TSO     prio_synack=1   948 Mbps          15988 conns/s
TSO     prio_synack=0   965 Mbps           8867 conns/s
non-TSO prio_synack=1   943 Mbps          18128 conns/s
non-TSO prio_synack=0   959 Mbps          11371 conns/s

* 80% TCP_CC performance improvement w/ TSO and 60% w/o TSO!

C-->A TCP_STREAM improvement, while the B<--A TCP_MAERTS test is running:
                        TCP_MAERTS(raw)   TCP_STREAM
TSO     prio_synack=1   969 Mbps          920 Mbps
TSO     prio_synack=0   969 Mbps          865 Mbps
non-TSO prio_synack=1   969 Mbps          920 Mbps
non-TSO prio_synack=0   969 Mbps          879 Mbps

* 6% TCP_STREAM performance improvement w/ TSO and 4% w/o TSO.
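The drop-head admission rule described above can be sketched as a predicate. Illustrative C with invented names, not the altq code: a high-priority packet may displace normal-priority packets only when the sub-queue is full and the high-priority queue is under half of the sub-queue's limits in both packet and byte count.

```c
#include <stdbool.h>

/*
 * Illustrative enqueue policy: return true when a high-priority packet
 * should be admitted by applying drop-head to the normal-priority queue.
 */
static bool
hiprio_may_drophead(bool subq_full, int hi_pkts, long hi_bytes,
    int subq_pkt_limit, long subq_byte_limit)
{
	return subq_full &&
	    hi_pkts < subq_pkt_limit / 2 &&
	    hi_bytes < subq_byte_limit / 2;
}
```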


# dc71b7ab 31-May-2013 Justin C. Sherrill <justin@shiningsilence.com>

Correct BSD License clause numbering from 1-2-4 to 1-2-3.

Apparently everyone's doing it:
http://svnweb.freebsd.org/base?view=revision&revision=251069

Submitted-by: "Eitan Adler" <lists at eitanadler.com>


Revision tags: v3.4.2
# 2702099d 06-May-2013 Justin C. Sherrill <justin@shiningsilence.com>

Remove advertising clause from all that isn't contrib or userland bin.

By: Eitan Adler <lists@eitanadler.com>


# 5337421c 02-May-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

netisr: Inline netisr_cpuport() and netisr_curport()

These two functions do nothing more than return a pointer to an
element in the array.

Per our header file naming convention, put these two functions in
net/netisr2.h
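Reduced to its essence, the function being inlined looks like this. The array and names are stand-ins for the netisr per-CPU port table; the point is that a plain array lookup is cheap enough to belong in a header as a static inline.

```c
#define NCPUS_STUB	4

/* Stand-in for the per-CPU netisr port table. */
static int netisr_ports_stub[NCPUS_STUB];

/* Illustrative static inline: just returns a pointer into the array. */
static inline int *
netisr_cpuport_stub(int cpu)
{
	return &netisr_ports_stub[cpu];
}
```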


# ec7f7fc8 28-Apr-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

netisr: Function renaming; no functional changes

This cleans up code for keeping input packets' hash instead of masking
the hash with ncpus2_mask. netisr_hashport(), which maps a packet hash
to a netisr port, will be added soon.


Revision tags: v3.4.0, v3.4.1, v3.4.0rc, v3.5.0
# 6999cd81 26-Feb-2013 Matthew Dillon <dillon@apollo.backplane.com>

kernel - Beef up lwkt_dropmsg() API and fix deadlock in so_async_rcvd*()

* Beef up the lwkt_dropmsg() API. The API now conditionally returns
  success (0) or an error (ENOENT).

* so_pru_rcvd_async() improperly calls lwkt_sendmsg() with a spinlock
  held. This is not legal. Hack up lwkt_sendmsg() a bit to resolve.


# d3d26ea5 23-Jan-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

tcp: Add comment about "fairsend"


# 2fb3a851 17-Jan-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

tcp: Improve sender-sender and sender-receiver fairness on the same netisr

Yield to other senders or receivers on the same netisr if the current TCP
stream has sent a certain number of segments (currently 4) and is going to
burst more segments. sysctl net.inet.tcp.fairsend can be used to tune
how many segments are allowed to burst. For TSO capable devices, their
TSO aggregation size limit could also affect the number of segments
allowed to burst. Setting net.inet.tcp.fairsend to 0 allows a single TCP
stream to burst as much as it wants (the old TCP sender behaviour).

"Fairsend" is performed at the places that do not affect segment sending
during congestion control:
- User requested output path
- ACK input path

Measured improvement in the following setup:

+---+             +---+
|   |<------------| B |
|   |             +---+
| A |
|   |             +---+
|   |------------>| C |
+---+             +---+

A (i7-2600, w/ HT enabled), 82571EB
B (e3-1230, w/ HT enabled), 82574L
C (e3-1230, w/ HT enabled), 82574L
The performance stats are gathered from 'systat -if 1'

When A runs 8 TCP senders to C and 8 TCP receivers from B, the sending
performance is the same ~975Mbps; however, the receiving performance
before this commit stumbles between 670Mbps and 850Mbps, while w/
"fairsend" the receiving performance stays at 981Mbps.

When A runs 16 TCP senders to C and 16 TCP receivers from B, the sending
performance is the same ~975Mbps; however, the receiving performance
before this commit goes from 960Mbps to 980Mbps, while w/ "fairsend" the
receiving performance stays stably at 981Mbps.

When there are more senders and receivers running on A, there is no
noticeable performance difference on either sending or receiving between
non-"fairsend" and "fairsend", because senders are no longer able to do
continuous large bursts.

"Fairsend" also improves Jain's fairness index between various numbers of
senders (8 ~ 128) a little bit (sending-only tests).
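The yield decision described above is a small predicate. An illustrative sketch with invented names, not the actual tcp_output() code: once a stream has sent fairsend segments (default 4 per this commit) and still has more to burst, it yields the netisr; fairsend == 0 disables yielding entirely.

```c
#include <stdbool.h>

/*
 * Illustrative "fairsend" check: yield to other senders/receivers on
 * the same netisr after bursting `fairsend` segments with more pending.
 */
static bool
fairsend_should_yield(int segs_sent, bool more_to_send, int fairsend)
{
	return fairsend > 0 && segs_sent >= fairsend && more_to_send;
}
```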


# e41e61d5 16-Jan-2013 Sepherosa Ziehau <sephe@dragonflybsd.org>

tcp/tso: Add per-device TSO aggregation size limit

- Prevent possible TSO large bursts when they are inappropriate (plenty of
  >24 segment bursts were observed, even when 32 parallel sending TCP
  streams are running on the same GigE NIC).
  TSO large bursts have the following drawbacks on a single TX queue,
  even on devices that are capable of multiple TX queues:
  o Delay other senders' packet transmission quite a lot.
  o Have a negative effect on TCP receivers, which send ACKs.
  o Cause buffer bloat in software sending queues, whose upper limit is
    based on "packet count".
  o Packet scheduler's decisions could be less effective.
  On the other hand, TSO large bursts could improve CPU usage.
- Improve fairness between multiple TX queues on devices that are capable
  of multiple TX queues but only fetch data on TSO large packet boundaries
  instead of TCP segment boundaries.

Drivers can supply their own TSO aggregation size limit. If a driver
does not set it, the default value is 6000 (4 segments if the MTU is
1500). The default value increases CPU usage a little bit: on an i7-2600
w/ HT enabled, with a single TCP sending stream, CPU usage increases
from 14%~17% to 17%~20%.

Users can configure the TSO aggregation size limit by using ifconfig(8):
ifconfig ifaceX tsolen _n_
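The limit amounts to a clamp on the bytes handed to the NIC per TSO burst. An invented helper under the commit's stated default, not the kernel's code: the burst is capped at the device tsolen (default 6000 bytes, ~4 segments at MTU 1500) rather than the much larger frame a TSO engine could otherwise aggregate.

```c
/*
 * Illustrative TSO burst clamp: cap the bytes aggregated per burst at
 * the device's tsolen, falling back to the default of 6000 when the
 * driver did not set one.
 */
static long
tso_burst_limit(long want_bytes, long dev_tsolen)
{
	long limit = dev_tsolen > 0 ? dev_tsolen : 6000;

	return want_bytes < limit ? want_bytes : limit;
}
```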


# 4f483122 07-Jan-2013 Sascha Wildner <saw@online.de>

kernel/tcp_{input,output}: Remove some unused variables.

