#
a46d0ea5 |
| 03-Jun-2024 |
Jason Xing <kernelxing@tencent.com> |
tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
According to RFC 1213, we should also take CLOSE-WAIT sockets into consideration:
"tcpCurrEstab OBJECT-TYPE ... The number of TCP connect
tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
According to RFC 1213, we should also take CLOSE-WAIT sockets into consideration:
"tcpCurrEstab OBJECT-TYPE ... The number of TCP connections for which the current state is either ESTABLISHED or CLOSE- WAIT."
After this, CurrEstab counter will display the total number of ESTABLISHED and CLOSE-WAIT sockets.
The logic of counting When we increment the counter? a) if we change the state to ESTABLISHED. b) if we change the state from SYN-RECEIVED to CLOSE-WAIT.
When we decrement the counter? a) if the socket leaves ESTABLISHED and will never go into CLOSE-WAIT, say, on the client side, changing from ESTABLISHED to FIN-WAIT-1. b) if the socket leaves CLOSE-WAIT, say, on the server side, changing from CLOSE-WAIT to LAST-ACK.
Please note: there are two chances that old state of socket can be changed to CLOSE-WAIT in tcp_fin(). One is SYN-RECV, the other is ESTABLISHED. So we have to take care of the former case.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a535d594 |
| 30-May-2024 |
Jakub Kicinski <kuba@kernel.org> |
net: tls: fix marking packets as decrypted
For TLS offload we mark packets with skb->decrypted to make sure they don't escape the host without getting encrypted first. The crypto state lives in the
net: tls: fix marking packets as decrypted
For TLS offload we mark packets with skb->decrypted to make sure they don't escape the host without getting encrypted first. The crypto state lives in the socket, so it may get detached by a call to skb_orphan(). As a safety check - the egress path drops all packets with skb->decrypted and no "crypto-safe" socket.
The skb marking was added to sendpage only (and not sendmsg), because tls_device injected data into the TCP stack using sendpage. This special case was missed when sendpage got folded into sendmsg.
Fixes: c5c37af6ecad ("tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240530232607.82686-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
c084ebd7 |
| 09-May-2024 |
Matthieu Baerts (NGI0) <matttbe@kernel.org> |
tcp: socket option to check for MPTCP fallback to TCP
A way for an application to know if an MPTCP connection fell back to TCP is to use getsockopt(MPTCP_INFO) and look for errors. The issue with th
tcp: socket option to check for MPTCP fallback to TCP
A way for an application to know if an MPTCP connection fell back to TCP is to use getsockopt(MPTCP_INFO) and look for errors. The issue with this technique is that the same errors -- EOPNOTSUPP (IPv4) and ENOPROTOOPT (IPv6) -- are returned if there was a fallback, *or* if the kernel doesn't support this socket option. The userspace then has to look at the kernel version to understand what the errors mean.
It is not clean, and it doesn't take into account older kernels where the socket option has been backported. A cleaner way would be to expose this info to the TCP socket level. In case of MPTCP socket where no fallback happened, the socket options for the TCP level will be handled in MPTCP code, in mptcp_getsockopt_sol_tcp(). If not, that will be in TCP code, in do_tcp_getsockopt(). So MPTCP simply has to set the value 1, while TCP has to set 0.
If the socket option is not supported, one of these two errors will be reported: - EOPNOTSUPP (95 - Operation not supported) for MPTCP sockets - ENOPROTOOPT (92 - Protocol not available) for TCP sockets, e.g. on the socket received after an 'accept()', when the client didn't request to use MPTCP: this socket will be a TCP one, even if the listen socket was an MPTCP one.
With this new option, the kernel can return a clear answer to both "Is this kernel new enough to tell me the fallback status?" and "If it is new enough, is it currently a TCP or MPTCP socket?" questions, while not breaking the previous method.
Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://lore.kernel.org/r/20240509-upstream-net-next-20240509-mptcp-tcp_is_mptcp-v1-1-f846df999202@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
94062790 |
| 01-May-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
TCP_SYN_RECV state is really special, it is only used by cross-syn connections, mostly used by fuzzers.
In the following crash [1], syzbo
tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
TCP_SYN_RECV state is really special, it is only used by cross-syn connections, mostly used by fuzzers.
In the following crash [1], syzbot managed to trigger a divide by zero in tcp_rcv_space_adjust()
A socket makes the following state transitions, without ever calling tcp_init_transfer(), meaning tcp_init_buffer_space() is also not called.
TCP_CLOSE connect() TCP_SYN_SENT TCP_SYN_RECV shutdown() -> tcp_shutdown(sk, SEND_SHUTDOWN) TCP_FIN_WAIT1
To fix this issue, change tcp_shutdown() to not perform a TCP_SYN_RECV -> TCP_FIN_WAIT1 transition, which makes no sense anyway.
When tcp_rcv_state_process() later changes socket state from TCP_SYN_RECV to TCP_ESTABLISH, then look at sk->sk_shutdown to finally enter TCP_FIN_WAIT1 state, and send a FIN packet from a sane socket state.
This means tcp_send_fin() can now be called from BH context, and must use GFP_ATOMIC allocations.
[1] divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 5084 Comm: syz-executor358 Not tainted 6.9.0-rc6-syzkaller-00022-g98369dccd2f8 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:tcp_rcv_space_adjust+0x2df/0x890 net/ipv4/tcp_input.c:767 Code: e3 04 4c 01 eb 48 8b 44 24 38 0f b6 04 10 84 c0 49 89 d5 0f 85 a5 03 00 00 41 8b 8e c8 09 00 00 89 e8 29 c8 48 0f af c3 31 d2 <48> f7 f1 48 8d 1c 43 49 8d 96 76 08 00 00 48 89 d0 48 c1 e8 03 48 RSP: 0018:ffffc900031ef3f0 EFLAGS: 00010246 RAX: 0c677a10441f8f42 RBX: 000000004fb95e7e RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000027d4b11f R08: ffffffff89e535a4 R09: 1ffffffff25e6ab7 R10: dffffc0000000000 R11: ffffffff8135e920 R12: ffff88802a9f8d30 R13: dffffc0000000000 R14: ffff88802a9f8d00 R15: 1ffff1100553f2da FS: 00005555775c0380(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1155bf2304 CR3: 000000002b9f2000 CR4: 0000000000350ef0 Call Trace: <TASK> tcp_recvmsg_locked+0x106d/0x25a0 net/ipv4/tcp.c:2513 tcp_recvmsg+0x25d/0x920 net/ipv4/tcp.c:2578 inet6_recvmsg+0x16a/0x730 net/ipv6/af_inet6.c:680 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x109/0x280 net/socket.c:1068 ____sys_recvmsg+0x1db/0x470 net/socket.c:2803 ___sys_recvmsg net/socket.c:2845 [inline] do_recvmmsg+0x474/0xae0 net/socket.c:2939 __sys_recvmmsg net/socket.c:3018 [inline] __do_sys_recvmmsg net/socket.c:3041 [inline] __se_sys_recvmmsg net/socket.c:3034 [inline] __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7faeb6363db9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffcc1997168 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007faeb6363db9 RDX: 0000000000000001 RSI: 0000000020000bc0 RDI: 0000000000000005 RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000001c R10: 0000000000000122 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240501125448.896529-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
f3d93817 |
| 29-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
net: add <net/proto_memory.h>
Move some proto memory definitions out of <net/sock.h>
Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h>
Signed-off
net: add <net/proto_memory.h>
Move some proto memory definitions out of <net/sock.h>
Very few files need them, and following patch will include <net/hotdata.h> from <net/proto_memory.h>
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
dda4d96a |
| 29-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: move tcp_out_of_memory() to net/ipv4/tcp.c
tcp_out_of_memory() has a single caller: tcp_check_oom().
Following patch will also make sk_memory_allocated() not anymore visible from <net/sock.h>
tcp: move tcp_out_of_memory() to net/ipv4/tcp.c
tcp_out_of_memory() has a single caller: tcp_check_oom().
Following patch will also make sk_memory_allocated() not anymore visible from <net/sock.h> and <net/tcp.h>
Add const qualifier to sock argument of tcp_out_of_memory() and tcp_check_oom().
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
a86a0661 |
| 29-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
net: move sysctl_max_skb_frags to net_hotdata
sysctl_max_skb_frags is used in TCP and MPTCP fast paths, move it to net_hodata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google
net: move sysctl_max_skb_frags to net_hotdata
sysctl_max_skb_frags is used in TCP and MPTCP fast paths, move it to net_hodata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240429134025.1233626-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
5691276b |
| 25-Apr-2024 |
Jason Xing <kernelxing@tencent.com> |
rstreason: prepare for active reset
Like what we did to passive reset: only passing possible reset reason in each active reset path.
No functional changes.
Signed-off-by: Jason Xing <kernelxing@te
rstreason: prepare for active reset
Like what we did to passive reset: only passing possible reset reason in each active reset path.
No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
05ea4916 |
| 09-Apr-2024 |
Jon Maloy <jmaloy@redhat.com> |
tcp: add support for SO_PEEK_OFF socket option
When reading received messages from a socket with MSG_PEEK, we may want to read the contents with an offset, like we can do with pread/preadv() when re
tcp: add support for SO_PEEK_OFF socket option
When reading received messages from a socket with MSG_PEEK, we may want to read the contents with an offset, like we can do with pread/preadv() when reading files. Currently, it is not possible to do that.
In this commit, we add support for the SO_PEEK_OFF socket option for TCP, in a similar way it is done for Unix Domain sockets.
In the iperf3 log examples shown below, we can observe a throughput improvement of 15-20 % in the direction host->namespace when using the protocol splicer 'pasta' (https://passt.top). This is a consistent result.
pasta(1) and passt(1) implement user-mode networking for network namespaces (containers) and virtual machines by means of a translation layer between Layer-2 network interface and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo).
Received, pending TCP data to the container/guest is kept in kernel buffers until acknowledged, so the tool routinely needs to fetch new data from socket, skipping data that was already sent.
At the moment this is implemented using a dummy buffer passed to recvmsg(). With this change, we don't need a dummy buffer and the related buffer copy (copy_to_user()) anymore.
passt and pasta are supported in KubeVirt and libvirt/qemu.
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f SO_PEEK_OFF not supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s ----------------------------------------------------------- Server listening on 5201 (test #1) ----------------------------------------------------------- Accepted connection from 192.168.122.1, port 44822 [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 1.02 GBytes 8.78 Gbits/sec [ 5] 1.00-2.00 sec 1.06 GBytes 9.08 Gbits/sec [ 5] 2.00-3.00 sec 1.07 GBytes 9.15 Gbits/sec [ 5] 3.00-4.00 sec 1.10 GBytes 9.46 Gbits/sec [ 5] 4.00-5.00 sec 1.03 GBytes 8.85 Gbits/sec [ 5] 5.00-6.00 sec 1.10 GBytes 9.44 Gbits/sec [ 5] 6.00-7.00 sec 1.11 GBytes 9.56 Gbits/sec [ 5] 7.00-8.00 sec 1.07 GBytes 9.20 Gbits/sec [ 5] 8.00-9.00 sec 667 MBytes 5.59 Gbits/sec [ 5] 9.00-10.00 sec 1.03 GBytes 8.83 Gbits/sec [ 5] 10.00-10.04 sec 30.1 MBytes 6.36 Gbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate [ 5] 0.00-10.04 sec 10.3 GBytes 8.78 Gbits/sec receiver ----------------------------------------------------------- Server listening on 5201 (test #2) ----------------------------------------------------------- ^Ciperf3: interrupt - the server has terminated jmaloy@freyr:~/passt# logout [ perf record: Woken up 23 times to write data ] [ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ] jmaloy@freyr:~/passt$
jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f SO_PEEK_OFF supported by kernel.
jmaloy@freyr:~/passt# iperf3 -s ----------------------------------------------------------- Server listening on 5201 (test #1) ----------------------------------------------------------- Accepted connection from 192.168.122.1, port 52084 [ 5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098 [ ID] Interval Transfer Bitrate [ 5] 0.00-1.00 sec 1.32 GBytes 11.3 Gbits/sec [ 5] 1.00-2.00 sec 1.19 GBytes 10.2 Gbits/sec [ 5] 2.00-3.00 sec 1.26 GBytes 10.8 Gbits/sec [ 5] 3.00-4.00 sec 1.36 GBytes 11.7 Gbits/sec [ 5] 4.00-5.00 sec 1.33 GBytes 11.4 Gbits/sec [ 5] 5.00-6.00 sec 1.21 GBytes 10.4 Gbits/sec [ 5] 6.00-7.00 sec 1.31 GBytes 11.2 Gbits/sec [ 5] 7.00-8.00 sec 1.25 GBytes 10.7 Gbits/sec [ 5] 8.00-9.00 sec 1.33 GBytes 11.5 Gbits/sec [ 5] 9.00-10.00 sec 1.24 GBytes 10.7 Gbits/sec [ 5] 10.00-10.04 sec 56.0 MBytes 12.1 Gbits/sec - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate [ 5] 0.00-10.04 sec 12.9 GBytes 11.0 Gbits/sec receiver ----------------------------------------------------------- Server listening on 5201 (test #2) ----------------------------------------------------------- ^Ciperf3: interrupt - the server has terminated logout [ perf record: Woken up 20 times to write data ] [ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ] jmaloy@freyr:~/passt$
The perf record confirms this result. Below, we can observe that the CPU spends significantly less time in the function ____sys_recvmsg() when we have offset support.
Without offset support: ---------------------- jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \ -p ____sys_recvmsg -x --stdio -i perf.data | head -1 46.32% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
With offset support: ---------------------- jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \ -p ____sys_recvmsg -x --stdio -i perf.data | head -1 28.12% 0.00% passt.avx2 [kernel.vmlinux] [k] do_syscall_64 ____sys_recvmsg
Suggested-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240409152805.913891-1-jmaloy@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
9b9fd458 |
| 09-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: tweak tcp_sock_write_txrx size assertion
I forgot 32bit arches might have 64bit alignment for u64 fields.
tcp_sock_write_txrx group does not contain pointers, but two u64 fields. It is possibl
tcp: tweak tcp_sock_write_txrx size assertion
I forgot 32bit arches might have 64bit alignment for u64 fields.
tcp_sock_write_txrx group does not contain pointers, but two u64 fields. It is possible that on 32bit kernel, a 32bit hole is before tp->tcp_clock_cache.
I will try to remember a group can be bigger on 32bit kernels in the future.
With help from Vladimir Oltean.
Fixes: d2c3a7eb1afa ("tcp: more struct tcp_sock adjustments") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404082207.HCEdQhUO-lkp@intel.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20240409140914.4105429-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
41eecbd7 |
| 07-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
TCP can transform a TIMEWAIT socket into a SYN_RECV one from a SYN packet, and the ISN of the SYNACK packet is normally generated using
tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field
TCP can transform a TIMEWAIT socket into a SYN_RECV one from a SYN packet, and the ISN of the SYNACK packet is normally generated using TIMEWAIT tw_snd_nxt :
tcp_timewait_state_process() ... u32 isn = tcptw->tw_snd_nxt + 65535 + 2; if (isn == 0) isn++; TCP_SKB_CB(skb)->tcp_tw_isn = isn; return TCP_TW_SYN;
This SYN packet also bypasses normal checks against listen queue being full or not.
tcp_conn_request() ... __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn; ... /* TW buckets are converted to open requests without * limitations, they conserve resources and peer is * evidently real one. */ if ((syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) { want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name); if (!want_cookie) goto drop; }
This was using TCP_SKB_CB(skb)->tcp_tw_isn field in skb.
Unfortunately this field has been accidentally cleared after the call to tcp_timewait_state_process() returning TCP_TW_SYN.
Using a field in TCP_SKB_CB(skb) for a temporary state is overkill.
Switch instead to a per-cpu variable.
As a bonus, we do not have to clear tcp_tw_isn in TCP receive fast path. It is temporarily set then cleared only in the TCP_TW_SYN dance.
Fixes: 4ad19de8774e ("net: tcp6: fix double call of tcp_v6_fill_cb()") Fixes: eeea10b83a13 ("tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
d2c3a7eb |
| 05-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: more struct tcp_sock adjustments
tp->recvmsg_inq is used from tcp recvmsg() thus should be in tcp_sock_read_rx group.
tp->tcp_clock_cache and tp->tcp_mstamp are written both in rx and tx paths
tcp: more struct tcp_sock adjustments
tp->recvmsg_inq is used from tcp recvmsg() thus should be in tcp_sock_read_rx group.
tp->tcp_clock_cache and tp->tcp_mstamp are written both in rx and tx paths, thus are better placed in tcp_sock_write_txrx group.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
f410cbea |
| 04-Apr-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: annotate data-races around tp->window_clamp
tp->window_clamp can be read locklessly, add READ_ONCE() and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by
tcp: annotate data-races around tp->window_clamp
tp->window_clamp can be read locklessly, add READ_ONCE() and WRITE_ONCE() annotations.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240404114231.2195171-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
151c9c72 |
| 22-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: properly terminate timers for kernel sockets
We had various syzbot reports about tcp timers firing after the corresponding netns has been dismantled.
Fortunately Josef Bacik could trigger the
tcp: properly terminate timers for kernel sockets
We had various syzbot reports about tcp timers firing after the corresponding netns has been dismantled.
Fortunately Josef Bacik could trigger the issue more often, and could test a patch I wrote two years ago.
When TCP sockets are closed, we call inet_csk_clear_xmit_timers() to 'stop' the timers.
inet_csk_clear_xmit_timers() can be called from any context, including when socket lock is held. This is the reason it uses sk_stop_timer(), aka del_timer(). This means that ongoing timers might finish much later.
For user sockets, this is fine because each running timer holds a reference on the socket, and the user socket holds a reference on the netns.
For kernel sockets, we risk that the netns is freed before timer can complete, because kernel sockets do not hold reference on the netns.
This patch adds inet_csk_clear_xmit_timers_sync() function that using sk_stop_timer_sync() to make sure all timers are terminated before the kernel socket is released. Modules using kernel sockets close them in their netns exit() handler.
Also add sock_not_owned_by_me() helper to get LOCKDEP support : inet_csk_clear_xmit_timers_sync() must not be called while socket lock is held.
It is very possible we can revert in the future commit 3a58f13a881e ("net: rds: acquire refcount on TCP sockets") which attempted to solve the issue in rds only. (net/smc/af_smc.c and net/mptcp/subflow.c have similar code)
We probably can remove the check_net() tests from tcp_out_of_resources() and __tcp_close() in the future.
Reported-by: Josef Bacik <josef@toxicpanda.com> Closes: https://lore.kernel.org/netdev/20240314210740.GA2823176@perftesting/ Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Fixes: 8a68173691f0 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") Link: https://lore.kernel.org/bpf/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@mail.gmail.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Josef Bacik <josef@toxicpanda.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/20240322135732.1535772-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
683a67da |
| 08-Mar-2024 |
Jason Xing <kernelxing@tencent.com> |
tcp: annotate a data-race around sysctl_tcp_wmem[0]
When reading wmem[0], it could be changed concurrently without READ_ONCE() protection. So add one annotation here.
Signed-off-by: Jason Xing <ker
tcp: annotate a data-race around sysctl_tcp_wmem[0]
When reading wmem[0], it could be changed concurrently without READ_ONCE() protection. So add one annotation here.
Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
716edc97 |
| 07-Mar-2024 |
Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> |
tcp: fix incorrect parameter validation in the do_tcp_getsockopt() function
The 'len' variable can't be negative when assigned the result of 'min_t' because all 'min_t' parameters are cast to unsign
tcp: fix incorrect parameter validation in the do_tcp_getsockopt() function
The 'len' variable can't be negative when assigned the result of 'min_t' because all 'min_t' parameters are cast to unsigned int, and then the minimum one is chosen.
To fix the logic, check 'len' as read from 'optlen', where the types of relevant variables are (signed) int.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
490a79fa |
| 06-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.co
net: introduce include/net/rps.h
Move RPS related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
345a6e26 |
| 01-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: align tcp_sock_write_rx group
Stephen Rothwell and kernel test robot reported that some arches (parisc, hexagon) and/or compilers would not like blamed commit.
Lets make sure tcp_sock_write_rx
tcp: align tcp_sock_write_rx group
Stephen Rothwell and kernel test robot reported that some arches (parisc, hexagon) and/or compilers would not like blamed commit.
Lets make sure tcp_sock_write_rx group does not start with a hole.
While we are at it, correct tcp_sock_write_tx CACHELINE_ASSERT_GROUP_SIZE() since after the blamed commit, we went to 105 bytes.
Fixes: 99123622050f ("tcp: remove some holes in struct tcp_sock") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/netdev/20240301121108.5d39e4f9@canb.auug.org.au/ Closes: https://lore.kernel.org/oe-kbuild-all/202403011451.csPYOS3C-lkp@intel.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Simon Horman <horms@kernel.org> # build-tested Link: https://lore.kernel.org/r/20240301171945.2958176-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
119ff048 |
| 08-Feb-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
tp->scaling_ratio is a read mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables")
tcp: move tp->scaling_ratio to tcp_sock_read_txrx group
tp->scaling_ratio is a read mostly field, used in rx and tx fast paths.
Fixes: d5fed5addb2b ("tcp: reorganize tcp_sock fast path variables") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Coco Li <lixiaoyan@google.com> Cc: Wei Wang <weiwan@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
577e4432 |
| 25-Jan-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: add sanity checks to rx zerocopy
TCP rx zerocopy intent is to map pages initially allocated from NIC drivers, not pages owned by a fs.
This patch adds to can_map_frag() these additional checks
tcp: add sanity checks to rx zerocopy
TCP rx zerocopy intent is to map pages initially allocated from NIC drivers, not pages owned by a fs.
This patch adds to can_map_frag() these additional checks:
- Page must not be a compound one. - page->mapping must be NULL.
This fixes the panic reported by ZhangPeng.
syzbot was able to loopback packets built with sendfile(), mapping pages owned by an ext4 file to TCP rx zerocopy.
r3 = socket$inet_tcp(0x2, 0x1, 0x0) mmap(&(0x7f0000ff9000/0x4000)=nil, 0x4000, 0x0, 0x12, r3, 0x0) r4 = socket$inet_tcp(0x2, 0x1, 0x0) bind$inet(r4, &(0x7f0000000000)={0x2, 0x4e24, @multicast1}, 0x10) connect$inet(r4, &(0x7f00000006c0)={0x2, 0x4e24, @empty}, 0x10) r5 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00', 0x181e42, 0x0) fallocate(r5, 0x0, 0x0, 0x85b8) sendfile(r4, r5, 0x0, 0x8ba0) getsockopt$inet_tcp_TCP_ZEROCOPY_RECEIVE(r4, 0x6, 0x23, &(0x7f00000001c0)={&(0x7f0000ffb000/0x3000)=nil, 0x3000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f0000000440)=0x40) r6 = openat$dir(0xffffffffffffff9c, &(0x7f00000000c0)='./file0\x00', 0x181e42, 0x0)
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Link: https://lore.kernel.org/netdev/5106a58e-04da-372a-b836-9d3d0bd2507b@huawei.com/T/ Reported-and-bisected-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: linux-mm@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
7267e8dc |
| 19-Jan-2024 |
Salvatore Dipietro <dipiets@amazon.com> |
tcp: Add memory barrier to tcp_push()
On CPUs with weak memory models, reads and updates performed by tcp_push to the sk variables can get reordered leaving the socket throttled when it should not.
tcp: Add memory barrier to tcp_push()
On CPUs with weak memory models, reads and updates performed by tcp_push to the sk variables can get reordered leaving the socket throttled when it should not. The tasklet running tcp_wfree() may also not observe the memory updates in time and will skip flushing any packets throttled by tcp_push(), delaying the sending. This can pathologically cause 40ms extra latency due to bad interactions with delayed acks.
Adding a memory barrier in tcp_push removes the bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle handling"). smp_mb__after_atomic() is used to not incur in unnecessary overhead on x86 since not affected.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
import java.io.IOException; import java.io.OutputStreamWriter; import java.io.PrintWriter; import javax.servlet.ServletException; import javax.servlet.http.HttpServlet; import javax.servlet.http.HttpServletRequest; import javax.servlet.http.HttpServletResponse;
public class HelloWorldServlet extends HttpServlet { @Override protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { response.setContentType("text/html;charset=utf-8"); OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8"); String s = "a".repeat(3096); osw.write(s,0,s.length()); osw.flush(); } }
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+ values is observed while, with the patch, the extra latency disappears.
No patch and tcp_autocorking=1 ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello ... 50.000% 0.91ms 75.000% 1.13ms 90.000% 1.46ms 99.000% 1.74ms 99.900% 1.89ms 99.990% 41.95ms <<< 40+ ms extra latency 99.999% 48.32ms 100.000% 48.96ms
With patch and tcp_autocorking=1 ./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello ... 50.000% 0.90ms 75.000% 1.13ms 90.000% 1.45ms 99.000% 1.72ms 99.900% 1.83ms 99.990% 2.11ms <<< no 40+ ms extra latency 99.999% 2.53ms 100.000% 2.62ms
Patch has been also tested on x86 (m7i.2xlarge instance) which it is not affected by this issue and the patch doesn't introduce any additional delay.
Fixes: 7aa5470c2c09 ("tcp: tsq: move tsq_flags close to sk_wmem_alloc") Signed-off-by: Salvatore Dipietro <dipiets@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
965c00e4 |
| 04-Dec-2023 |
Dmitry Safonov <dima@arista.com> |
net/tcp: Limit TCP_AO_REPAIR to non-listen sockets
Listen socket is not an established TCP connection, so setsockopt(TCP_AO_REPAIR) doesn't have any impact.
Restrict this uAPI for listen sockets.
net/tcp: Limit TCP_AO_REPAIR to non-listen sockets
Listen socket is not an established TCP connection, so setsockopt(TCP_AO_REPAIR) doesn't have any impact.
Restrict this uAPI for listen sockets.
Fixes: faadfaba5e01 ("net/tcp: Add TCP_AO_REPAIR") Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
d5fed5ad |
| 04-Dec-2023 |
Coco Li <lixiaoyan@google.com> |
tcp: reorganize tcp_sock fast path variables
The variables are organized according in the following way:
- TX read-mostly hotpath cache lines - TXRX read-mostly hotpath cache lines - RX read-mostly
tcp: reorganize tcp_sock fast path variables
The variables are organized according in the following way:
- TX read-mostly hotpath cache lines - TXRX read-mostly hotpath cache lines - RX read-mostly hotpath cache lines - TX read-write hotpath cache line - TXRX read-write hotpath cache line - RX read-write hotpath cache line
Fastpath cachelines end after rcvq_space.
Cache line boundaries are enforced only between read-mostly and read-write. That is, if read-mostly tx cachelines bleed into read-mostly txrx cachelines, we do not care. We care about the boundaries between read and write cachelines because we want to prevent false sharing.
Fast path variables span cache lines before change: 12 Fast path variables span cache lines after change: 8
Suggested-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231204201232.520025-3-lixiaoyan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
58d3aade |
| 04-Dec-2023 |
Paolo Abeni <pabeni@redhat.com> |
tcp: fix mid stream window clamp.
After the blamed commit below, if the user-space application performs window clamping when tp->rcv_wnd is 0, the TCP socket will never be able to announce a non 0 r
tcp: fix mid stream window clamp.
After the blamed commit below, if the user-space application performs window clamping when tp->rcv_wnd is 0, the TCP socket will never be able to announce a non 0 receive window, even after completely emptying the receive buffer and re-setting the window clamp to higher values.
Refactor tcp_set_window_clamp() to address the issue: when the user decreases the current clamp value, set rcv_ssthresh according to the same logic used at buffer initialization, but ensuring reserved mem provisioning.
To avoid code duplication factor-out the relevant bits from tcp_adjust_rcv_ssthresh() in a new helper and reuse it in the above scenario.
When increasing the clamp value, give the rcv_ssthresh a chance to grow according to previously implemented heuristic.
Fixes: 3aa7857fe1d7 ("tcp: enable mid stream window clamp") Reported-by: David Gibson <david@gibson.dropbear.id.au> Reported-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/705dad54e6e6e9a010e571bf58e0b35a8ae70503.1701706073.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
9fd7874c |
| 04-Dec-2023 |
Jens Axboe <axboe@kernel.dk> |
iov_iter: replace import_single_range() with import_ubuf()
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range
iov_iter: replace import_single_range() with import_ubuf()
With the removal of the 'iov' argument to import_single_range(), the two functions are now fully identical. Convert the import_single_range() callers to import_ubuf(), and remove the former fully.
Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20231204174827.1258875-3-axboe@kernel.dk Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|