Commit graph

69223 commits

Author SHA1 Message Date
Xin Long
9cc341286e dn_route: set rt neigh to blackhole_netdev instead of loopback_dev in ifdown
Like other places in ipv4/6 dst ifdown, change to use blackhole_netdev
instead of pernet loopback_dev in dn dst ifdown.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/0cdf10e5a4af509024f08644919121fb71645bc2.1652751029.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17 18:03:23 -07:00
Guangguan Wang
793a7df630 net/smc: rdma write inline if qp has sufficient inline space
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.

In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.

Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
		msg_size:1:2K:*2 -t 30 -vu tcp_lat

The results shown below:
msgsize     before       after
1B          11.2 us      10.6 us (-0.6 us)
2B          11.2 us      10.7 us (-0.5 us)
4B          11.3 us      10.7 us (-0.6 us)
8B          11.2 us      10.6 us (-0.6 us)
16B         11.3 us      10.7 us (-0.6 us)
32B         11.3 us      10.6 us (-0.7 us)
64B         11.2 us      11.2 us (0 us)
128B        11.2 us      11.2 us (0 us)
256B        11.2 us      11.2 us (0 us)
512B        11.4 us      11.3 us (-0.1 us)
1KB         11.4 us      11.5 us (0.1 us)
2KB         11.5 us      11.5 us (0 us)

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17 17:34:12 -07:00
Guangguan Wang
b632eb0697 net/smc: send cdc msg inline if qp has sufficient inline space
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.

In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.

Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
		msg_size:1:2K:*2 -t 30 -vu tcp_lat

The results shown below:
msgsize     before       after
1B          11.9 us      11.2 us (-0.7 us)
2B          11.7 us      11.2 us (-0.5 us)
4B          11.7 us      11.3 us (-0.4 us)
8B          11.6 us      11.2 us (-0.4 us)
16B         11.7 us      11.3 us (-0.4 us)
32B         11.7 us      11.3 us (-0.4 us)
64B         11.7 us      11.2 us (-0.5 us)
128B        11.6 us      11.2 us (-0.4 us)
256B        11.8 us      11.2 us (-0.6 us)
512B        11.8 us      11.4 us (-0.4 us)
1KB         11.9 us      11.4 us (-0.5 us)
2KB         12.1 us      11.5 us (-0.6 us)

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-17 17:34:12 -07:00
Lu Wei
a968c799eb ax25: merge repeat codes in ax25_dev_device_down()
Merge repeat codes to reduce the duplication.

Signed-off-by: Lu Wei <luwei32@huawei.com>
Link: https://lore.kernel.org/r/20220516062804.254742-1-luwei32@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-05-17 12:24:16 +02:00
Jakub Kicinski
f7b88d9ae9 linux-can-next-for-5.19-20220516
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEBsvAIBsPu6mG7thcrX5LkNig010FAmKCr/0THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCtfkuQ2KDTXbVWCAClh9ywNEPiSxXsB2k9EdVoijz0hv6I
 y6YLFPfHxc8w0ewOnBxJOBeLhVjIE2K3n81NaEudheBBImWV8CGlFPHu5sI4S01b
 aFcAuLT6SOat52xzkh44WGGjW517+EZM1Of1xzPhKdnR0nLvva52SpJlYwgLrkCN
 O6r30cDuU558H4g2BNxavbbNFPzL7HhkBpHmgWQg4q2p7LvLuTE6sN2CvNEgUWYM
 nIyK6P/FIyy3l7M7E873WhIEGNGRnAXRBEPXivRmKTlw0wizcvPxlWKm4YxTA0WL
 30fw58Hy4x6SZD+7+sP0QhdjiYOofgMtZFZn/LaJYE+rfaYLsFUH3A/2
 =tUrE
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2022-05-16

the first 2 patches are by me and target the CAN raw protocol. The 1st
removes an unneeded assignment, the other one adds support for
SO_TXTIME/SCM_TXTIME.

Oliver Hartkopp contributes 2 patches for the ISOTP protocol. The 1st
adds support for transmission without flow control, the other let's
bind() return an error on incorrect CAN ID formatting.

Geert Uytterhoeven contributes a patch to clean up ctucanfd's Kconfig
file.

Vincent Mailhol's patch for the slcan driver uses the proper function
to check for invalid CAN frames in the xmit callback.

The next patch is by Geert Uytterhoeven and makes the interrupt-names
of the renesas,rcar-canfd dt bindings mandatory.

A patch by my update the ctucanfd dt bindings to include the common
CAN controller bindings.

The last patch is by Akira Yokosawa and fixes a breakage the
ctucanfd's documentation.

* tag 'linux-can-next-for-5.19-20220516' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  docs: ctucanfd: Use 'kernel-figure' directive instead of 'figure'
  dt-bindings: can: ctucanfd: include common CAN controller bindings
  dt-bindings: can: renesas,rcar-canfd: Make interrupt-names required
  can: slcan: slc_xmit(): use can_dropped_invalid_skb() instead of manual check
  can: ctucanfd: Let users select instead of depend on CAN_CTUCANFD
  can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
  can: isotp: add support for transmission without flow control
  can: raw: add support for SO_TXTIME/SCM_TXTIME
  can: raw: raw_sendmsg(): remove not needed setting of skb->sk
====================

Link: https://lore.kernel.org/r/20220516202625.1129281-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16 17:06:21 -07:00
Florian Westphal
ea1e301d04 mptcp: sockopt: add TCP_DEFER_ACCEPT support
Support this via passthrough to the underlying tcp listener socket.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/271
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16 13:11:31 -07:00
Paolo Abeni
0ea5374255 Revert "mptcp: add data lock for sk timers"
This reverts commit 4293248c67.

Additional locks are not needed, all the touched sections
are already under mptcp socket lock protection.

Fixes: 4293248c67 ("mptcp: add data lock for sk timers")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-16 13:11:31 -07:00
Oliver Hartkopp
2aa39889c4 can: isotp: isotp_bind(): return -EINVAL on incorrect CAN ID formatting
Commit 3ea566422c ("can: isotp: sanitize CAN ID checks in
isotp_bind()") checks the given CAN ID address information by
sanitizing the input values.

This check (silently) removes obsolete bits by masking the given CAN
IDs.

Derek Will suggested to give a feedback to the application programmer
when the 'sanitizing' was actually needed which means the programmer
provided CAN ID content in a wrong format (e.g. SFF CAN IDs with a CAN
ID > 0x7FF).

Link: https://lore.kernel.org/all/20220515181633.76671-1-socketcan@hartkopp.net
Suggested-by: Derek Will <derekrobertwill@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16 22:03:45 +02:00
Oliver Hartkopp
9f39d36530 can: isotp: add support for transmission without flow control
Usually the ISO 15765-2 protocol is a point-to-point protocol to transfer
segmented PDUs to a dedicated receiver. This receiver sends a flow control
message to specify protocol options and timings (e.g. block size / STmin).

The so called functional addressing communication allows a 1:N
communication but is limited to a single frame length.

This new CAN_ISOTP_CF_BROADCAST allows an unconfirmed 1:N communication
with PDU length that would not fit into a single frame. This feature is
not covered by the ISO 15765-2 standard.

Link: https://lore.kernel.org/all/20220507115558.19065-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16 22:03:45 +02:00
Marc Kleine-Budde
51a0d5e511 can: raw: add support for SO_TXTIME/SCM_TXTIME
This patch calls into sock_cmsg_send() to parse the user supplied
control information into a struct sockcm_cookie. Then assign the
requested transmit time to the skb.

This makes it possible to use the Earliest TXTIME First (ETF) packet
scheduler with the CAN_RAW protocol. The user can send a CAN_RAW frame
with a TXTIME and the kernel (with the ETF scheduler) will take care
of sending it to the network interface.

Link: https://lore.kernel.org/all/20220502091946.1916211-3-mkl@pengutronix.de
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16 22:03:45 +02:00
Marc Kleine-Budde
2af84932b3 can: raw: raw_sendmsg(): remove not needed setting of skb->sk
The skb in raw_sendmsg() is allocated with sock_alloc_send_skb(),
which subsequently calls sock_alloc_send_pskb() -> skb_set_owner_w(),
which assigns "skb->sk = sk".

This patch removes the not needed setting of skb->sk.

Link: https://lore.kernel.org/all/20220502091946.1916211-2-mkl@pengutronix.de
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2022-05-16 22:03:45 +02:00
Eric Dumazet
9098765002 net: call skb_defer_free_flush() before each napi_poll()
skb_defer_free_flush() can consume cpu cycles,
it seems better to call it in the inner loop:

- Potentially frees page/skb that will be reallocated while hot.

- Account for the cpu cycles in the @time_limit determination.

- Keep softnet_data.defer_count small to reduce chances for
  skb_attempt_defer_free() to send an IPI.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 11:33:59 +01:00
Eric Dumazet
39564c3fdc net: add skb_defer_max sysctl
commit 68822bdf76 ("net: generalize skb freeing
deferral to per-cpu lists") added another per-cpu
cache of skbs. It was expected to be small,
and an IPI was forced whenever the list reached 128
skbs.

We might need to be able to control more precisely
queue capacity and added latency.

An IPI is generated whenever queue reaches half capacity.

Default value of the new limit is 64.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 11:33:59 +01:00
Eric Dumazet
2db60eed1a net: use napi_consume_skb() in skb_defer_free_flush()
skb_defer_free_flush() runs from softirq context,
we have the opportunity to refill the napi_alloc_cache,
and/or use kmem_cache_free_bulk() when this cache is full.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 11:33:59 +01:00
Eric Dumazet
97e719a82b net: fix possible race in skb_attempt_defer_free()
A cpu can observe sd->defer_count reaching 128,
and call smp_call_function_single_async()

Problem is that the remote CPU can clear sd->defer_count
before the IPI is run/acknowledged.

Other cpus can queue more packets and also decide
to call smp_call_function_single_async() while the pending
IPI was not yet delivered.

This is a common issue with smp_call_function_single_async().
Callers must ensure correct synchronization and serialization.

I triggered this issue while experimenting smaller threshold.
Performing the call to smp_call_function_single_async()
under sd->defer_lock protection did not solve the problem.

Commit 5a18ceca63 ("smp: Allow smp_call_function_single_async()
to insert locked csd") replaced an informative WARN_ON_ONCE()
with a return of -EBUSY, which is often ignored.
Test of CSD_FLAG_LOCK presence is racy anyway.

Fixes: 68822bdf76 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 11:33:59 +01:00
Menglong Dong
f8319dfd1b net: tcp: reset 'drop_reason' to NOT_SPCIFIED in tcp_v{4,6}_rcv()
The 'drop_reason' that passed to kfree_skb_reason() in tcp_v4_rcv()
and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET(0), as it is used as the
return value of tcp_inbound_md5_hash().

And it can panic the kernel with NULL pointer in
net_dm_packet_report_size() if the reason is 0, as drop_reasons[0]
is NULL.

Fixes: 1330b6ef33 ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:47:44 +01:00
Menglong Dong
20bbcd0a94 net: skb: check the boundrary of drop reason in kfree_skb_reason()
Sometimes, we may forget to reset skb drop reason to NOT_SPECIFIED after
we make it the return value of the functions with return type of enum
skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can
be SKB_NOT_DROPPED_YET(0), which is invalid for kfree_skb_reason().

So we check the range of drop reason in kfree_skb_reason() with
DEBUG_NET_WARN_ON_ONCE().

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:47:43 +01:00
Menglong Dong
a3af33abd9 net: dm: check the boundary of skb drop reasons
The 'reason' will be set to 'SKB_DROP_REASON_NOT_SPECIFIED' if it not
small that SKB_DROP_REASON_MAX in net_dm_packet_trace_kfree_skb_hit(),
but it can't avoid it to be 0, which is invalid and can cause NULL
pointer in drop_reasons.

Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.

Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:47:43 +01:00
Guangguan Wang
3aba103006 net/smc: align the connect behaviour with TCP
Connect with O_NONBLOCK will not be completed immediately
and returns -EINPROGRESS. It is possible to use selector/poll
for completion by selecting the socket for writing. After select
indicates writability, a second connect function call will return
0 to indicate connected successfully as TCP does, but smc returns
-EISCONN. Use socket state for smc to indicate connect state, which
can help smc aligning the connect behaviour with TCP.

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:40:07 +01:00
Eric Dumazet
eda090c31f inet: rename INET_MATCH()
This is no longer a macro, but an inlined function.

INET_MATCH() -> inet_match()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
5d368f0328 ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()
INET6_MATCH() runs without holding a lock on the socket.

We probably need to annotate most reads.

This patch makes INET6_MATCH() an inline function
to ease our changes.

v2: inline function only defined if IS_ENABLED(CONFIG_IPV6)
    Change the name to inet6_match(), this is no longer a macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
ff0094030f l2tp: use add READ_ONCE() to fetch sk->sk_bound_dev_if
Use READ_ONCE() in paths not holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
70f87de9fa net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()
sk->sk_bound_dev_if can change under us, use READ_ONCE() annotation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
d2c135619c inet: add READ_ONCE(sk->sk_bound_dev_if) in inet_csk_bind_conflict()
inet_csk_bind_conflict() can access sk->sk_bound_dev_if for
unlocked sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
36f7cec4f3 dccp: use READ_ONCE() to read sk->sk_bound_dev_if
When reading listener sk->sk_bound_dev_if locklessly,
we must use READ_ONCE().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
e5fccaa1eb net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if
sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val),
because other cpus/threads might locklessly read this field.

sock_getbindtodevice(), sock_getsockopt() need READ_ONCE()
because they run without socket lock held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
a20ea29807 sctp: read sk->sk_bound_dev_if once in sctp_rcv()
sctp_rcv() reads sk->sk_bound_dev_if twice while the socket
is not locked. Another cpu could change this field under us.

Fixes: 0fd9a65a76 ("[SCTP] Support SO_BINDTODEVICE socket option on incoming packets.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:06 +01:00
Eric Dumazet
4c971d2f35 net: annotate races around sk->sk_bound_dev_if
UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while
this field can be changed by another thread.

Adds minimal annotations to avoid KCSAN splats for UDP.
Following patches will add more annotations to potential lockless readers.

BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg

write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0:
 __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221
 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576
 __sys_connect_file net/socket.c:1900 [inline]
 __sys_connect+0x197/0x1b0 net/socket.c:1917
 __do_sys_connect net/socket.c:1927 [inline]
 __se_sys_connect net/socket.c:1924 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1924
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1:
 udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg net/socket.c:725 [inline]
 ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
 ___sys_sendmsg net/socket.c:2467 [inline]
 __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
 __do_sys_sendmmsg net/socket.c:2582 [inline]
 __se_sys_sendmmsg net/socket.c:2579 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0x00000000 -> 0xffffff9b

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G        W         5.18.0-rc1-syzkaller-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

I chose to not add Fixes: tag because race has minor consequences
and stable teams busy enough.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:31:05 +01:00
Coco Li
80e425b613 ipv6: Add hop-by-hop header to jumbograms in ip6_output
Instead of simply forcing a 0 payload_len in IPv6 header,
implement RFC 2675 and insert a custom extension header.

Note that only TCP stack is currently potentially generating
jumbograms, and that this extension header is purely local,
it wont be sent on a physical link.

This is needed so that packet capture (tcpdump and friends)
can properly dissect these large packets.

Signed-off-by: Coco Li <lixiaoyan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Alexander Duyck
0fe79f28bf net: allow gro_max_size to exceed 65536
Allow the gro_max_size to exceed a value larger than 65536.

There weren't really any external limitations that prevented this other
than the fact that IPv4 only supports a 16 bit length field. Since we have
the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
via a constant rather than relying on gro_max_size.

[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Eric Dumazet
81fbc81213 ipv6/gro: insert temporary HBH/jumbo header
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build
BIG TCP ipv6 packets (bigger than 64K).

This patch changes ipv6_gro_complete() to insert a HBH/jumbo header
so that resulting packet can go through IPv6/TCP stacks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Eric Dumazet
09f3d1a3a5 ipv6/gso: remove temporary HBH/jumbo header
ipv6 tcp and gro stacks will soon be able to build big TCP packets,
with an added temporary Hop By Hop header.

If GSO is involved for these large packets, we need to remove
the temporary HBH header before segmentation happens.

v2: perform HBH removal from ipv6_gso_segment() instead of
    skb_segment() (Alexander feedback)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Eric Dumazet
9957b38b5e tcp_cubic: make hystart_ack_delay() aware of BIG TCP
hystart_ack_delay() had the assumption that a TSO packet
would not be bigger than GSO_MAX_SIZE.

This will no longer be true.

We should use sk->sk_gso_max_size instead.

This reduces chances of spurious Hystart ACK train detections.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:56 +01:00
Alexander Duyck
7c4e983c4f net: allow gso_max_size to exceed 65536
The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.

With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.

So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.

v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
               in a new sk_trim_gso_size() helper.
               netif_set_tso_max_size() caps the requested TSO size
               with GSO_MAX_SIZE.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:55 +01:00
Eric Dumazet
89527be8d8 net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS
are used to report to user-space the device TSO limits.

ip -d link sh dev eth1
...
   tso_max_size 65536 tso_max_segs 65535

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:18:55 +01:00
David S. Miller
1a01a07517 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

This is v2 including deadlock fix in conntrack ecache rework
reported by Jakub Kicinski.

The following patchset contains Netfilter updates for net-next,
mostly updates to conntrack from Florian Westphal.

1) Add a dedicated list for conntrack event redelivery.

2) Include event redelivery list in conntrack dumps of dying type.

3) Remove per-cpu dying list for event redelivery, not used anymore.

4) Add netns .pre_exit to cttimeout to zap timeout objects before
   synchronize_rcu() call.

5) Remove nf_ct_unconfirmed_destroy.

6) Add generation id for conntrack extensions for conntrack
   timeout and helpers.

7) Detach timeout policy from conntrack on cttimeout module removal.

8) Remove __nf_ct_unconfirmed_destroy.

9) Remove unconfirmed list.

10) Remove unconditional local_bh_disable in init_conntrack().

11) Consolidate conntrack iterator nf_ct_iterate_cleanup().

12) Detect if ctnetlink listeners exist to short-circuit event
    path early.

13) Un-inline nf_ct_ecache_ext_add().

14) Add nf_conntrack_events autodetect ctnetlink listener mode
    and make it default.

15) Add nf_ct_ecache_exist() to check for event cache extension.

16) Extend flowtable reverse route lookup to include source, iif,
    tos and mark, from Sven Auhagen.

17) Do not verify zero checksum UDP packets in nf_reject,
    from Kevin Mitchell.

====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16 10:10:37 +01:00
Jakub Kicinski
2c5f153647 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2022-05-13

1) Cleanups for the code behind the XFRM offload API. This is a
   preparation for the extension of the API for policy offload.
   From Leon Romanovsky.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: drop not needed flags variable in XFRM offload struct
  net/mlx5e: Use XFRM state direction instead of flags
  netdevsim: rely on XFRM state direction instead of flags
  ixgbe: propagate XFRM offload state direction instead of flags
  xfrm: store and rely on direction to construct offload flags
  xfrm: rename xfrm_state_offload struct to allow reuse
  xfrm: delete not used number of external headers
  xfrm: free not used XFRM_ESP_NO_TRAILER flag
====================

Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13 10:25:08 -07:00
Kevin Mitchell
4f9bd53084 netfilter: conntrack: skip verification of zero UDP checksum
The checksum is optional for UDP packets. However nf_reject would
previously require a valid checksum to elicit a response such as
ICMP_DEST_UNREACH.

Add some logic to nf_reject_verify_csum to determine if a UDP packet has
a zero checksum and should therefore not be verified.

Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Sven Auhagen
3412e16418 netfilter: flowtable: nft_flow_route use more data for reverse route
When creating a flow table entry, the reverse route is looked
up based on the current packet.
There can be scenarios where the user creates a custom ip rule
to route the traffic differently.
In order to support those scenarios, the lookup needs to add
more information based on the current packet.
The patch adds multiple new information to the route lookup.

Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
90d1daa458 netfilter: conntrack: add nf_conntrack_events autodetect mode
This adds the new nf_conntrack_events=2 mode and makes it the
default.

This leverages the earlier flag in struct net to allow to avoid
the event extension as long as no event listener is active in
the namespace.

This avoids, for most cases, allocation of ct->ext area.
A followup patch will take further advantage of this by avoiding
calls down into the event framework if the extension isn't present.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
b0a7ab4a77 netfilter: conntrack: un-inline nf_ct_ecache_ext_add
Only called when new ct is allocated or the extension isn't present.
This function will be extended, place this in the conntrack module
instead of inlining.

The callers already depend on nf_conntrack module.
Return value is changed to bool, noone used the returned pointer.

Make sure that the core drops the newly allocated conntrack
if the extension is requested but can't be added.
This makes it necessary to ifdef the section, as the stub
always returns false we'd drop every new conntrack if the
the ecache extension is disabled in kconfig.

Add from data path (xt_CT, nft_ct) is unchanged.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Florian Westphal
2794cdb0b9 netfilter: nfnetlink: allow to detect if ctnetlink listeners exist
At this time, every new conntrack gets the 'event cache extension'
enabled for it.

This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults
to 1.

Changing the default to 0 means that commands that rely on the event
notification extension, e.g. 'conntrack -E' or conntrackd, stop working.

We COULD detect if there is a listener by means of
'nfnetlink_has_listeners()' and only add the extension if this is true.

The downside is a dependency from conntrack module to nfnetlink module.

This adds a different way: inc/dec a counter whenever a ctnetlink group
is being (un)subscribed and toggle a flag in struct net.

Next patches will take advantage of this and will only add the event
extension if the flag is set.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:28 +02:00
Pablo Neira Ayuso
8169ff5840 netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()
This patch adds a structure to collect all the context data that is
passed to the cleanup iterator.

 struct nf_ct_iter_data {
       struct net *net;
       void *data;
       u32 portid;
       int report;
 };

There is a netns field that allows to clean up conntrack entries
specifically owned by the specified netns.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:27 +02:00
Florian Westphal
0bcfbafbcd netfilter: conntrack: avoid unconditional local_bh_disable
Now that the conntrack entry isn't placed on the pcpu list anymore the
bh only needs to be disabled in the 'expectation present' case.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:56:27 +02:00
Florian Westphal
8a75a2c174 netfilter: conntrack: remove unconfirmed list
It has no function anymore and can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:53:27 +02:00
Florian Westphal
ace53fdc26 netfilter: conntrack: remove __nf_ct_unconfirmed_destroy
Its not needed anymore:

A. If entry is totally new, then the rcu-protected resource
must already have been removed from global visibility before call
to nf_ct_iterate_destroy.

B. If entry was allocated before, but is not yet in the hash table
   (uncofirmed case), genid gets incremented and synchronize_rcu() call
   makes sure access has completed.

C. Next attempt to peek at extension area will fail for unconfirmed
  conntracks, because ext->genid != genid.

D. Conntracks in the hash are iterated as before.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:17 +02:00
Florian Westphal
42df4fb9b1 netfilter: cttimeout: decouple unlink and free on netns destruction
Increment the extid on module removal; this makes sure that even
in extreme cases any old uncofirmed entry that happened to be kept
e.g. on nfnetlink_queue list will not trip over a stale timeout
reference.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
c56716c69c netfilter: extensions: introduce extension genid count
Multiple netfilter extensions store pointers to external data
in their extension area struct.

Examples:
1. Timeout policies
2. Connection tracking helpers.

No references are taken for these.

When a helper or timeout policy is removed, the conntrack table gets
traversed and affected extensions are cleared.

Conntrack entries not yet in the hashtable are referenced via a special
list, the unconfirmed list.

On removal of a policy or connection tracking helper, the unconfirmed
list gets traversed an all entries are marked as dying, this prevents
them from getting committed to the table at insertion time: core checks
for dying bit, if set, the conntrack entry gets destroyed at confirm
time.

The disadvantage is that each new conntrack has to be added to the percpu
unconfirmed list, and each insertion needs to remove it from this list.
The list is only ever needed when a policy or helper is removed -- a rare
occurrence.

Add a generation ID count: Instead of adding to the list and then
traversing that list on policy/helper removal, increment a counter
that is stored in the extension area.

For unconfirmed conntracks, the extension has the genid valid at ct
allocation time.

Removal of a helper/policy etc. increments the counter.
At confirmation time, validate that ext->genid == global_id.

If the stored number is not the same, do not allow the conntrack
insertion, just like as if a confirmed-list traversal would have flagged
the entry as dying.

After insertion, the genid is no longer relevant (conntrack entries
are now reachable via the conntrack table iterators and is set to 0.

This allows removal of the percpu unconfirmed list.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
17438b42ce netfilter: remove nf_ct_unconfirmed_destroy helper
This helper tags connections not yet in the conntrack table as
dying.  These nf_conn entries will be dropped instead when the
core attempts to insert them from the input or postrouting
'confirm' hook.

After the previous change, the entries get unlinked from the
list earlier, so that by the time the actual exit hook runs,
new connections no longer have a timeout policy assigned.

Its enough to walk the hashtable instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00
Florian Westphal
78222bacfc netfilter: cttimeout: decouple unlink and free on netns destruction
Make it so netns pre_exit unlinks the objects from the pernet list, so
they cannot be found anymore.

netns core issues a synchronize_rcu() before calling the exit hooks so
any the time the exit hooks run unconfirmed nf_conn entries have been
free'd or they have been committed to the hashtable.

The exit hook still tags unconfirmed entries as dying, this can
now be removed in a followup change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13 18:52:16 +02:00