networking changes for the 5.10 merge window

Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
 traversal in common container configs and improving TCP back-pressure.
 Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
 
 Expand netlink policy support and improve policy export to user space.
 (Ge)netlink core performs request validation according to declared
 policies. Expand the expressiveness of those policies (min/max length
 and bitmasks). Allow dumping policies for particular commands.
 This is used for feature discovery by user space (instead of kernel
 version parsing or trial and error).
 
 Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
 
 Allow more than 255 IPv4 multicast interfaces.
 
 Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
 packets of TCPv6.
 
 In Multi-patch TCP (MPTCP) support concurrent transmission of data
 on multiple subflows in a load balancing scenario. Enhance advertising
 addresses via the RM_ADDR/ADD_ADDR options.
 
 Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
 
 Allow more calls to same peer in RxRPC.
 
 Support two new Controller Area Network (CAN) protocols -
 CAN-FD and ISO 15765-2:2016.
 
 Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
 kernel problem.
 
 Add TC actions for implementing MPLS L2 VPNs.
 
 Improve nexthop code - e.g. handle various corner cases when nexthop
 objects are removed from groups better, skip unnecessary notifications
 and make it easier to offload nexthops into HW by converting
 to a blocking notifier.
 
 Support adding and consuming TCP header options by BPF programs,
 opening the doors for easy experimental and deployment-specific
 TCP option use.
 
 Reorganize TCP congestion control (CC) initialization to simplify life
 of TCP CC implemented in BPF.
 
 Add support for shipping BPF programs with the kernel and loading them
 early on boot via the User Mode Driver mechanism, hence reusing all the
 user space infra we have.
 
 Support sleepable BPF programs, initially targeting LSM and tracing.
 
 Add bpf_d_path() helper for returning full path for given 'struct path'.
 
 Make bpf_tail_call compatible with bpf-to-bpf calls.
 
 Allow BPF programs to call map_update_elem on sockmaps.
 
 Add BPF Type Format (BTF) support for type and enum discovery, as
 well as support for using BTF within the kernel itself (current use
 is for pretty printing structures).
 
 Support listing and getting information about bpf_links via the bpf
 syscall.
 
 Enhance kernel interfaces around NIC firmware update. Allow specifying
 overwrite mask to control if settings etc. are reset during update;
 report expected max time operation may take to users; support firmware
 activation without machine reboot incl. limits of how much impact
 reset may have (e.g. dropping link or not).
 
 Extend ethtool configuration interface to report IEEE-standard
 counters, to limit the need for per-vendor logic in user space.
 
 Adopt or extend devlink use for debug, monitoring, fw update
 in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
 mv88e6xxx, dpaa2-eth).
 
 In mlxsw expose critical and emergency SFP module temperature alarms.
 Refactor port buffer handling to make the defaults more suitable and
 support setting these values explicitly via the DCBNL interface.
 
 Add XDP support for Intel's igb driver.
 
 Support offloading TC flower classification and filtering rules to
 mscc_ocelot switches.
 
 Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
 fixed interval period pulse generator and one-step timestamping in
 dpaa-eth.
 
 Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
 offload.
 
 Add Lynx PHY/PCS MDIO module, and convert various drivers which have
 this HW to use it. Convert mvpp2 to split PCS.
 
 Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
 7-port Mediatek MT7531 IP.
 
 Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
 and wcn3680 support in wcn36xx.
 
 Improve performance for packets which don't require much offloads
 on recent Mellanox NICs by 20% by making multiple packets share
 a descriptor entry.
 
 Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
 subtree to drivers/net. Move MDIO drivers out of the phy directory.
 
 Clean up a lot of W=1 warnings, reportedly the actively developed
 subsections of networking drivers should now build W=1 warning free.
 
 Make sure drivers don't use in_interrupt() to dynamically adapt their
 code. Convert tasklets to use new tasklet_setup API (sadly this
 conversion is not yet complete).
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
 IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
 Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
 r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
 iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
 mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
 wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
 owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
 HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
 =bc1U
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multi-patch TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
This commit is contained in:
Linus Torvalds 2020-10-15 18:42:13 -07:00
commit 9ff9b0d392
2301 changed files with 130033 additions and 50954 deletions

View file

@ -124,6 +124,7 @@ enum bpf_cmd {
BPF_ENABLE_STATS,
BPF_ITER_CREATE,
BPF_LINK_DETACH,
BPF_PROG_BIND_MAP,
};
enum bpf_map_type {
@ -155,6 +156,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP_HASH,
BPF_MAP_TYPE_STRUCT_OPS,
BPF_MAP_TYPE_RINGBUF,
BPF_MAP_TYPE_INODE_STORAGE,
};
/* Note that tracing related programs such as
@ -345,19 +347,45 @@ enum bpf_link_type {
/* The verifier internal test flag. Behavior is undefined */
#define BPF_F_TEST_STATE_FREQ (1U << 3)
/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will
* restrict map and helper usage for such programs. Sleepable BPF programs can
* only be attached to hooks where kernel execution context allows sleeping.
* Such programs are allowed to use helpers that may sleep like
* bpf_copy_from_user().
*/
#define BPF_F_SLEEPABLE (1U << 4)
/* When BPF ldimm64's insn[0].src_reg != 0 then this can have
* two extensions:
* the following extensions:
*
* insn[0].src_reg: BPF_PSEUDO_MAP_FD BPF_PSEUDO_MAP_VALUE
* insn[0].imm: map fd map fd
* insn[1].imm: 0 offset into value
* insn[0].off: 0 0
* insn[1].off: 0 0
* ldimm64 rewrite: address of map address of map[0]+offset
* verifier type: CONST_PTR_TO_MAP PTR_TO_MAP_VALUE
* insn[0].src_reg: BPF_PSEUDO_MAP_FD
* insn[0].imm: map fd
* insn[1].imm: 0
* insn[0].off: 0
* insn[1].off: 0
* ldimm64 rewrite: address of map
* verifier type: CONST_PTR_TO_MAP
*/
#define BPF_PSEUDO_MAP_FD 1
/* insn[0].src_reg: BPF_PSEUDO_MAP_VALUE
* insn[0].imm: map fd
* insn[1].imm: offset into value
* insn[0].off: 0
* insn[1].off: 0
* ldimm64 rewrite: address of map[0]+offset
* verifier type: PTR_TO_MAP_VALUE
*/
#define BPF_PSEUDO_MAP_VALUE 2
/* insn[0].src_reg: BPF_PSEUDO_BTF_ID
* insn[0].imm: kernel btd id of VAR
* insn[1].imm: 0
* insn[0].off: 0
* insn[1].off: 0
* ldimm64 rewrite: address of the kernel variable
* verifier type: PTR_TO_BTF_ID or PTR_TO_MEM, depending on whether the var
* is struct/union.
*/
#define BPF_PSEUDO_BTF_ID 3
/* when bpf_call->src_reg == BPF_PSEUDO_CALL, bpf_call->imm == pc-relative
* offset to another bpf function
@ -404,6 +432,12 @@ enum {
/* Enable memory-mapping BPF map */
BPF_F_MMAPABLE = (1U << 10),
/* Share perf_event among processes */
BPF_F_PRESERVE_ELEMS = (1U << 11),
/* Create a map that is suitable to be an inner map with dynamic max entries */
BPF_F_INNER_MAP = (1U << 12),
};
/* Flags for BPF_PROG_QUERY. */
@ -414,6 +448,11 @@ enum {
*/
#define BPF_F_QUERY_EFFECTIVE (1U << 0)
/* Flags for BPF_PROG_TEST_RUN */
/* If set, run the test on the cpu specified by bpf_attr.test.cpu */
#define BPF_F_TEST_RUN_ON_CPU (1U << 0)
/* type for BPF_ENABLE_STATS */
enum bpf_stats_type {
/* enabled run_time_ns and run_cnt */
@ -556,6 +595,8 @@ union bpf_attr {
*/
__aligned_u64 ctx_in;
__aligned_u64 ctx_out;
__u32 flags;
__u32 cpu;
} test;
struct { /* anonymous struct used by BPF_*_GET_*_ID */
@ -622,8 +663,13 @@ union bpf_attr {
};
__u32 attach_type; /* attach type */
__u32 flags; /* extra flags */
__aligned_u64 iter_info; /* extra bpf_iter_link_info */
__u32 iter_info_len; /* iter_info length */
union {
__u32 target_btf_id; /* btf_id of target to attach to */
struct {
__aligned_u64 iter_info; /* extra bpf_iter_link_info */
__u32 iter_info_len; /* iter_info length */
};
};
} link_create;
struct { /* struct used by BPF_LINK_UPDATE command */
@ -649,6 +695,12 @@ union bpf_attr {
__u32 flags;
} iter_create;
struct { /* struct used by BPF_PROG_BIND_MAP command */
__u32 prog_fd;
__u32 map_fd;
__u32 flags; /* extra flags */
} prog_bind_map;
} __attribute__((aligned(8)));
/* The description below is an attempt at providing documentation to eBPF
@ -1438,8 +1490,8 @@ union bpf_attr {
* Return
* The return value depends on the result of the test, and can be:
*
* * 0, if the *skb* task belongs to the cgroup2.
* * 1, if the *skb* task does not belong to the cgroup2.
* * 0, if current task belongs to the cgroup2.
* * 1, if current task does not belong to the cgroup2.
* * A negative error code, if an error occurred.
*
* long bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags)
@ -1649,7 +1701,7 @@ union bpf_attr {
* **TCP_CONGESTION**, **TCP_BPF_IW**,
* **TCP_BPF_SNDCWND_CLAMP**, **TCP_SAVE_SYN**,
* **TCP_KEEPIDLE**, **TCP_KEEPINTVL**, **TCP_KEEPCNT**,
* **TCP_SYNCNT**, **TCP_USER_TIMEOUT**.
* **TCP_SYNCNT**, **TCP_USER_TIMEOUT**, **TCP_NOTSENT_LOWAT**.
* * **IPPROTO_IP**, which supports *optname* **IP_TOS**.
* * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**.
* Return
@ -2204,7 +2256,7 @@ union bpf_attr {
* Description
* This helper is used in programs implementing policies at the
* skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
* if the verdeict eBPF program returns **SK_PASS**), redirect it
* if the verdict eBPF program returns **SK_PASS**), redirect it
* to the socket referenced by *map* (of type
* **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
* egress interfaces can be used for redirection. The
@ -2496,7 +2548,7 @@ union bpf_attr {
* result is from *reuse*\ **->socks**\ [] using the hash of the
* tuple.
*
* long bpf_sk_release(struct bpf_sock *sock)
* long bpf_sk_release(void *sock)
* Description
* Release the reference held by *sock*. *sock* must be a
* non-**NULL** pointer that was returned from
@ -2676,7 +2728,7 @@ union bpf_attr {
* result is from *reuse*\ **->socks**\ [] using the hash of the
* tuple.
*
* long bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
* long bpf_tcp_check_syncookie(void *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
* Description
* Check whether *iph* and *th* contain a valid SYN cookie ACK for
* the listening socket in *sk*.
@ -2807,7 +2859,7 @@ union bpf_attr {
*
* **-ERANGE** if resulting value was out of range.
*
* void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags)
* void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)
* Description
* Get a bpf-local-storage from a *sk*.
*
@ -2823,6 +2875,9 @@ union bpf_attr {
* "type". The bpf-local-storage "type" (i.e. the *map*) is
* searched against all bpf-local-storages residing at *sk*.
*
* *sk* is a kernel **struct sock** pointer for LSM program.
* *sk* is a **struct bpf_sock** pointer for other program types.
*
* An optional *flags* (**BPF_SK_STORAGE_GET_F_CREATE**) can be
* used such that a new bpf-local-storage will be
* created if one does not exist. *value* can be used
@ -2835,13 +2890,14 @@ union bpf_attr {
* **NULL** if not found or there was an error in adding
* a new bpf-local-storage.
*
* long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk)
* long bpf_sk_storage_delete(struct bpf_map *map, void *sk)
* Description
* Delete a bpf-local-storage from a *sk*.
* Return
* 0 on success.
*
* **-ENOENT** if the bpf-local-storage cannot be found.
* **-EINVAL** if sk is not a fullsock (e.g. a request_sock).
*
* long bpf_send_signal(u32 sig)
* Description
@ -2858,7 +2914,7 @@ union bpf_attr {
*
* **-EAGAIN** if bpf program can try again.
*
* s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
* s64 bpf_tcp_gen_syncookie(void *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
* Description
* Try to issue a SYN cookie for the packet with corresponding
* IP/TCP headers, *iph* and *th*, on the listening socket in *sk*.
@ -3087,7 +3143,7 @@ union bpf_attr {
* Return
* The id is returned or 0 in case the id could not be retrieved.
*
* long bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
* long bpf_sk_assign(struct sk_buff *skb, void *sk, u64 flags)
* Description
* Helper is overloaded depending on BPF program type. This
* description applies to **BPF_PROG_TYPE_SCHED_CLS** and
@ -3215,11 +3271,11 @@ union bpf_attr {
*
* **-EOVERFLOW** if an overflow happened: The same object will be tried again.
*
* u64 bpf_sk_cgroup_id(struct bpf_sock *sk)
* u64 bpf_sk_cgroup_id(void *sk)
* Description
* Return the cgroup v2 id of the socket *sk*.
*
* *sk* must be a non-**NULL** pointer to a full socket, e.g. one
* *sk* must be a non-**NULL** pointer to a socket, e.g. one
* returned from **bpf_sk_lookup_xxx**\ (),
* **bpf_sk_fullsock**\ (), etc. The format of returned id is
* same as in **bpf_skb_cgroup_id**\ ().
@ -3229,7 +3285,7 @@ union bpf_attr {
* Return
* The id is returned or 0 in case the id could not be retrieved.
*
* u64 bpf_sk_ancestor_cgroup_id(struct bpf_sock *sk, int ancestor_level)
* u64 bpf_sk_ancestor_cgroup_id(void *sk, int ancestor_level)
* Description
* Return id of cgroup v2 that is ancestor of cgroup associated
* with the *sk* at the *ancestor_level*. The root cgroup is at
@ -3337,38 +3393,38 @@ union bpf_attr {
* Description
* Dynamically cast a *sk* pointer to a *tcp6_sock* pointer.
* Return
* *sk* if casting is valid, or NULL otherwise.
* *sk* if casting is valid, or **NULL** otherwise.
*
* struct tcp_sock *bpf_skc_to_tcp_sock(void *sk)
* Description
* Dynamically cast a *sk* pointer to a *tcp_sock* pointer.
* Return
* *sk* if casting is valid, or NULL otherwise.
* *sk* if casting is valid, or **NULL** otherwise.
*
* struct tcp_timewait_sock *bpf_skc_to_tcp_timewait_sock(void *sk)
* Description
* Dynamically cast a *sk* pointer to a *tcp_timewait_sock* pointer.
* Return
* *sk* if casting is valid, or NULL otherwise.
* *sk* if casting is valid, or **NULL** otherwise.
*
* struct tcp_request_sock *bpf_skc_to_tcp_request_sock(void *sk)
* Description
* Dynamically cast a *sk* pointer to a *tcp_request_sock* pointer.
* Return
* *sk* if casting is valid, or NULL otherwise.
* *sk* if casting is valid, or **NULL** otherwise.
*
* struct udp6_sock *bpf_skc_to_udp6_sock(void *sk)
* Description
* Dynamically cast a *sk* pointer to a *udp6_sock* pointer.
* Return
* *sk* if casting is valid, or NULL otherwise.
* *sk* if casting is valid, or **NULL** otherwise.
*
* long bpf_get_task_stack(struct task_struct *task, void *buf, u32 size, u64 flags)
* Description
* Return a user or a kernel stack in bpf program provided buffer.
* To achieve this, the helper needs *task*, which is a valid
* pointer to struct task_struct. To store the stacktrace, the
* bpf program provides *buf* with a nonnegative *size*.
* pointer to **struct task_struct**. To store the stacktrace, the
* bpf program provides *buf* with a nonnegative *size*.
*
* The last argument, *flags*, holds the number of stack frames to
* skip (from 0 to 255), masked with
@ -3395,6 +3451,293 @@ union bpf_attr {
* A non-negative value equal to or less than *size* on success,
* or a negative error in case of failure.
*
* long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
* Description
* Load header option. Support reading a particular TCP header
* option for bpf program (**BPF_PROG_TYPE_SOCK_OPS**).
*
* If *flags* is 0, it will search the option from the
* *skops*\ **->skb_data**. The comment in **struct bpf_sock_ops**
* has details on what skb_data contains under different
* *skops*\ **->op**.
*
* The first byte of the *searchby_res* specifies the
* kind that it wants to search.
*
* If the searching kind is an experimental kind
* (i.e. 253 or 254 according to RFC6994). It also
* needs to specify the "magic" which is either
* 2 bytes or 4 bytes. It then also needs to
* specify the size of the magic by using
* the 2nd byte which is "kind-length" of a TCP
* header option and the "kind-length" also
* includes the first 2 bytes "kind" and "kind-length"
* itself as a normal TCP header option also does.
*
* For example, to search experimental kind 254 with
* 2 byte magic 0xeB9F, the searchby_res should be
* [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ].
*
* To search for the standard window scale option (3),
* the *searchby_res* should be [ 3, 0, 0, .... 0 ].
* Note, kind-length must be 0 for regular option.
*
* Searching for No-Op (0) and End-of-Option-List (1) are
* not supported.
*
* *len* must be at least 2 bytes which is the minimal size
* of a header option.
*
* Supported flags:
*
* * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the
* saved_syn packet or the just-received syn packet.
*
* Return
* > 0 when found, the header option is copied to *searchby_res*.
* The return value is the total length copied. On failure, a
* negative error code is returned:
*
* **-EINVAL** if a parameter is invalid.
*
* **-ENOMSG** if the option is not found.
*
* **-ENOENT** if no syn packet is available when
* **BPF_LOAD_HDR_OPT_TCP_SYN** is used.
*
* **-ENOSPC** if there is not enough space. Only *len* number of
* bytes are copied.
*
* **-EFAULT** on failure to parse the header options in the
* packet.
*
* **-EPERM** if the helper cannot be used under the current
* *skops*\ **->op**.
*
* long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags)
* Description
* Store header option. The data will be copied
* from buffer *from* with length *len* to the TCP header.
*
* The buffer *from* should have the whole option that
* includes the kind, kind-length, and the actual
* option data. The *len* must be at least kind-length
* long. The kind-length does not have to be 4 byte
* aligned. The kernel will take care of the padding
* and setting the 4 bytes aligned value to th->doff.
*
* This helper will check for duplicated option
* by searching the same option in the outgoing skb.
*
* This helper can only be called during
* **BPF_SOCK_OPS_WRITE_HDR_OPT_CB**.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** If param is invalid.
*
* **-ENOSPC** if there is not enough space in the header.
* Nothing has been written
*
* **-EEXIST** if the option already exists.
*
* **-EFAULT** on failrue to parse the existing header options.
*
* **-EPERM** if the helper cannot be used under the current
* *skops*\ **->op**.
*
* long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags)
* Description
* Reserve *len* bytes for the bpf header option. The
* space will be used by **bpf_store_hdr_opt**\ () later in
* **BPF_SOCK_OPS_WRITE_HDR_OPT_CB**.
*
* If **bpf_reserve_hdr_opt**\ () is called multiple times,
* the total number of bytes will be reserved.
*
* This helper can only be called during
* **BPF_SOCK_OPS_HDR_OPT_LEN_CB**.
*
* Return
* 0 on success, or negative error in case of failure:
*
* **-EINVAL** if a parameter is invalid.
*
* **-ENOSPC** if there is not enough space in the header.
*
* **-EPERM** if the helper cannot be used under the current
* *skops*\ **->op**.
*
* void *bpf_inode_storage_get(struct bpf_map *map, void *inode, void *value, u64 flags)
* Description
* Get a bpf_local_storage from an *inode*.
*
* Logically, it could be thought of as getting the value from
* a *map* with *inode* as the **key**. From this
* perspective, the usage is not much different from
* **bpf_map_lookup_elem**\ (*map*, **&**\ *inode*) except this
* helper enforces the key must be an inode and the map must also
* be a **BPF_MAP_TYPE_INODE_STORAGE**.
*
* Underneath, the value is stored locally at *inode* instead of
* the *map*. The *map* is used as the bpf-local-storage
* "type". The bpf-local-storage "type" (i.e. the *map*) is
* searched against all bpf_local_storage residing at *inode*.
*
* An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
* used such that a new bpf_local_storage will be
* created if one does not exist. *value* can be used
* together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
* the initial value of a bpf_local_storage. If *value* is
* **NULL**, the new bpf_local_storage will be zero initialized.
* Return
* A bpf_local_storage pointer is returned on success.
*
* **NULL** if not found or there was an error in adding
* a new bpf_local_storage.
*
* int bpf_inode_storage_delete(struct bpf_map *map, void *inode)
* Description
* Delete a bpf_local_storage from an *inode*.
* Return
* 0 on success.
*
* **-ENOENT** if the bpf_local_storage cannot be found.
*
* long bpf_d_path(struct path *path, char *buf, u32 sz)
* Description
* Return full path for given **struct path** object, which
* needs to be the kernel BTF *path* object. The path is
* returned in the provided buffer *buf* of size *sz* and
* is zero terminated.
*
* Return
* On success, the strictly positive length of the string,
* including the trailing NUL character. On error, a negative
* value.
*
* long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
* Description
* Read *size* bytes from user space address *user_ptr* and store
* the data in *dst*. This is a wrapper of **copy_from_user**\ ().
* Return
* 0 on success, or a negative error in case of failure.
*
* long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr, u32 btf_ptr_size, u64 flags)
* Description
* Use BTF to store a string representation of *ptr*->ptr in *str*,
* using *ptr*->type_id. This value should specify the type
* that *ptr*->ptr points to. LLVM __builtin_btf_type_id(type, 1)
* can be used to look up vmlinux BTF type ids. Traversing the
* data structure using BTF, the type information and values are
* stored in the first *str_size* - 1 bytes of *str*. Safe copy of
* the pointer data is carried out to avoid kernel crashes during
* operation. Smaller types can use string space on the stack;
* larger programs can use map data to store the string
* representation.
*
* The string can be subsequently shared with userspace via
* bpf_perf_event_output() or ring buffer interfaces.
* bpf_trace_printk() is to be avoided as it places too small
* a limit on string size to be useful.
*
* *flags* is a combination of
*
* **BTF_F_COMPACT**
* no formatting around type information
* **BTF_F_NONAME**
* no struct/union member names/types
* **BTF_F_PTR_RAW**
* show raw (unobfuscated) pointer values;
* equivalent to printk specifier %px.
* **BTF_F_ZERO**
* show zero-valued struct/union members; they
* are not displayed by default
*
* Return
* The number of bytes that were written (or would have been
* written if output had to be truncated due to string size),
* or a negative error in cases of failure.
*
* long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr, u32 ptr_size, u64 flags)
* Description
* Use BTF to write to seq_write a string representation of
* *ptr*->ptr, using *ptr*->type_id as per bpf_snprintf_btf().
* *flags* are identical to those used for bpf_snprintf_btf.
* Return
* 0 on success or a negative error in case of failure.
*
* u64 bpf_skb_cgroup_classid(struct sk_buff *skb)
* Description
* See **bpf_get_cgroup_classid**\ () for the main description.
* This helper differs from **bpf_get_cgroup_classid**\ () in that
* the cgroup v1 net_cls class is retrieved only from the *skb*'s
* associated socket instead of the current process.
* Return
* The id is returned or 0 in case the id could not be retrieved.
*
* long bpf_redirect_neigh(u32 ifindex, u64 flags)
* Description
* Redirect the packet to another net device of index *ifindex*
* and fill in L2 addresses from neighboring subsystem. This helper
* is somewhat similar to **bpf_redirect**\ (), except that it
* populates L2 addresses as well, meaning, internally, the helper
* performs a FIB lookup based on the skb's networking header to
* get the address of the next hop and then relies on the neighbor
* lookup for the L2 address of the nexthop.
*
* The *flags* argument is reserved and must be 0. The helper is
* currently only supported for tc BPF program types, and enabled
* for IPv4 and IPv6 protocols.
* Return
* The helper returns **TC_ACT_REDIRECT** on success or
* **TC_ACT_SHOT** on error.
*
* void *bpf_per_cpu_ptr(const void *percpu_ptr, u32 cpu)
* Description
* Take a pointer to a percpu ksym, *percpu_ptr*, and return a
* pointer to the percpu kernel variable on *cpu*. A ksym is an
* extern variable decorated with '__ksym'. For ksym, there is a
* global var (either static or global) defined of the same name
* in the kernel. The ksym is percpu if the global var is percpu.
* The returned pointer points to the global percpu var on *cpu*.
*
* bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the
* kernel, except that bpf_per_cpu_ptr() may return NULL. This
* happens if *cpu* is larger than nr_cpu_ids. The caller of
* bpf_per_cpu_ptr() must check the returned value.
* Return
* A pointer pointing to the kernel percpu variable on *cpu*, or
* NULL, if *cpu* is invalid.
*
* void *bpf_this_cpu_ptr(const void *percpu_ptr)
* Description
* Take a pointer to a percpu ksym, *percpu_ptr*, and return a
* pointer to the percpu kernel variable on this cpu. See the
* description of 'ksym' in **bpf_per_cpu_ptr**\ ().
*
* bpf_this_cpu_ptr() has the same semantic as this_cpu_ptr() in
* the kernel. Different from **bpf_per_cpu_ptr**\ (), it would
* never return NULL.
* Return
* A pointer pointing to the kernel percpu variable on this cpu.
*
* long bpf_redirect_peer(u32 ifindex, u64 flags)
* Description
* Redirect the packet to another net device of index *ifindex*.
* This helper is somewhat similar to **bpf_redirect**\ (), except
* that the redirection happens to the *ifindex*' peer device and
* the netns switch takes place from ingress to ingress without
* going through the CPU's backlog queue.
*
* The *flags* argument is reserved and must be 0. The helper is
* currently only supported for tc BPF program types at the ingress
* hook and for veth device types. The peer device must reside in a
* different network namespace.
* Return
* The helper returns **TC_ACT_REDIRECT** on success or
* **TC_ACT_SHOT** on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@ -3539,6 +3882,20 @@ union bpf_attr {
FN(skc_to_tcp_request_sock), \
FN(skc_to_udp6_sock), \
FN(get_task_stack), \
FN(load_hdr_opt), \
FN(store_hdr_opt), \
FN(reserve_hdr_opt), \
FN(inode_storage_get), \
FN(inode_storage_delete), \
FN(d_path), \
FN(copy_from_user), \
FN(snprintf_btf), \
FN(seq_printf_btf), \
FN(skb_cgroup_classid), \
FN(redirect_neigh), \
FN(bpf_per_cpu_ptr), \
FN(bpf_this_cpu_ptr), \
FN(redirect_peer), \
/* */
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@ -3648,9 +4005,13 @@ enum {
BPF_F_SYSCTL_BASE_NAME = (1ULL << 0),
};
/* BPF_FUNC_sk_storage_get flags */
/* BPF_FUNC_<kernel_obj>_storage_get flags */
enum {
BPF_SK_STORAGE_GET_F_CREATE = (1ULL << 0),
BPF_LOCAL_STORAGE_GET_F_CREATE = (1ULL << 0),
/* BPF_SK_STORAGE_GET_F_CREATE is only kept for backward compatibility
* and BPF_LOCAL_STORAGE_GET_F_CREATE must be used instead.
*/
BPF_SK_STORAGE_GET_F_CREATE = BPF_LOCAL_STORAGE_GET_F_CREATE,
};
/* BPF_FUNC_read_branch_records flags. */
@ -4071,6 +4432,15 @@ struct bpf_link_info {
__u64 cgroup_id;
__u32 attach_type;
} cgroup;
struct {
__aligned_u64 target_name; /* in/out: target_name buffer ptr */
__u32 target_name_len; /* in/out: target_name buffer len */
union {
struct {
__u32 map_id;
} map;
};
} iter;
struct {
__u32 netns_ino;
__u32 attach_type;
@ -4158,6 +4528,36 @@ struct bpf_sock_ops {
__u64 bytes_received;
__u64 bytes_acked;
__bpf_md_ptr(struct bpf_sock *, sk);
/* [skb_data, skb_data_end) covers the whole TCP header.
*
* BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received
* BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the
* header has not been written.
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have
* been written so far.
* BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes
* the 3WHS.
* BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
* the 3WHS.
*
* bpf_load_hdr_opt() can also be used to read a particular option.
*/
__bpf_md_ptr(void *, skb_data);
__bpf_md_ptr(void *, skb_data_end);
__u32 skb_len; /* The total length of a packet.
* It includes the header, options,
* and payload.
*/
__u32 skb_tcp_flags; /* tcp_flags of the header. It provides
* an easy way to check for tcp_flags
* without parsing skb_data.
*
* In particular, the skb_tcp_flags
* will still be available in
* BPF_SOCK_OPS_HDR_OPT_LEN even though
* the outgoing header has not
* been written yet.
*/
};
/* Definitions for bpf_sock_ops_cb_flags */
@ -4166,8 +4566,51 @@ enum {
BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1),
BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2),
BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3),
/* Call bpf for all received TCP headers. The bpf prog will be
* called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*
* It could be used at the client/active side (i.e. connect() side)
* when the server told it that the server was in syncookie
* mode and required the active side to resend the bpf-written
* options. The active side can keep writing the bpf-options until
* it received a valid packet from the server side to confirm
* the earlier packet (and options) has been received. The later
* example patch is using it like this at the active side when the
* server is in syncookie mode.
*
* The bpf prog will usually turn this off in the common cases.
*/
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4),
/* Call bpf when kernel has received a header option that
* the kernel cannot handle. The bpf prog will be called under
* sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB
* for the header option related helpers that will be useful
* to the bpf programs.
*/
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5),
/* Call bpf when the kernel is writing header options for the
* outgoing packet. The bpf prog will first be called
* to reserve space in a skb under
* sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. Then
* the bpf prog will be called to write the header option(s)
* under sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*
* Please refer to the comment in BPF_SOCK_OPS_HDR_OPT_LEN_CB
* and BPF_SOCK_OPS_WRITE_HDR_OPT_CB for the header option
* related helpers that will be useful to the bpf programs.
*
* The kernel gets its chance to reserve space and write
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
/* Mask of all currently supported cb flags */
BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF,
BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F,
};
/* List of known BPF sock_ops operators.
@ -4223,6 +4666,63 @@ enum {
*/
BPF_SOCK_OPS_RTT_CB, /* Called on every RTT.
*/
BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option.
* It will be called to handle
* the packets received at
* an already established
* connection.
*
* sock_ops->skb_data:
* Referring to the received skb.
* It covers the TCP header only.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option.
*/
BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the
* header option later in
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
* Arg1: bool want_cookie. (in
* writing SYNACK only)
*
* sock_ops->skb_data:
* Not available because no header has
* been written yet.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the
* outgoing skb. (e.g. SYN, ACK, FIN).
*
* bpf_reserve_hdr_opt() should
* be used to reserve space.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options
* Arg1: bool want_cookie. (in
* writing SYNACK only)
*
* sock_ops->skb_data:
* Referring to the outgoing skb.
* It covers the TCP header
* that has already been written
* by the kernel and the
* earlier bpf-progs.
*
* sock_ops->skb_tcp_flags:
* The tcp_flags of the outgoing
* skb. (e.g. SYN, ACK, FIN).
*
* bpf_store_hdr_opt() should
* be used to write the
* option.
*
* bpf_load_hdr_opt() can also
* be used to search for a
* particular option that
* has already been written
* by the kernel or the
* earlier bpf-progs.
*/
};
/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
@ -4250,6 +4750,63 @@ enum {
enum {
TCP_BPF_IW = 1001, /* Set TCP initial congestion window */
TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */
TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */
TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */
/* Copy the SYN pkt to optval
*
* BPF_PROG_TYPE_SOCK_OPS only. It is similar to the
* bpf_getsockopt(TCP_SAVED_SYN) but it does not limit
* to only getting from the saved_syn. It can either get the
* syn packet from:
*
* 1. the just-received SYN packet (only available when writing the
* SYNACK). It will be useful when it is not necessary to
* save the SYN packet for latter use. It is also the only way
* to get the SYN during syncookie mode because the syn
* packet cannot be saved during syncookie.
*
* OR
*
* 2. the earlier saved syn which was done by
* bpf_setsockopt(TCP_SAVE_SYN).
*
* The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the
* SYN packet is obtained.
*
* If the bpf-prog does not need the IP[46] header, the
* bpf-prog can avoid parsing the IP header by using
* TCP_BPF_SYN. Otherwise, the bpf-prog can get both
* IP[46] and TCP header by using TCP_BPF_SYN_IP.
*
* >0: Total number of bytes copied
* -ENOSPC: Not enough space in optval. Only optlen number of
* bytes is copied.
* -ENOENT: The SYN skb is not available now and the earlier SYN pkt
* is not saved by setsockopt(TCP_SAVE_SYN).
*/
TCP_BPF_SYN = 1005, /* Copy the TCP header */
TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */
TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */
};
enum {
BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0),
};
/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and
* BPF_SOCK_OPS_WRITE_HDR_OPT_CB.
*/
enum {
BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the
* total option spaces
* required for an established
* sk in order to calculate the
* MSS. No skb is actually
* sent.
*/
BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode
* when sending a SYN.
*/
};
struct bpf_perf_event_value {
@ -4447,4 +5004,34 @@ struct bpf_sk_lookup {
__u32 local_port; /* Host byte order */
};
/*
* struct btf_ptr is used for typed pointer representation; the
* type id is used to render the pointer data as the appropriate type
* via the bpf_snprintf_btf() helper described above. A flags field -
* potentially to specify additional details about the BTF pointer
* (rather than its mode of display) - is included for future use.
* Display flags - BTF_F_* - are passed to bpf_snprintf_btf separately.
*/
struct btf_ptr {
void *ptr;
__u32 type_id;
__u32 flags; /* BTF ptr flags; unused at present. */
};
/*
* Flags to control bpf_snprintf_btf() behaviour.
* - BTF_F_COMPACT: no formatting around type information
* - BTF_F_NONAME: no struct/union member names/types
* - BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
* equivalent to %px.
* - BTF_F_ZERO: show zero-valued struct/union members; they
* are not displayed by default
*/
enum {
BTF_F_COMPACT = (1ULL << 0),
BTF_F_NONAME = (1ULL << 1),
BTF_F_PTR_RAW = (1ULL << 2),
BTF_F_ZERO = (1ULL << 3),
};
#endif /* _UAPI__LINUX_BPF_H__ */