Add "fail_route_offload" flag to disallow offloading routes.
It is needed to test "offload failed" notifications.
Create the flag as part of nsim_fib_create() under the fib directory and set
it to false by default.
When FIB_EVENT_ENTRY_{REPLACE, APPEND} is triggered and the
"fail_route_offload" value is true, set the appropriate hardware flag to
make the kernel emit an RTM_NEWROUTE notification with the
RTM_F_OFFLOAD_FAILED flag.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initialize the dummy FIB offload module after debugfs, so that the FIB
module could create its own directory there.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch will add the ability to fail route offload, controlled by a
debugfs variable called "fail_route_offload".
If we vetoed the addition, we might get a delete or append notification
for a route we do not have. Therefore, do not warn if the route was not found.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.
A separate value is added for such notifications because there are fewer of
them, so they do not impact performance, and some users will find them more
important.
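The effect of the three sysctl values can be sketched in user-space C. This is a simplified model, not the kernel implementation: the enum, struct and function names are hypothetical, and it assumes that value 2 means "notify only when the offload_failed state changes".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names modelling the values of fib_notify_on_flag_change. */
enum ex_notify_mode {
	EX_NOTIFY_NEVER = 0,       /* do not emit notifications */
	EX_NOTIFY_ALL = 1,         /* emit on any hardware flag change */
	EX_NOTIFY_FAILED_ONLY = 2, /* emit only when offload_failed changes */
};

/* Hardware-related state of a route, as seen by the notification logic. */
struct ex_route_hw_flags {
	bool offload;
	bool trap;
	bool offload_failed;
};

/* Return true if a flag change from 'old' to 'now' should be notified. */
static bool ex_should_notify(enum ex_notify_mode mode,
			     struct ex_route_hw_flags old,
			     struct ex_route_hw_flags now)
{
	bool failed_changed = old.offload_failed != now.offload_failed;
	bool any_changed = failed_changed ||
			   old.offload != now.offload ||
			   old.trap != now.trap;

	assert(mode >= EX_NOTIFY_NEVER && mode <= EX_NOTIFY_FAILED_ONLY);

	switch (mode) {
	case EX_NOTIFY_ALL:
		return any_changed;
	case EX_NOTIFY_FAILED_ONLY:
		return failed_changed;
	default:
		return false;
	}
}
```

With value 2, a successful offload produces no notification in this model, while a failed one does, which is the low-volume case described above.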
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.
The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
To avoid such cases, a previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed; this behavior is controlled by a sysctl.
With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.
This patch adds an "offload_failed" indication to IPv6 routes, so that
users will have better visibility into the offload process.
'struct fib6_info' is extended with a new field that indicates whether route
offload failed. Note that the new field uses a previously unused bit, so
there is no need to increase the struct size.
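Why an unused bit costs no extra space can be illustrated with a small C sketch. The two structs below are hypothetical stand-ins, not the real layout of 'struct fib6_info'; they only assume that the flags are packed as bit-fields in one byte with spare bits left over.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout before the change: two flag bits, six spare bits. */
struct ex_flags_before {
	uint8_t offload:1;
	uint8_t trap:1;
	uint8_t unused:6;
};

/* Hypothetical layout after the change: the new bit is taken from 'unused'. */
struct ex_flags_after {
	uint8_t offload:1;
	uint8_t trap:1;
	uint8_t offload_failed:1;
	uint8_t unused:5;
};

/* The new indication fits in the spare bits, so the size is unchanged. */
static_assert(sizeof(struct ex_flags_before) == sizeof(struct ex_flags_after),
	      "adding offload_failed must not grow the struct");
```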
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the value '2' to 'fib_notify_on_flag_change' to allow sending
notifications only for failed route installation.
A separate value is added for such notifications because there are fewer of
them, so they do not impact performance, and some users will find them more
important.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After installing a route to the kernel, user space receives an
acknowledgment, which means the route was installed in the kernel, but not
necessarily in hardware.
The asynchronous nature of route installation in hardware can lead to a
routing daemon advertising a route before it was actually installed in
hardware. This can result in packet loss or mis-routed packets until the
route is installed in hardware.
To avoid such cases, a previous patch set added the ability to emit
RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
are changed; this behavior is controlled by a sysctl.
With the above mentioned behavior, it is possible to know from user-space
if the route was offloaded, but if the offload fails there is no indication
to user-space. Following a failure, a routing daemon will wait indefinitely
for a notification that will never come.
This patch adds an "offload_failed" indication to IPv4 routes, so that
users will have better visibility into the offload process.
'struct fib_alias' and 'struct fib_rt_info' are extended with a new field
that indicates whether route offload failed. Note that the new field uses a
previously unused bit, so there is no need to increase the size of the structs.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The flag indicates to user space that route offload failed.
A previous patch set added the ability to emit RTM_NEWROUTE notifications
whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, but if the offload
fails there is no indication to user space.
The flag will be used in subsequent patches by netdevsim and mlxsw to
indicate to user space that route offload failed, so that users will
have better visibility into the offload process.
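A routing daemon could interpret the flags of an RTM_NEWROUTE notification roughly as in the sketch below. The EX_ constants are local copies of the values I believe include/uapi/linux/rtnetlink.h uses; verify them against your headers. The enum and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of rtm_flags bits (assumed values, check rtnetlink.h). */
#define EX_RTM_F_OFFLOAD        0x4000U
#define EX_RTM_F_TRAP           0x8000U
#define EX_RTM_F_OFFLOAD_FAILED 0x20000000U

static_assert((EX_RTM_F_OFFLOAD_FAILED & (EX_RTM_F_OFFLOAD | EX_RTM_F_TRAP)) == 0,
	      "the failure bit must not overlap the other hardware bits");

/* Hypothetical classification of a route's hardware state. */
enum ex_hw_state {
	EX_HW_PENDING,   /* no hardware flag set yet, keep waiting */
	EX_HW_OFFLOADED, /* route is in hardware */
	EX_HW_TRAPPED,   /* route traps packets to the CPU */
	EX_HW_FAILED,    /* offload failed, stop waiting */
};

/* Map the rtm_flags of a notification to a hardware state. */
static enum ex_hw_state ex_classify_route(uint32_t rtm_flags)
{
	if (rtm_flags & EX_RTM_F_OFFLOAD_FAILED)
		return EX_HW_FAILED;
	if (rtm_flags & EX_RTM_F_OFFLOAD)
		return EX_HW_OFFLOADED;
	if (rtm_flags & EX_RTM_F_TRAP)
		return EX_HW_TRAPPED;
	return EX_HW_PENDING;
}
```

Without the failure bit, the EX_HW_FAILED case is indistinguishable from EX_HW_PENDING, which is why a daemon could otherwise wait indefinitely.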
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
qedr_gsi_post_send() has a debug output which prints the return value of
in_irq() and irqs_disabled().
The result of in_irq(), even if invoked from an interrupt handler, is
subject to change depending on the `threadirqs' command line switch. The
result of irqs_disabled() is always 1 because the function acquires a
spinlock_t with spin_lock_irqsave().
Remove in_irq() and irqs_disabled() from the debug output because they
provide little value.
Link: https://lore.kernel.org/r/20210208193347.383254-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch changes the type of init_send_wqe in rxe_verbs.c to void since
it always returns 0. It also separates out the code that copies inline
data into the send wqe as copy_inline_data_to_wqe().
Link: https://lore.kernel.org/r/20210206002437.2756-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
checkpatch -f found 3 warnings in RDMA/rxe
1. a missing space following switch
2. return followed by else
3. use of strlcpy() instead of strscpy().
This patch fixes each of these. In
	...
	} else if (...) {
		...
		return 0;
	} else
		...
the middle block can be safely moved since it is completely independent of
the other code.
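The "return followed by else" cleanup can be illustrated with a generic C sketch. The functions below are hypothetical and are not taken from the rxe driver; they only show that an independent returning branch can be hoisted out of an if/else chain without changing behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the middle branch returns, so the trailing else draws a warning. */
static int ex_handle_before(int kind, int *count)
{
	assert(count != NULL);

	if (kind == 0) {
		*count += 1;
	} else if (kind == 1) {
		return 0;
	} else {
		*count += 2;
	}
	return 1;
}

/* After: the independent returning block is moved ahead of the chain. */
static int ex_handle_after(int kind, int *count)
{
	assert(count != NULL);

	if (kind == 1)
		return 0;

	if (kind == 0)
		*count += 1;
	else
		*count += 2;
	return 1;
}
```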
Link: https://lore.kernel.org/r/20210205230525.49068-1-rpearson@hpe.com
Signed-off-by: Bob Pearson <rpearson@hpe.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Remove synchronize_srcu() from the ODP flow, as it was found to be a
very heavy time consumer as part of dereg_mr.
For example, de-registration of 10000 ODP MRs, each the size of a 2M
hugepage, took 19.6 sec, compared with 172 ms for de-registration of the
same number of non-ODP MRs.
The new locking scheme uses the wait_event() mechanism, which follows the
use count of the MR, instead of using synchronize_srcu().
With that change, the above test took 95 ms, which is even better than the
non-ODP flow.
With the SRCU usage fully dropped, a lock had to be added to protect the
XA access.
As part of using the above mechanism, the num_deferred_work accounting
could also be removed in favor of following the use count.
Link: https://lore.kernel.org/r/20210202071309.2057998-1-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add the futex test binary introduced by commit a4fd841465
("selftests/timens: Add a test for futex()") to .gitignore.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
The ice documentation has not been updated since the initial commits of the
driver. Update the documentation with features and information that are now
available.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently if the driver is unable to get all the MSI-X vectors it wants, it
falls back to the minimum configuration which equates to a single Tx/Rx
traffic queue pair. Instead of using the minimum configuration, if given
more vectors than the minimum, utilize those vectors for additional traffic
queues after accounting for other interrupts.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
This message indicates an error on close, not open.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Casting a void * rvalue in an assignment is unnecessary in C; remove the
casts.
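As a minimal illustration (the struct and function names are hypothetical, not from the ice driver), a void * returned by an allocator converts implicitly to any object pointer type in C:

```c
#include <assert.h>
#include <stdlib.h>

struct ex_item {
	int id;
};

/* Allocate an array of n zeroed items; the caller frees it with free(). */
static struct ex_item *ex_item_array_alloc(size_t n)
{
	struct ex_item *items;

	assert(n > 0);

	/* No cast: calloc() returns void *, which converts implicitly. */
	items = calloc(n, sizeof(*items));
	return items;
}
```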
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Refactor the DCB related variables out of the ice_port_info_struct. The
goal is to make the ice_port_info struct cleaner.
Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com>
Co-developed-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The writeback enable logic was incorrectly implemented (due to
misunderstanding what the side effects of the implementation would be
during polling).
Fix this logic issue, while implementing a new feature allowing the user
to control the writeback frequency using the knobs for controlling
interrupt throttling that we already have. Basically if you leave
adaptive interrupts enabled, the writeback frequency will be varied even
if busy_polling or if napi-poll is in use. If the interrupt rates are
set to a fixed value by ethtool -C and adaptive is off, the driver will
allow the user-set interrupt rate to guide how frequently the hardware
will complete descriptors to the driver.
Effectively, the user gets control over the hardware efficiency,
allowing the choice between immediate interrupts or interrupts delayed up
to a maximum of the interrupt rate, even when interrupts are disabled
during polling.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Co-developed-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The core clock frequency is currently hardcoded at 446 MHz for the RL
profile calculations. This causes issues since not all devices use that
clock frequency. Read the GLGEN_CLKSTAT_SRC register to determine which PSM
clock frequency is selected. This ensures that the rate limiter profile
calculations will be correct.
Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Create scheduler aggregator nodes and move VSIs into their respective
scheduler nodes. The maximum number of children per aggregator node is 64.
There are two types of aggregator nodes created:
1. dedicated node for PF and _CTRL VSIs
2. dedicated node(s) for VFs.
As part of reset and rebuild, aggregator nodes are recreated and VSIs
are moved to their respective aggregator nodes.
Having related VSIs in their respective trees avoids starvation between PF
and VF w.r.t. Tx bandwidth.
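The limit of 64 children per aggregator node determines how many VF aggregator nodes are needed. The sketch below is only arithmetic under the assumption that each VF VSI takes exactly one child slot; the macro and function names are hypothetical.

```c
#include <assert.h>

/* Limit stated in the commit message above. */
#define EX_MAX_CHILDREN_PER_AGG_NODE 64U

static_assert(EX_MAX_CHILDREN_PER_AGG_NODE > 0, "limit must be non-zero");

/* Number of aggregator nodes needed for num_vsis VSIs (rounds up). */
static unsigned int ex_agg_nodes_needed(unsigned int num_vsis)
{
	return (num_vsis + EX_MAX_CHILDREN_PER_AGG_NODE - 1) /
	       EX_MAX_CHILDREN_PER_AGG_NODE;
}
```

For example, under this assumption 256 VF VSIs would need four aggregator nodes, and 65 would need two.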
Co-developed-by: Tarun Singh <tarun.k.singh@intel.com>
Signed-off-by: Tarun Singh <tarun.k.singh@intel.com>
Co-developed-by: Victor Raj <victor.raj@intel.com>
Signed-off-by: Victor Raj <victor.raj@intel.com>
Co-developed-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add the framework and initial implementation for receiving and processing
netdev bonding events. This is only the software support and the
implementation of the HW offload for bonding support will be coming at a
later time. There are some architectural gaps that need to be closed
before that happens.
Because this is a software-only solution that supports in-kernel bonding,
SR-IOV is not supported with this implementation.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Current implementation of netdev already contains xsk_buff_pools.
We no longer have to contain these structures in ice_vsi.
Refactor the code to operate on netdev-provided xsk_buff_pools.
Move scheduling napi on each queue to a separate function to
simplify setup function.
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
There is an issue with some NVMs where an already existent LLDP
filter is blocking the creation of a filter to allow LLDP packets
to be redirected to the default VSI for the interface. This is
blocking all LLDP functionality based in the kernel when the FW
LLDP agent is disabled (e.g. software based DCBx).
Implement the new AQ command to allow adding VSI destinations to
existent filters on NVM versions that support the new command.
The new lldp_fltr_ctrl AQ command supports Rx filters only, so the
code flow for adding filters to disable Tx of control frames will
remain intact.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Currently there is no message printed on the host when a VF goes in and
out of promiscuous mode. This causes confusion because, based on i40e,
such a message is the expected behavior. Fix this.
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
There is no need to use a for loop to assign values for an array of cmd
descriptors which has only two elements.
Link: https://lore.kernel.org/r/1612517974-31867-13-git-send-email-liweihang@huawei.com
Signed-off-by: Xinhao Liu <liuxinhao5@hisilicon.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The hns driver wraps the consumer index of the AEQ and CEQ around when it
reaches twice the number of queue entries, as part of the owner mechanism.
This is unnecessary, since the hardware itself masks the index before
use.
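Why the software wrap-around is redundant can be checked with a small model. It assumes that the number of entries is a power of two and that the hardware masks the index with (2 * entries - 1); both are assumptions for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Old style: software wraps the index at twice the number of entries. */
static uint32_t ex_ci_next_wrapped(uint32_t ci, uint32_t entries)
{
	assert(entries != 0 && (entries & (entries - 1)) == 0);

	ci++;
	if (ci == 2 * entries)
		ci = 0;
	return ci;
}

/* New style: the index is free running and never wrapped in software. */
static uint32_t ex_ci_next_free_running(uint32_t ci)
{
	return ci + 1;
}

/* Assumed hardware behavior: mask the index before using it. */
static uint32_t ex_hw_masked(uint32_t ci, uint32_t entries)
{
	return ci & (2 * entries - 1);
}
```

Under these assumptions the masked free-running index always equals the software-wrapped one, so the hardware sees the same value either way.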
Link: https://lore.kernel.org/r/1612517974-31867-12-git-send-email-liweihang@huawei.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
All fields of the WQE will be rewritten, so the memset is unnecessary. In
addition, when the SQ is working in OWNER mode, the pipeline may prefetch
WQEs beyond PI, and the memset operation may flip the owner bit too early,
so the pipeline may get a wrong WQ.
Link: https://lore.kernel.org/r/1612517974-31867-11-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Use macros instead of magic numbers to represent shift of dma_handle_wqe,
dma_handle_idx and UDP destination port number of RoCEv2.
Link: https://lore.kernel.org/r/1612517974-31867-10-git-send-email-liweihang@huawei.com
Signed-off-by: Xinhao Liu <liuxinhao5@hisilicon.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
hns_roce_device.h is not specific to hardware, but some definitions in it
are only used for HIP06; they should be moved into hns_roce_hw_v1.h.
Link: https://lore.kernel.org/r/1612517974-31867-9-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, the driver updates the doorbell like this:
	post()
	{
		wqe.field = 0x111;
		wmb();
		update_wq_db();
	}
	update_wq_db()
	{
		db.field = 0x222;
		__raw_writeq(db, db_reg);
	}
writeq() is a better choice than __raw_writeq() because it calls dma_wmb()
as its barrier on ARM64, and dma_wmb() is better than wmb() for the ROCEE
device. This patch removes all wmb() calls before updating the doorbell of
SQ/RQ/CQ/SRQ by replacing __raw_writeq() with writeq() to improve
performance. The new process looks like this:
	post()
	{
		wqe.field = 0x111;
		update_wq_db();
	}
	update_wq_db()
	{
		db.field = 0x222;
		writeq(db, db_reg);
	}
Link: https://lore.kernel.org/r/1612517974-31867-8-git-send-email-liweihang@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Horatiu Vultur says:
====================
bridge: mrp: Fix br_mrp_port_switchdev_set_state
Based on the discussion here[1], there was a problem with the function
br_mrp_port_switchdev_set_state: it was called with both BR_STATE* and
BR_MRP_PORT_STATE* types. This patch series fixes this issue and removes
SWITCHDEV_ATTR_ID_MRP_PORT_STAT because it is not used anymore.
[1] https://www.spinics.net/lists/netdev/msg714816.html
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that MRP also uses SWITCHDEV_ATTR_ID_PORT_STP_STATE to notify HW,
SWITCHDEV_ATTR_ID_MRP_PORT_STAT is not used anywhere else; therefore we
can remove it.
Fixes: c284b54590 ("switchdev: mrp: Extend switchdev API to offload MRP")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function br_mrp_port_switchdev_set_state was called both with MRP
port state and STP port state, which is an issue because they don't
match exactly.
Therefore, update the function to be used only with STP port state and
use the id SWITCHDEV_ATTR_ID_PORT_STP_STATE.
STP is chosen over MRP because the drivers already implement
SWITCHDEV_ATTR_ID_PORT_STP_STATE and the port STP state is already
updated in SW.
Fixes: 9a9f26e8f7 ("bridge: mrp: Connect MRP API with the switchdev API")
Fixes: fadd409136 ("bridge: switchdev: mrp: Implement MRP API for switchdev")
Fixes: 2f1a11ae11 ("bridge: mrp: Add MRP interface.")
Reported-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prevent netif_tx_disable() running concurrently with dev_watchdog() by
taking the device global xmit lock. Otherwise, the recommended:
netif_carrier_off(dev);
netif_tx_disable(dev);
driver shutdown sequence can happen after the watchdog has already
checked carrier, resulting in possible false alarms. This is because
netif_tx_lock() only sets the frozen bit without maintaining the locks
on the individual queues.
Fixes: c3f26a269c ("netdev: Fix lockdep warnings in multiqueue configurations.")
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This feature should only be enabled by querying capability from firmware.
Fixes: ba6bb7e974 ("RDMA/hns: Add interfaces to get pf capabilities from firmware")
Link: https://lore.kernel.org/r/1612517974-31867-5-git-send-email-liweihang@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add the mapped page count checking flow to avoid invalid page size when
creating MTR.
Fixes: 38389eaa4d ("RDMA/hns: Add mtr support for mixed multihop addressing")
Link: https://lore.kernel.org/r/1612517974-31867-4-git-send-email-liweihang@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This bit should be of type enum ib_sig_type, or there will be a sparse
warning.
Fixes: bfe860351e ("RDMA/hns: Fix cast from or to restricted __le32 for driver")
Link: https://lore.kernel.org/r/1612517974-31867-3-git-send-email-liweihang@huawei.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ULPs usually set IB(V)_QP_AV when modifying a QP to RTR if they want
to record the sgid index in the QPC. For UD QPs, this is useless because
the index is included in the WQE. For RC QPs, it is filled in by
hns_roce_set_path(). So the sgid index shouldn't be filled by default.
hns_get_gid_index() is then moved to hns_roce_hw_v1.c because it is only
called there.
Fixes: 926a01dc00 ("RDMA/hns: Add QP operations support for hip08 SoC")
Link: https://lore.kernel.org/r/1612517974-31867-2-git-send-email-liweihang@huawei.com
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Direct WQE is a mechanism to fill a WQE directly into the hardware. Under
light load, the WQE is written into the PCIe BAR space of the hardware,
which saves one memory access operation and therefore reduces the
latency.
Link: https://lore.kernel.org/r/1611997513-27107-1-git-send-email-liweihang@huawei.com
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Vlad Buslov says:
=================
Implement support for VF tunneling
Abstract
Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.
SW TC model
From TC perspective VF tunnel configuration requires two rules in both
directions:
TX rules
1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:
$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 16:c9:a0:2d:69:2c
src_mac 0c:42:a1:58:ab:e4
eth_type ipv4
ip_flags nofrag
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
index 3 ref 1 bind 1 installed 377 sec used 0 sec
Action statistics:
Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 114096 bytes 952 pkt
backlog 0b 0p requeues 0
cookie 878fa48d8c423fc08c3b6ca599b50a97
no_percpu
used_hw_stats delayed
2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
RX rules
1. Rule that encapsulates the tunneled flow and redirects packets from
source VF rep to tunnel device:
$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 0a:40:bd:30:89:99
src_mac ca:2e:a7:3f:f5:0f
eth_type ipv4
ip_tos 0/0x3
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 7.7.7.5
dst_ip 7.7.7.1
key_id 98
dst_port 4789
nocsum
ttl 64 pipe
index 1 ref 1 bind 1 installed 411 sec used 411 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
no_percpu
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
index 1 ref 1 bind 1 installed 411 sec used 0 sec
Action statistics:
Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 5615833 bytes 4028 pkt
backlog 0b 0p requeues 0
cookie bb406d45d343bf7ade9690ae80c7cba4
no_percpu
used_hw_stats delayed
2. Rule that redirects from tunnel device to UL rep:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
HW offloads model
For hardware offload the goal is to match the packet on both rules without exposing
it to software on tunnel endpoint VF. In order to achieve this for tx, TC
implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch
with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification
rule to overwrite packet source port to the value of tunnel VF. Eswitch code is
modified to recirculate such packets after source port value is changed, which
allows second tx rules to match.
For rx path indirect table infrastructure is used to allow fully processing VF
tunnel traffic in hardware. To implement such pipeline driver needs to program
the hardware after matching on UL rule to overwrite source vport from UL to
tunnel VF and recirculate the packet to the root table to allow matching on the
rule installed on tunnel VF. For this, indirect table matches all encapsulated
traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by
the miss rule. Such configuration will cause packet to appear on VF representor
instead of VF itself if packet has been matched by indirect table rule based on
tunnel parameters but missed on second rule (after recirculation). Handle such
case by marking packets processed by indirect table with special 0xFFF value in
reg_c1 and extending slow table with additional flow group that matches on
reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF
mark). When creating offloads fdb tables, install one rule per VF vport to match
on recirculated miss packets and redirect them to appropriate VF vport.
Routing events
In order to support routing changes and migration of tunnel device between
different endpoint VFs, implement routing infrastructure and update it with FIB
events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
rule is attached to a routing entry, which is shared for rules of same tunnel.
On FIB event the work is scheduled to delete/recreate all rules of affected
tunnel.
Note: only vxlan tunnel type is supported by this series.
=================
Merge tag 'mlx5-updates-2021-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2021-02-04
Vlad Buslov says:
=================
Implement support for VF tunneling
Abstract
Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.
SW TC model
From TC perspective VF tunnel configuration requires two rules in both
directions:
TX rules
1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:
$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 16:c9:a0:2d:69:2c
src_mac 0c:42:a1:58:ab:e4
eth_type ipv4
ip_flags nofrag
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
index 3 ref 1 bind 1 installed 377 sec used 0 sec
Action statistics:
Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 114096 bytes 952 pkt
backlog 0b 0p requeues 0
cookie 878fa48d8c423fc08c3b6ca599b50a97
no_percpu
used_hw_stats delayed
2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
RX rules
1. Rule that encapsulates the tunneled flow and redirects packets from
source VF rep to tunnel device:
$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 0a:40:bd:30:89:99
src_mac ca:2e:a7:3f:f5:0f
eth_type ipv4
ip_tos 0/0x3
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 7.7.7.5
dst_ip 7.7.7.1
key_id 98
dst_port 4789
nocsum
ttl 64 pipe
index 1 ref 1 bind 1 installed 411 sec used 411 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
no_percpu
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
index 1 ref 1 bind 1 installed 411 sec used 0 sec
Action statistics:
Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 5615833 bytes 4028 pkt
backlog 0b 0p requeues 0
cookie bb406d45d343bf7ade9690ae80c7cba4
no_percpu
used_hw_stats delayed
2. Rule that redirects from tunnel device to UL rep:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
HW offloads model
For hardware offload the goal is to match the packet on both rules without
exposing it to software on the tunnel endpoint VF. To achieve this for tx, the
TC implementation marks encap rules that have a tunnel endpoint on an mlx5 VF
of the same eswitch with the MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and
adds a header modification rule to overwrite the packet source port with the
value of the tunnel VF. Eswitch code is modified to recirculate such packets
after the source port value is changed, which allows the second tx rule to
match.
For the rx path, indirect table infrastructure is used to allow fully
processing VF tunnel traffic in hardware. To implement such a pipeline, the
driver needs to program the hardware, after matching on the UL rule, to
overwrite the source vport from UL to the tunnel VF and recirculate the packet
to the root table to allow matching on the rule installed on the tunnel VF.
For this, the indirect table matches all encapsulated traffic by tunnel
parameters, and all other IP traffic is sent to the tunnel VF by the miss
rule. Such a configuration will cause a packet to appear on the VF representor
instead of the VF itself if the packet has been matched by the indirect table
rule based on tunnel parameters but missed on the second rule (after
recirculation). Handle this case by marking packets processed by the indirect
table with the special 0xFFF value in reg_c1 and extending the slow table with
an additional flow group that matches on reg_c0 (source port value set by
indirect tables) and reg_c1 (special 0xFFF mark). When creating offloads fdb
tables, install one rule per VF vport to match on recirculated miss packets
and redirect them to the appropriate VF vport.
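The miss handling described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: the struct, the function
name and the flat vport array are hypothetical and are not mlx5 driver code.
Only the 0xFFF mark in reg_c1 and the use of reg_c0 as the source port come
from the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Mark written to reg_c1 by the indirect table (value taken from the text). */
#define SKETCH_INDIR_TABLE_MARK 0xFFFu

/* Hypothetical model of the two metadata registers carried by a packet. */
struct sketch_pkt_regs {
	uint32_t reg_c0; /* source port, rewritten to the tunnel VF vport */
	uint32_t reg_c1; /* holds the mark after the indirect table matched */
};

/*
 * Model of the extra slow-table flow group: one rule per VF vport that
 * matches recirculated miss packets. Returns the VF vport to redirect to,
 * or -1 when no rule matches and the packet takes the regular slow path.
 */
int sketch_slow_table_miss_vport(const struct sketch_pkt_regs *regs,
				 const uint32_t *vf_vports, size_t n_vports)
{
	size_t i;

	/* Packets that never went through the indirect table are not ours. */
	if (regs->reg_c1 != SKETCH_INDIR_TABLE_MARK)
		return -1;

	/* One rule per VF vport: match on the rewritten source port. */
	for (i = 0; i < n_vports; i++)
		if (vf_vports[i] == regs->reg_c0)
			return (int)vf_vports[i];

	return -1;
}
```

In this model a marked packet whose reg_c0 equals a known VF vport is sent to
that VF, while an unmarked packet falls through to the existing slow path.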
Routing events
To support routing changes and migration of the tunnel device between
different endpoint VFs, implement routing infrastructure and update it on FIB
events. A routing entry table is introduced to mlx5 TC. Every rx and tx VF
tunnel rule is attached to a routing entry, which is shared by all rules of
the same tunnel. On a FIB event, work is scheduled to delete and recreate all
rules of the affected tunnel.
Note: only the vxlan tunnel type is supported by this series.
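The relation between rules and routing entries can be sketched as a toy
user-space model. All names here are hypothetical and the update runs
synchronously, whereas the description says the real work is scheduled; the
sketch only shows that one routing entry fans out to every rule of its tunnel.

```c
#include <stddef.h>

#define SKETCH_MAX_RULES 8

/* Hypothetical stand-in for one offloaded rx or tx VF tunnel rule. */
struct sketch_rule {
	int offloaded;           /* 1 while the rule is installed */
	unsigned int generation; /* bumped every time the rule is recreated */
};

/* Hypothetical routing entry shared by all rules of one tunnel. */
struct sketch_route_entry {
	int tunnel_id;
	struct sketch_rule *rules[SKETCH_MAX_RULES];
	size_t n_rules;
};

/* Attach a rule to the routing entry of its tunnel; -1 if the entry is full. */
int sketch_route_attach(struct sketch_route_entry *e, struct sketch_rule *r)
{
	if (e->n_rules == SKETCH_MAX_RULES)
		return -1;
	e->rules[e->n_rules++] = r;
	r->offloaded = 1;
	return 0;
}

/*
 * Model of the work run for a FIB event: delete and recreate every rule
 * attached to the entry if the event affects its tunnel. Returns the number
 * of rules that were recreated.
 */
size_t sketch_route_fib_event(struct sketch_route_entry *e, int tunnel_id)
{
	size_t i;

	if (e->tunnel_id != tunnel_id)
		return 0;

	for (i = 0; i < e->n_rules; i++) {
		e->rules[i]->offloaded = 0; /* delete the stale rule */
		e->rules[i]->generation++;
		e->rules[i]->offloaded = 1; /* recreate it for the new route */
	}
	return e->n_rules;
}
```

An event for an unrelated tunnel leaves the attached rules untouched, which is
the point of keying the work on the shared routing entry.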
=================
It is likely that this is a leftover from T3 driver heritage. cxgb4 uses
the PCI core VPD access code that handles detection of VPD capabilities.
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages()
builds such a restricted bio using __bio_iov_append_get_pages() if
bio_op(bio) == REQ_OP_ZONE_APPEND.
To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap users can set the flag to indicate that they want
REQ_OP_ZONE_APPEND and a restricted bio.
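The ordering requirement can be shown with a user-space sketch of the op
selection. The enum and flag below are stand-ins with made-up values, not the
kernel definitions, and the function name is hypothetical; the only point is
that the operation is derived from the iomap flag before any pages are
collected.

```c
#include <stdbool.h>

/* Stand-ins for the kernel's request ops; the values are assumptions. */
enum sketch_req_op {
	SKETCH_OP_READ,
	SKETCH_OP_WRITE,
	SKETCH_OP_ZONE_APPEND,
};

/* Stand-in for IOMAP_F_ZONE_APPEND; the bit position is an assumption. */
#define SKETCH_IOMAP_F_ZONE_APPEND (1u << 0)

/*
 * Pick the bio operation from the direction and the iomap flags. The caller
 * would store the result in the bio before collecting pages, so that the
 * page-collection step can apply the zone append restrictions.
 */
enum sketch_req_op sketch_pick_bio_op(bool is_write, unsigned int iomap_flags)
{
	if (!is_write)
		return SKETCH_OP_READ;
	if (iomap_flags & SKETCH_IOMAP_F_ZONE_APPEND)
		return SKETCH_OP_ZONE_APPEND;
	return SKETCH_OP_WRITE;
}
```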
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Restore the original behaviour where users are allowed to add an element
with any stateful expression if the set definition specifies no stateful
expressions. Make sure the upper limit on the number of stateful
expressions, NFT_SET_EXPR_MAX, is not exceeded.
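The intended check can be sketched in user space as follows. This is a
simplified model, not the nf_tables code: the limit value, the verdict enum
and the function name are assumptions, and the mismatch rule for sets that do
define expressions is an assumption of the sketch that this message does not
spell out.

```c
/* Assumption: stand-in for NFT_SET_EXPR_MAX; the real value may differ. */
#define SKETCH_SET_EXPR_MAX 2

enum sketch_verdict {
	SKETCH_OK,
	SKETCH_TOO_MANY, /* element carries more expressions than the limit */
	SKETCH_MISMATCH, /* element disagrees with the set definition */
};

/*
 * set_exprs:  number of stateful expressions in the set definition
 * elem_exprs: number of stateful expressions given with the new element
 */
enum sketch_verdict sketch_check_elem_exprs(unsigned int set_exprs,
					    unsigned int elem_exprs)
{
	/* The upper limit applies whatever the set definition says. */
	if (elem_exprs > SKETCH_SET_EXPR_MAX)
		return SKETCH_TOO_MANY;

	/* No expressions in the set definition: any expression is allowed. */
	if (set_exprs == 0)
		return SKETCH_OK;

	/* Assumption: otherwise the element must match the set's count. */
	if (elem_exprs != 0 && elem_exprs != set_exprs)
		return SKETCH_MISMATCH;

	return SKETCH_OK;
}
```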
Fixes: 8cfd9b0f85 ("netfilter: nftables: generalize set expressions support")
Fixes: 48b0ae046e ("netfilter: nftables: netlink support for several set element expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Only older versions of the RISC-V GCC toolchain define __riscv__. Check
for __riscv as well, which is used by newer GCC toolchains. Also set
VDSO_32BIT based on __riscv_xlen.
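The macro checks can be sketched as below. The SKETCH_* macro names and the
helper function are hypothetical and exist only for illustration; the tested
predefined macros (__riscv__, __riscv and __riscv_xlen) are the ones named
above.

```c
/* Accept both the older (__riscv__) and the newer (__riscv) spelling. */
#if defined(__riscv__) || defined(__riscv)
#define SKETCH_IS_RISCV 1
/* __riscv_xlen is the register width in bits: 32 on rv32, 64 on rv64. */
#if __riscv_xlen == 32
#define SKETCH_VDSO_32BIT 1
#endif
#else
#define SKETCH_IS_RISCV 0
#endif

#ifndef SKETCH_VDSO_32BIT
#define SKETCH_VDSO_32BIT 0
#endif

/* Report which ABI variant the preprocessor checks selected. */
const char *sketch_vdso_abi(void)
{
	if (!SKETCH_IS_RISCV)
		return "not riscv";
	return SKETCH_VDSO_32BIT ? "riscv32" : "riscv64";
}
```

A toolchain that defines only __riscv__ and one that defines only __riscv both
take the RISC-V branch, which is what the fix relies on.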
Before (on riscv64):
$ ./vdso_test_abi
[vDSO kselftest] VDSO_VERSION: LINUX_4
Could not find __vdso_gettimeofday
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_REALTIME [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_BOOTTIME [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_TAI [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_REALTIME_COARSE [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_MONOTONIC [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_MONOTONIC_RAW [PASS]
Could not find __vdso_clock_gettime
Could not find __vdso_clock_getres
clock_id: CLOCK_MONOTONIC_COARSE [PASS]
Could not find __vdso_time
After (on riscv32):
$ ./vdso_test_abi
[vDSO kselftest] VDSO_VERSION: LINUX_4.15
The time is 1612449376.015086
The time is 1612449376.18340784
The resolution is 0 1
clock_id: CLOCK_REALTIME [PASS]
The time is 774.842586182
The resolution is 0 1
clock_id: CLOCK_BOOTTIME [PASS]
The time is 1612449376.22536565
The resolution is 0 1
clock_id: CLOCK_TAI [PASS]
The time is 1612449376.20885172
The resolution is 0 4000000
clock_id: CLOCK_REALTIME_COARSE [PASS]
The time is 774.845491269
The resolution is 0 1
clock_id: CLOCK_MONOTONIC [PASS]
The time is 774.849534200
The resolution is 0 1
clock_id: CLOCK_MONOTONIC_RAW [PASS]
The time is 774.842139684
The resolution is 0 4000000
clock_id: CLOCK_MONOTONIC_COARSE [PASS]
Could not find __vdso_time
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>