Add support for offloading of HSR/PRP (IEC 62439-3) tag insertion
tag removal, duplicate generation and forwarding.
For HSR, insertion involves the switch adding a 6 byte HSR header after
the 14 byte Ethernet header. For PRP it adds a 6 byte trailer.
Tag removal involves automatically stripping the HSR/PRP header/trailer
in the switch. This is possible when the switch also performs auto
deduplication using the HSR/PRP header/trailer (making it no longer
required).
Forwarding involves automatically forwarding between redundant ports in
an HSR. This is crucial because delay is accumulated as a frame passes
through each node in the ring.
Duplication involves the switch automatically sending a single frame
from the CPU port to both redundant ports. This is required because the
inserted HSR/PRP header/trailer must contain the same sequence number
on the frames sent out both redundant ports.
Export is_hsr_master so DSA can tell them apart from other devices in
dsa_slave_changeupper.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the moment, PORT_MII is reported in the ethtool ops. This is odd
because it is an interface between the MAC and the PHY and no external
port. Some network card drivers will overwrite the port to twisted pair
or fiber, though. Even worse, the MDI/MDIX setting is only used by
ethtool if the port is twisted pair.
Set the port to PORT_TP by default because most PHY drivers are copper
ones. If there is fibre support and it is enabled, the PHY driver will
set it to PORT_FIBRE.
This will change reporting PORT_MII to either PORT_TP or PORT_FIBRE;
except for the genphy fallback driver.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes the following W=1 kernel build warning(s):
drivers/clk/spear/spear1310_clock.c:385:13: warning: no previous prototype for ‘spear1310_clk_init’ [-Wmissing-prototypes]
drivers/clk/spear/spear1340_clock.c:442:13: warning: no previous prototype for ‘spear1340_clk_init’ [-Wmissing-prototypes]
Cc: Viresh Kumar <vireshk@kernel.org>
Cc: Shiraz Hashim <shiraz.linux.kernel@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Rajeev Kumar <rajeev-dlh.kumar@st.com>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Link: https://lore.kernel.org/r/20210126124540.3320214-20-lee.jones@linaro.org
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
spi_mem_default_supports_op() rejects DTR ops by default to ensure that
the controller drivers that haven't been updated with DTR support
continue to reject them. It also makes sure that controllers that don't
support DTR mode at all (which is most of them at the moment) also
reject them.
This means that controller drivers that want to support DTR mode can't
use spi_mem_default_supports_op(). Driver authors have to roll their own
supports_op() function and mimic the buswidth checks. See
spi-cadence-quadspi.c for example. Or even worse, driver authors might
skip it completely or get it wrong.
Add spi_mem_dtr_supports_op(). It provides a basic sanity check for DTR
ops and performs the buswidth requirement check. Move the logic for
checking buswidth in spi_mem_default_supports_op() to a separate
function so the logic is not repeated twice.
Signed-off-by: Pratyush Yadav <p.yadav@ti.com>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20210204141218.32229-1-p.yadav@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Add per-program counter for number of times recursion prevention mechanism
was triggered and expose it via show_fdinfo and bpf_prog_info.
Teach bpftool to print it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-7-alexei.starovoitov@gmail.com
Since both sleepable and non-sleepable programs execute under migrate_disable
add recursion prevention mechanism to both types of programs when they're
executed via bpf trampoline.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-5-alexei.starovoitov@gmail.com
Since sleepable programs don't migrate from the cpu the excution stats can be
computed for them as well. Reuse the same infrastructure for both sleepable and
non-sleepable programs.
run_cnt -> the number of times the program was executed.
run_time_ns -> the program execution time in nanoseconds including the
off-cpu time when the program was sleeping.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-4-alexei.starovoitov@gmail.com
Move bpf_prog_stats from prog->aux into prog to avoid one extra load
in critical path of program execution.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210210033634.62081-2-alexei.starovoitov@gmail.com
The system would deadlock when swapping to a dm-crypt device. The reason
is that for each incoming write bio, dm-crypt allocates memory that holds
encrypted data. These excessive allocations exhaust all the memory and the
result is either deadlock or OOM trigger.
This patch limits the number of in-flight swap bios, so that the memory
consumed by dm-crypt is limited. The limit is enforced if the target set
the "limit_swap_bios" variable and if the bio has REQ_SWAP set.
Non-swap bios are not affected becuase taking the semaphore would cause
performance degradation.
This is similar to request-based drivers - they will also block when the
number of requests is over the limit.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Allow removal of CONFIG_BLK_DEV_ZONED conditionals in target_type
definition of various targets.
Suggested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Update the device-mapper core to support exposing the inline crypto
support of the underlying device(s) through the device-mapper device.
This works by creating a "passthrough keyslot manager" for the dm
device, which declares support for encryption settings which all
underlying devices support. When a supported setting is used, the bio
cloning code handles cloning the crypto context to the bios for all the
underlying devices. When an unsupported setting is used, the blk-crypto
fallback is used as usual.
Crypto support on each underlying device is ignored unless the
corresponding dm target opts into exposing it. This is needed because
for inline crypto to semantically operate on the original bio, the data
must not be transformed by the dm target. Thus, targets like dm-linear
can expose crypto support of the underlying device, but targets like
dm-crypt can't. (dm-crypt could use inline crypto itself, though.)
A DM device's table can only be changed if the "new" inline encryption
capabilities are a (*not* necessarily strict) superset of the "old" inline
encryption capabilities. Attempts to make changes to the table that result
in some inline encryption capability becoming no longer supported will be
rejected.
For the sake of clarity, key eviction from underlying devices will be
handled in a future patch.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce blk_ksm_update_capabilities() to update the capabilities of
a keyslot manager (ksm) in-place. The pointer to a ksm in a device's
request queue may not be easily replaced, because upper layers like
the filesystem might access it (e.g. for programming keys/checking
capabilities) at the same time the device wants to replace that
request queue's ksm (and free the old ksm's memory). This function
allows the device to update the capabilities of the ksm in its request
queue directly. Devices can safely update the ksm this way without any
synchronization with upper layers *only* if the updated (new) ksm
continues to support all the crypto capabilities that the old ksm did
(see description below for blk_ksm_is_superset() for why this is so).
Also introduce blk_ksm_is_superset() which checks whether one ksm's
capabilities are a (not necessarily strict) superset of another ksm's.
The blk-crypto framework requires that crypto capabilities that were
advertised when a bio was created continue to be supported by the
device until that bio is ended - in practice this probably means that
a device's advertised crypto capabilities can *never* "shrink" (since
there's no synchronization between bio creation and when a device may
want to change its advertised capabilities) - so a previously
advertised crypto capability must always continue to be supported.
This function can be used to check that a new ksm is a valid
replacement for an old ksm.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The device mapper may map over devices that have inline encryption
capabilities, and to make use of those capabilities, the DM device must
itself advertise those inline encryption capabilities. One way to do this
would be to have the DM device set up a keyslot manager with a
"sufficiently large" number of keyslots, but that would use a lot of
memory. Also, the DM device itself has no "keyslots", and it doesn't make
much sense to talk about "programming a key into a DM device's keyslot
manager", so all that extra memory used to represent those keyslots is just
wasted. All a DM device really needs to be able to do is advertise the
crypto capabilities of the underlying devices in a coherent manner and
expose a way to evict keys from the underlying devices.
There are also devices with inline encryption hardware that do not
have a limited number of keyslots. One can send a raw encryption key along
with a bio to these devices (as opposed to typical inline encryption
hardware that require users to first program a raw encryption key into a
keyslot, and send the index of that keyslot along with the bio). These
devices also only need the same things from the keyslot manager that DM
devices need - a way to advertise crypto capabilities and potentially a way
to expose a function to evict keys from hardware.
So we introduce a "passthrough" keyslot manager that provides a way to
represent a keyslot manager that doesn't have just a limited number of
keyslots, and for which do not require keys to be programmed into keyslots.
DM devices can set up a passthrough keyslot manager in their request
queues, and advertise appropriate crypto capabilities based on those of the
underlying devices. Blk-crypto does not attempt to program keys into any
keyslots in the passthrough keyslot manager. Instead, if/when the bio is
resubmitted to the underlying device, blk-crypto will try to program the
key into the underlying device's keyslot manager.
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
By default the PCA9450 doesn't handle the assertion of the WDOG_B
signal, but this is required to guarantee that things like software
resets triggered by the watchdog work reliably.
As we don't want to rely on the bootloader to enable this, we tell
the PMIC to issue a cold reset in case the WDOG_B signal is
asserted (WDOG_B_CFG = 10), just as the NXP U-Boot code does.
Signed-off-by: Frieder Schrempf <frieder.schrempf@kontron.de>
Link: https://lore.kernel.org/r/20210211105534.38972-3-frieder.schrempf@kontron.de
Signed-off-by: Mark Brown <broonie@kernel.org>
To the very best of my knowledge there has never been any in-tree
code that calls this function. It exists largely to support an
out-of-tree driver that provides kgdb-over-ethernet using the
netpoll API.
kgdboe has been out-of-tree for more than 10 years and I don't
recall any serious attempt to upstream it at any point in the last
five. At this stage it looks better to stop carrying this code in
the kernel and integrate the code into the out-of-tree driver
instead.
The long term trajectory for the kernel looks likely to include
effort to remove or reduce the use of tasklets (something that has
also been true for the last 10 years). Thus the main real reason
for this patch is to make explicit that the in-tree kgdb features
do not require tasklets.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210210142525.2876648-1-daniel.thompson@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Jason Wessel <jason.wessel@windriver.com>
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].
Refactor the code according to the use of a flexible-array member in
struct icmsg_negotiate, instead of a one-element array.
Also, this helps the ongoing efforts to enable -Warray-bounds and fix the
following warnings:
drivers/hv/channel_mgmt.c:315:23: warning: array subscript 1 is above array bounds of ‘struct ic_version[1]’ [-Warray-bounds]
drivers/hv/channel_mgmt.c:316:23: warning: array subscript 1 is above array bounds of ‘struct ic_version[1]’ [-Warray-bounds]
[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays
Link: https://github.com/KSPP/linux/issues/79
Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210201174334.GA171933@embeddedor
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Only the VSCs or ICs that have been hardened and that are critical for
the successful adoption of Confidential VMs should be allowed if the
guest is running isolated. This change reduces the footprint of the
code that will be exercised by Confidential VMs and hence the exposure
to bugs and vulnerabilities.
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210201144814.2701-3-parri.andrea@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Device list is not stored in mlx5_priv anymore, so delete it as it's not
used.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in ima_add_kexec_buffer() function. This buffer is not freed before
completing the kexec system call resulting in memory leak.
Add ima_buffer field in "struct kimage" to store the virtual address
of the buffer allocated for the IMA measurement list.
Free the memory allocated for the IMA measurement list in
kimage_file_post_load_cleanup() function.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Fixes: 7b8589cc29 ("ima: on soft reboot, save the measurement list")
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Pull networking fixes from David Miller:
"Another pile of networing fixes:
1) ath9k build error fix from Arnd Bergmann
2) dma memory leak fix in mediatec driver from Lorenzo Bianconi.
3) bpf int3 kprobe fix from Alexei Starovoitov.
4) bpf stackmap integer overflow fix from Bui Quang Minh.
5) Add usb device ids for Cinterion MV31 to qmi_qwwan driver, from
Christoph Schemmel.
6) Don't update deleted entry in xt_recent netfilter module, from
Jazsef Kadlecsik.
7) Use after free in nftables, fix from Pablo Neira Ayuso.
8) Header checksum fix in flowtable from Sven Auhagen.
9) Validate user controlled length in qrtr code, from Sabyrzhan
Tasbolatov.
10) Fix race in xen/netback, from Juergen Gross,
11) New device ID in cxgb4, from Raju Rangoju.
12) Fix ring locking in rxrpc release call, from David Howells.
13) Don't return LAPB error codes from x25_open(), from Xie He.
14) Missing error returns in gsi_channel_setup() from Alex Elder.
15) Get skb_copy_and_csum_datagram working properly with odd segment
sizes, from Willem de Bruijn.
16) Missing RFS/RSS table init in enetc driver, from Vladimir Oltean.
17) Do teardown on probe failure in DSA, from Vladimir Oltean.
18) Fix compilation failures of txtimestamp selftest, from Vadim
Fedorenko.
19) Limit rx per-napi gro queue size to fix latency regression, from
Eric Dumazet.
20) dpaa_eth xdp fixes from Camelia Groza.
21) Missing txq mode update when switching CBS off, in stmmac driver,
from Mohammad Athari Bin Ismail.
22) Failover pending logic fix in ibmvnic driver, from Sukadev
Bhattiprolu.
23) Null deref fix in vmw_vsock, from Norbert Slusarek.
24) Missing verdict update in xdp paths of ena driver, from Shay
Agroskin.
25) seq_file iteration fix in sctp from Neil Brown.
26) bpf 32-bit src register truncation fix on div/mod, from Daniel
Borkmann.
27) Fix jmp32 pruning in bpf verifier, from Daniel Borkmann.
28) Fix locking in vsock_shutdown(), from Stefano Garzarella.
29) Various missing index bound checks in hns3 driver, from Yufeng Mo.
30) Flush ports on .phylink_mac_link_down() in dsa felix driver, from
Vladimir Oltean.
31) Don't mix up stp and mrp port states in bridge layer, from Horatiu
Vultur.
32) Fix locking during netif_tx_disable(), from Edwin Peer"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
bpf: Fix 32 bit src register truncation on div/mod
bpf: Fix verifier jmp32 pruning decision logic
bpf: Fix verifier jsgt branch analysis on max bound
vsock: fix locking in vsock_shutdown()
net: hns3: add a check for index in hclge_get_rss_key()
net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
net: hns3: add a check for queue_id in hclge_reset_vf_queue()
net: dsa: felix: implement port flushing on .phylink_mac_link_down
switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
net: watchdog: hold device global xmit lock during tx disable
netfilter: nftables: relax check for stateful expressions in set definition
netfilter: conntrack: skip identical origin tuple in same zone only
vsock/virtio: update credit only if socket is not closed
net: fix iteration for sctp transport seq_files
net: ena: Update XDP verdict upon failure
net/vmw_vsock: improve locking in vsock_connect_timeout()
net/vmw_vsock: fix NULL pointer dereference
ibmvnic: Clear failover_pending if unable to schedule
net: stmmac: set TxQ mode back to DCB after disabling CBS
...
Before this patch, variable offset access to the stack was dissalowed
for regular instructions, but was allowed for "indirect" accesses (i.e.
helpers). This patch removes the restriction, allowing reading and
writing to the stack through stack pointers with variable offsets. This
makes stack-allocated buffers more usable in programs, and brings stack
pointers closer to other types of pointers.
The motivation is being able to use stack-allocated buffers for data
manipulation. When the stack size limit is sufficient, allocating
buffers on the stack is simpler than per-cpu arrays, or other
alternatives.
In unpriviledged programs, variable-offset reads and writes are
disallowed (they were already disallowed for the indirect access case)
because the speculative execution checking code doesn't support them.
Additionally, when writing through a variable-offset stack pointer, if
any pointers are in the accessible range, there's possilibities of later
leaking pointers because the write cannot be tracked precisely.
Writes with variable offset mark the whole range as initialized, even
though we don't know which stack slots are actually written. This is in
order to not reject future reads to these slots. Note that this doesn't
affect writes done through helpers; like before, helpers need the whole
stack range to be initialized to begin with.
All the stack slots are in range are considered scalars after the write;
variable-offset register spills are not tracked.
For reads, all the stack slots in the variable range needs to be
initialized (but see above about what writes do), otherwise the read is
rejected. All register spilled in stack slots that might be read are
marked as having been read, however reads through such pointers don't do
register filling; the target register will always be either a scalar or
a constant zero.
Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210207011027.676572-2-andreimatei1@gmail.com
nvme drivers need to set the state of request to MQ_RQ_COMPLETE when
directly complete request in queue_rq.
So add blk_mq_set_request_complete.
Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Here are the USB-serial updates for 5.12-rc1, including:
- a line-speed fix for newer pl2303 devices
- a line-speed fix for FTDI FT-X devices
- a new xr_serial driver for MaxLinear/Exar devices (non-ACM mode)
- a cdc-acm blacklist entry for when the xr_serial driver is enabled
- cp210x support for software flow control
- various cp210x modem-control fixes
- an updated ZTE P685M modem entry to stop claiming the QMI interface
- an update to drop the port_remove() driver-callback return value
Included are also various clean ups.
All have been in linux-next with no reported issues.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQHbPq+cpGvN/peuzMLxc3C7H1lCAUCYCPxBQAKCRALxc3C7H1l
CP1iAQCRn7/4ulkGXgSjVL2o8TfGAQRhvxL14qtzysOyPLwgrAD6ApuJdPRHnetL
q0TDaRqnXqVTV6uUfoSC5eVEF4dS/Qs=
=5ZLn
-----END PGP SIGNATURE-----
Merge tag 'usb-serial-5.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-next
Johan writes:
USB-serial updates for 5.12-rc1
Here are the USB-serial updates for 5.12-rc1, including:
- a line-speed fix for newer pl2303 devices
- a line-speed fix for FTDI FT-X devices
- a new xr_serial driver for MaxLinear/Exar devices (non-ACM mode)
- a cdc-acm blacklist entry for when the xr_serial driver is enabled
- cp210x support for software flow control
- various cp210x modem-control fixes
- an updated ZTE P685M modem entry to stop claiming the QMI interface
- an update to drop the port_remove() driver-callback return value
Included are also various clean ups.
All have been in linux-next with no reported issues.
* tag 'usb-serial-5.12-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial: (41 commits)
USB: serial: drop bogus to_usb_serial_port() checks
USB: serial: make remove callback return void
USB: serial: drop if with an always false condition
USB: serial: option: update interface mapping for ZTE P685M
USB: serial: ftdi_sio: restore divisor-encoding comments
USB: serial: ftdi_sio: fix FTX sub-integer prescaler
USB: serial: cp210x: clean up auto-RTS handling
USB: serial: cp210x: fix RTS handling
USB: serial: cp210x: clean up printk zero padding
USB: serial: cp210x: clean up flow-control debug message
USB: serial: cp210x: drop shift macros
USB: serial: cp210x: fix modem-control handling
USB: serial: cp210x: suppress modem-control errors
USB: serial: mos7720: fix error code in mos7720_write()
USB: serial: xr: fix B0 handling
USB: serial: xr: fix pin configuration
USB: serial: xr: fix gpio-mode handling
USB: serial: xr: simplify line-speed logic
USB: serial: xr: clean up line-settings handling
USB: serial: xr: document vendor-request recipient
...
Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
all writes into sequential write required zones be aligned to the device
physical block size. However, NVMe ZNS does not have this constraint and
allows write operations into sequential zones to be aligned to the
device logical block size. This inconsistency does not help with
software portability across device types.
To solve this, introduce the zone_write_granularity queue limit to
indicate the alignment constraint, in bytes, of write operations into
zones of a zoned block device. This new limit is exported as a
read-only sysfs queue attribute and the helper
blk_queue_zone_write_granularity() introduced for drivers to set this
limit.
The function blk_queue_set_zoned() is modified to set this new limit to
the device logical block size by default. NVMe ZNS devices as well as
zoned nullb devices use this default value as is. The scsi disk driver
is modified to execute the blk_queue_zone_write_granularity() helper to
set the zone write granularity of host-managed SMR disks to the disk
physical block size.
The accessor functions queue_zone_write_granularity() and
bdev_zone_write_granularity() are also introduced.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.
Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are not users of mutex_trylock_recursive() in tree as of
v5.11-rc7.
Remove it.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210210085248.219210-2-bigeasy@linutronix.de
This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device,
without the need for a device up/down.
User sets it to 1 or 0 to enable or disable threaded mode.
Note: when switching between threaded and the current softirq based mode
for a napi instance, it will not immediately take effect if the napi is
currently being polled. The mode switch will happen for the next time
napi_schedule() is called.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows running each napi poll loop inside its own
kernel thread.
The kthread is created during netif_napi_add() if dev->threaded
is set. And threaded mode is enabled in napi_enable(). We will
provide a way to set dev->threaded and enable threaded mode
without a device up/down in the following patch.
Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.
The threaded poll loop behaves quite likely the net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socinfo driver gains support for dumping information about the platform's
PMICs, as well as new definitions for a number of platforms. The LLCC driver
gains SM8250 support, AOSS QMP gains SM8350 support and the RPMPD driver gains
support for MSM8994 power domains. In addition to this it contains a few minor
fixes in the ocmem, rpmh and llcc drivers.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmAbg1wbHGJqb3JuLmFu
ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3FYYEP/R2zxLPj1ntJmyLV/MWX
bioEGJWXKk60b2k5lAai2nwD/iswPqMyRtI3eXub7YVoYPRVMDHVl7nuGE83hHES
r3OUCTisiM5VCOIYjzs9ZJJx4ceGzicsXoV2eEZUabQ4pg2/VHzdbh3DmH+Yh7hL
PTgEiAYqQRSpRlFTf7ByccuqjAMyhs1GJ3Ajrl9dACsXIrT8ktPqk1UZ6JDl7w+3
iox86p6EzpcnzgMY1APNgAoMDqHNOMbZky5zvgWEdMXGnpBZGjY8l1XXzIG9ZjE2
o9u9DnxnCyBJoaxqbsBeHmFux2QCNTggQc4k5fd4BI0vFLR5X4sTyhcG2rEy12st
LUaSKP9hb4M4JTkbMCvKAgae1FrMArLXAExhsoXopa2QwV0JwbtlE1onaOE4hsGT
9YmBuJD6auegplIroGbOEihoNrOhPWEiNCX8N9I0daPewY/ulzxqn57Blq1RVXmV
xs3ifBVyiFTbTD/cFvyDKnLDbgPuaT1bUReG6QHZYzOO/vzshe6JkduNY/RRgRd/
l/ENBkZ70yQRvImIcbzRgmq767u7zGa7VWrwTmLOe8+5VxxosxZD1sYBioO9N2uL
pZ5kfAeEAfhn70SX+12SMYIMKxNuvVSCesQNNoBHyMAQX6Y0pbuAyEEyrRdcgodO
lT1qpN+V2lkjlmzWEiokI3V0
=rWBX
-----END PGP SIGNATURE-----
Merge tag 'qcom-drivers-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/drivers
Qualcomm driver updates for 5.12
The socinfo driver gains support for dumping information about the platform's
PMICs, as well as new definitions for a number of platforms. The LLCC driver
gains SM8250 support, AOSS QMP gains SM8350 support and the RPMPD driver gains
support for MSM8994 power domains. In addition to this it contains a few minor
fixes in the ocmem, rpmh and llcc drivers.
* tag 'qcom-drivers-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
soc: qcom: ocmem: don't return NULL in of_get_ocmem
soc: qcom: socinfo: Remove unwanted le32_to_cpu()
soc: qcom: aoss: Add SM8350 compatible
drivers: soc: qcom: rpmpd: Add msm8994 RPM Power Domains
soc: qcom: socinfo: Fix an off by one in qcom_show_pmic_model()
soc: qcom: socinfo: Fix off-by-one array index bounds check
soc: qcom: socinfo: Add MDM9607 IDs
soc: qcom: socinfo: Add SoC IDs for APQ/MSM8998
soc: qcom: socinfo: Add SoC IDs for 630 family
soc: qcom: socinfo: Open read access to all for debugfs
soc: qcom: socinfo: add info from PMIC models array
soc: qcom: socinfo: add several PMIC IDs
soc: qcom: socinfo: add qrb5165 SoC ID
soc: qcom: rpmh: Remove serialization of TCS commands
soc: qcom: smem: use %*ph to print small buffer
dt-bindings: soc: qcom: convert qcom,smem bindings to yaml
drivers: qcom: rpmh-rsc: Do not read back the register write on trigger
soc: qcom: llcc-qcom: Add support for SM8250 SoC
soc: qcom: llcc-qcom: Extract major hardware version
dt-bindings: msm: Add LLCC for SM8250
Link: https://lore.kernel.org/r/20210204052258.388890-1-bjorn.andersson@linaro.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Restructure the code a bit to make it simpler, fix some formatting problems
and add READ_ONCE/WRITE_ONCE to make sure there's no compiler load/store
tearing to the variables that can be accessed across CPUs.
Started with Mathieu Desnoyers's patch:
Link: https://lore.kernel.org/lkml/20210203175741.20665-1-mathieu.desnoyers@efficios.com/
And will keep his signature, but I will take the responsibility of this
being correct, and keep the authorship.
Link: https://lkml.kernel.org/r/20210204143004.61126582@gandalf.local.home
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
With static calls, a tracepoint can call the callback directly if there is
only one callback registered to that tracepoint. When there is more than
one, the static call will call the tracepoint's "iterator" function, which
needs to reload the tracepoint's "funcs" array again, as it could have
changed since the first time it was loaded.
But an arch without static calls is punished by having to load the
tracepoint's "funcs" array twice. Once in the DO_TRACE macro, and once
again in the iterator macro.
For archs without static calls, there's no reason to load the array macro
in the first place, since the iterator function will do it anyway.
Change the __DO_TRACE_CALL() macro to do the load and call of the
tracepoints funcs array only for architectures with static calls, and just
call the iterator function directly for architectures without static calls.
Link: https://lkml.kernel.org/r/20210208201050.909329787@goodmis.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
While working on a clean up that would restructure the difference between
architectures that have static calls vs those that do not, I was stumbling
over the "data_args" parameter that includes "__data" in the arguments. The
issue was that one version didn't even need it, while the other one did.
Instead of injecting a "__data = NULL;" into the macro for the unneeded
version, just remove it completely.
The original idea behind data_args is that there may be a case of a
tracepoint with no arguments. But this is considered bad practice, and all
tracepoints should pass something to that location (that's what tracepoints
were created for).
Link: https://lkml.kernel.org/r/20210208201050.768074128@goodmis.org
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Current KVM_USER_MEM_SLOTS limits are arch specific (512 on Power, 509 on x86,
32 on s390, 16 on MIPS) but they don't really need to be. Memory slots are
allocated dynamically in KVM when added so the only real limitation is
'id_to_index' array which is 'short'. We don't have any other
KVM_MEM_SLOTS_NUM/KVM_USER_MEM_SLOTS-sized statically defined structures.
Low KVM_USER_MEM_SLOTS can be a limiting factor for some configurations.
In particular, when QEMU tries to start a Windows guest with Hyper-V SynIC
enabled and e.g. 256 vCPUs the limit is hit as SynIC requires two pages per
vCPU and the guest is free to pick any GFN for each of them, this fragments
memslots as QEMU wants to have a separate memslot for each of these pages
(which are supposed to act as 'overlay' pages).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210127175731.2020089-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All usb_serial drivers return 0 in their remove callbacks and driver
core ignores the value returned by usb_serial_device_remove(). So change
the remove callback to return void and return 0 unconditionally in
usb_serial_device_remove().
Signed-off-by: Uwe Kleine-König <uwe@kleine-koenig.org>
Link: https://lore.kernel.org/r/20210208143149.963644-2-uwe@kleine-koenig.org
Signed-off-by: Johan Hovold <johan@kernel.org>
Currently, the follow_pfn function is exported for modules but
follow_pte is not. However, follow_pfn is very easy to misuse,
because it does not provide protections (so most of its callers
assume the page is writable!) and because it returns after having
already unlocked the page table lock.
Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments. The older version
survives as follow_invalidate_pte() for use by fs/dax.c.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This includes following Thunderbolt/USB4 changes for v5.12 merge
window:
* Start lane initialization after sleep for Thunderbolt 3 compatible
devices
* Add support for de-authorizing PCIe tunnels (software based
connection manager only)
* Add support for new ACPI 6.4 USB4 _OSC
* Allow disabling XDomain protocol
* Add support for new SL5 security level
* Clean up kernel-docs to pass W=1 builds
* A couple of cleanups and minor fixes
All these have been in linux-next without reported issues.
-----BEGIN PGP SIGNATURE-----
iQJUBAABCgA+FiEEVTdhRGBbNzLrSUBaAP2fSd+ZWKAFAmAib9IgHG1pa2Eud2Vz
dGVyYmVyZ0BsaW51eC5pbnRlbC5jb20ACgkQAP2fSd+ZWKAykw/+JfXClHYVlqRh
IH7kFkD4nA7g7359PpmSLxRBzvkivz7w66BqNtp6GhIF1oGRtDJ5t5ufgwdYX3Ld
tqH2Glhw9gV5EmeqDzq3TbLzAU+zm9a5bVE3vwQbxpgPGtDigKpDjqUobGFooDaB
F8EX3H3rwI3i+1/S1vBAZbuJqND0cuDM2yQNYJ//Dch2wnZ/Vy6darzY+GpJYSWi
ZlzWsvhfbo1V3n1PphzRkvzcMbVcJriVfwJihtY1h3H6RnnD4pqQJ7e1ec+4Pgdv
fql3lJg0M02rzTSYLyLwuMOojZA+gPqLWb3Fj5pkfjcKoneq1BNTsl4xRzDpqwP3
/DeOPE+TO6SAsAkRksyH5W9nvaYGmLy0r9JREkkzdhH41RML4ueEjstaJyfWn1N+
bX2Ba+5A9pIWKYbwfbNTZPyHk9r94IdDDXzjXFLSeOTcCwMf97XX4sgrbPHedxh6
VAjIrYxuSCTRRN1G6xtPI9aer5wh7e01W71WK3SJhvpqs092E0m0p4gN/nm6yW1Z
SSJg36uqp0egjU/NxsyNvB1ePQFiUkO7WIow9pVBU+US7vSuW9IXbC59lAx0XUE5
tonIF7zAiz1hLp6DZRYfTX7R9xbyshep0uhESceCcIsF5winsgUXL76QEMN0fEjf
wmUG/HWrOd2O/IH8704rfnu7ajxATGM=
=pzr4
-----END PGP SIGNATURE-----
Merge tag 'thunderbolt-for-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-next
Mika writes:
thunderbolt: Changes for v5.12 merge window
This includes following Thunderbolt/USB4 changes for v5.12 merge
window:
* Start lane initialization after sleep for Thunderbolt 3 compatible
devices
* Add support for de-authorizing PCIe tunnels (software based
connection manager only)
* Add support for new ACPI 6.4 USB4 _OSC
* Allow disabling XDomain protocol
* Add support for new SL5 security level
* Clean up kernel-docs to pass W=1 builds
* A couple of cleanups and minor fixes
All these have been in linux-next without reported issues.
* tag 'thunderbolt-for-v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (27 commits)
thunderbolt: Add support for native USB4 _OSC
ACPI: Add support for native USB4 control _OSC
ACPI: Execute platform _OSC also with query bit clear
thunderbolt: Allow disabling XDomain protocol
thunderbolt: Add support for PCIe tunneling disabled (SL5)
thunderbolt: dma_test: Drop unnecessary include
thunderbolt: Add clarifying comments about USB4 terms router and adapter
thunderbolt: switch: Fix kernel-doc descriptions of non-static functions
thunderbolt: nhi: Fix kernel-doc descriptions of non-static functions
thunderbolt: path: Fix kernel-doc descriptions of non-static functions
thunderbolt: eeprom: Fix kernel-doc descriptions of non-static functions
thunderbolt: ctl: Fix kernel-doc descriptions of non-static functions
thunderbolt: switch: Fix function name in the header
thunderbolt: tunnel: Fix misspelling of 'receive_path'
thunderbolt: icm: Fix a couple of formatting issues
thunderbolt: switch: Demote a bunch of non-conformant kernel-doc headers
thunderbolt: tb: Kernel-doc function headers should document their parameters
thunderbolt: nhi: Demote some non-conformant kernel-doc headers
thunderbolt: xdomain: Fix 'tb_unregister_service_driver()'s 'drv' param
thunderbolt: eeprom: Demote non-conformant kernel-doc headers to standard comment blocks
...
PD Rev 3.0 introduces SVDM Version 2.0. This patch makes the field
configuable in the header in order to be able to be compatible with
older SVDM version.
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Kyle Tso <kyletso@google.com>
Link: https://lore.kernel.org/r/20210205033415.3320439-3-kyletso@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PD Spec Revision 3.0 Version 2.0 + ECNs 2020-12-10
6.4.4.2.3 Structured VDM Version
"The Structured VDM Version field of the Discover Identity Command
sent and received during VDM discovery Shall be used to determine the
lowest common Structured VDM Version supported by the Port Partners or
Cable Plug and Shall continue to operate using this Specification
Revision until they are Detached."
Add a variable in typec_capability to specify the highest SVDM version
supported by the port and another variable in typec_partner to cache the
negotiated SVDM version between the port and the partner.
Also add setter/getter functions for the negotiated SVDM version.
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Signed-off-by: Kyle Tso <kyletso@google.com>
Link: https://lore.kernel.org/r/20210205033415.3320439-2-kyletso@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add notification to inform caller that mux objects array has been
created. It allows to user, invoked platform device registration for
"i2c-mux-mlxcpld" driver, to be notified that mux infrastructure is
available, and thus some devices could be connected to this
infrastructure.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Acked-by: Peter Rosin <peda@axentia.se>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Extend driver to allow I2C routing control through CPLD devices with
word address space. Till now only CPLD devices with byte address space
have been supported.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Michael Shych <michaelsh@nvidia.com>
Acked-by: Peter Rosin <peda@axentia.se>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
Do not set the argument 'force_nr' of i2c_mux_add_adapter() routine,
instead provide argument 'chan_id'.
Rename mux ids array from 'adap_ids' to 'chan_ids'.
The motivation is to prepare infrastructure to be able to:
- Create only the child adapters which are actually needed - for which
channel ids are specified.
- To assign 'nrs' to these child adapters dynamically, with no 'nr'
enforcement.
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Acked-by: Peter Rosin <peda@axentia.se>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
The ST-Ericsson U300 platform is getting removed, so this driver is no
longer needed.
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20210120131026.1721788-5-arnd@kernel.org
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Cleanup the synchronize_srcu() from the ODP flow as it was found to be a
very heavy time consumer as part of dereg_mr.
For example de-registration of 10000 ODP MRs each with size of 2M hugepage
took 19.6 sec comparing de-registration of same number of non ODP MRs that
took 172 ms.
The new locking scheme uses the wait_event() mechanism which follows the
use count of the MR instead of using synchronize_srcu().
By that change, the time required for the above test took 95 ms which is
even better than the non ODP flow.
Once fully dropped the srcu usage, had to come with a lock to protect the
XA access.
As part of using the above mechanism we could also clean the
num_deferred_work stuff and follow the use count instead.
Link: https://lore.kernel.org/r/20210202071309.2057998-1-leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Prevent netif_tx_disable() running concurrently with dev_watchdog() by
taking the device global xmit lock. Otherwise, the recommended:
netif_carrier_off(dev);
netif_tx_disable(dev);
driver shutdown sequence can happen after the watchdog has already
checked carrier, resulting in possible false alarms. This is because
netif_tx_lock() only sets the frozen bit without maintaining the locks
on the individual queues.
Fixes: c3f26a269c ("netdev: Fix lockdep warnings in multiqueue configurations.")
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov says:
=================
Implement support for VF tunneling
Abstract
Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.
SW TC model
From TC perspective VF tunnel configuration requires two rules in both
directions:
TX rules
1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:
$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 16:c9:a0:2d:69:2c
src_mac 0c:42:a1:58:ab:e4
eth_type ipv4
ip_flags nofrag
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
index 3 ref 1 bind 1 installed 377 sec used 0 sec
Action statistics:
Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 114096 bytes 952 pkt
backlog 0b 0p requeues 0
cookie 878fa48d8c423fc08c3b6ca599b50a97
no_percpu
used_hw_stats delayed
2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
RX rules
1. Rule that encapsulates the tunneled flow and redirects packets from
source VF rep to tunnel device:
$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 0a:40:bd:30:89:99
src_mac ca:2e:a7:3f:f5:0f
eth_type ipv4
ip_tos 0/0x3
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 7.7.7.5
dst_ip 7.7.7.1
key_id 98
dst_port 4789
nocsum
ttl 64 pipe
index 1 ref 1 bind 1 installed 411 sec used 411 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
no_percpu
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
index 1 ref 1 bind 1 installed 411 sec used 0 sec
Action statistics:
Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 5615833 bytes 4028 pkt
backlog 0b 0p requeues 0
cookie bb406d45d343bf7ade9690ae80c7cba4
no_percpu
used_hw_stats delayed
2. Rule that redirects from tunnel device to UL rep:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
HW offloads model
For hardware offload the goal is to mach packet on both rules without exposing
it to software on tunnel endpoint VF. In order to achieve this for tx, TC
implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch
with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification
rule to overwrite packet source port to the value of tunnel VF. Eswitch code is
modified to recirculate such packets after source port value is changed, which
allows second tx rules to match.
For rx path indirect table infrastructure is used to allow fully processing VF
tunnel traffic in hardware. To implement such pipeline driver needs to program
the hardware after matching on UL rule to overwrite source vport from UL to
tunnel VF and recirculate the packet to the root table to allow matching on the
rule installed on tunnel VF. For this, indirect table matches all encapsulated
traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by
the miss rule. Such configuration will cause packet to appear on VF representor
instead of VF itself if packet has been matches by indirect table rule based on
tunnel parameters but missed on second rule (after recirculation). Handle such
case by marking packets processed by indirect table with special 0xFFF value in
reg_c1 and extending slow table with additional flow group that matches on
reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF
mark). When creating offloads fdb tables, install one rule per VF vport to match
on recirculated miss packets and redirect them to appropriate VF vport.
Routing events
In order to support routing changes and migration of tunnel device between
different endpoint VFs, implement routing infrastructure and update it with FIB
events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
rule is attached to a routing entry, which is shared for rules of same tunnel.
On FIB event the work is scheduled to delete/recreate all rules of affected
tunnel.
Note: only vxlan tunnel type is supported by this series.
=================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmAeINMACgkQSD+KveBX
+j5ssAgAlmHUXB13W8FzXmp37hj6990QVVUNMe1tX09u6TOKi3X9VgRydCLdZlIm
CEgdknjhlesjiYsy4z9o8MTV4IXGnNoy+qW9cuL9SCpDpVLeJ0g+3/laUv21oOhr
zGxR4nmLwDxpzAj8huqOv5kVlojiA90x9wZIiOjx0+obOmglhfjzpUORAGXeHQTf
yxeiEi1ef5MO02lE854gzPBF60XB6LN7+Viw+4E+G67n7TdvIQ0xu2j/DpOubpH2
BzXoU12a424FvpAhhW8xrIZF4wFEo120Ln+vDMGq30Hqo/9gFQ1EmSBXaOOVhPwx
M/gJ3OJhckrMpNs36tdCyoOm/pTS+w==
=7d1N
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2021-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2021-02-04
Vlad Buslov says:
=================
Implement support for VF tunneling
Abstract
Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.
SW TC model
From TC perspective VF tunnel configuration requires two rules in both
directions:
TX rules
1. Rule that redirects packets from UL to VF rep that has the tunnel
endpoint IP address:
$ tc -s filter show dev enp8s0f0 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 16:c9:a0:2d:69:2c
src_mac 0c:42:a1:58:ab:e4
eth_type ipv4
ip_flags nofrag
in_hw in_hw_count 1
action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
index 3 ref 1 bind 1 installed 377 sec used 0 sec
Action statistics:
Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 114096 bytes 952 pkt
backlog 0b 0p requeues 0
cookie 878fa48d8c423fc08c3b6ca599b50a97
no_percpu
used_hw_stats delayed
2. Rule that decapsulates the tunneled flow and redirects to destination VF
representor:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
RX rules
1. Rule that encapsulates the tunneled flow and redirects packets from
source VF rep to tunnel device:
$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac 0a:40:bd:30:89:99
src_mac ca:2e:a7:3f:f5:0f
eth_type ipv4
ip_tos 0/0x3
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key set
src_ip 7.7.7.5
dst_ip 7.7.7.1
key_id 98
dst_port 4789
nocsum
ttl 64 pipe
index 1 ref 1 bind 1 installed 411 sec used 411 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
no_percpu
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
index 1 ref 1 bind 1 installed 411 sec used 0 sec
Action statistics:
Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 5615833 bytes 4028 pkt
backlog 0b 0p requeues 0
cookie bb406d45d343bf7ade9690ae80c7cba4
no_percpu
used_hw_stats delayed
2. Rule that redirects from tunnel device to UL rep:
$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
dst_mac ca:2e:a7:3f:f5:0f
src_mac 0a:40:bd:30:89:99
eth_type ipv4
enc_dst_ip 7.7.7.5
enc_src_ip 7.7.7.1
enc_key_id 98
enc_dst_port 4789
enc_tos 0
ip_flags nofrag
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1 installed 434 sec used 434 sec
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
used_hw_stats delayed
action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
index 4 ref 1 bind 1 installed 434 sec used 0 sec
Action statistics:
Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
Sent software 0 bytes 0 pkt
Sent hardware 129936 bytes 1082 pkt
backlog 0b 0p requeues 0
cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
no_percpu
used_hw_stats delayed
HW offloads model
For hardware offload the goal is to mach packet on both rules without exposing
it to software on tunnel endpoint VF. In order to achieve this for tx, TC
implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch
with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification
rule to overwrite packet source port to the value of tunnel VF. Eswitch code is
modified to recirculate such packets after source port value is changed, which
allows second tx rules to match.
For rx path indirect table infrastructure is used to allow fully processing VF
tunnel traffic in hardware. To implement such pipeline driver needs to program
the hardware after matching on UL rule to overwrite source vport from UL to
tunnel VF and recirculate the packet to the root table to allow matching on the
rule installed on tunnel VF. For this, indirect table matches all encapsulated
traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by
the miss rule. Such configuration will cause packet to appear on VF representor
instead of VF itself if packet has been matches by indirect table rule based on
tunnel parameters but missed on second rule (after recirculation). Handle such
case by marking packets processed by indirect table with special 0xFFF value in
reg_c1 and extending slow table with additional flow group that matches on
reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF
mark). When creating offloads fdb tables, install one rule per VF vport to match
on recirculated miss packets and redirect them to appropriate VF vport.
Routing events
In order to support routing changes and migration of tunnel device between
different endpoint VFs, implement routing infrastructure and update it with FIB
events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
rule is attached to a routing entry, which is shared for rules of same tunnel.
On FIB event the work is scheduled to delete/recreate all rules of affected
tunnel.
Note: only vxlan tunnel type is supported by this series.
=================
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.
To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
and restricted bio.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>