Commit graph

1105317 commits

Author SHA1 Message Date
Vladimir Oltean
93a8417088 net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
Given the following order of operations:

(1) we add filter A using tc-flower
(2) we send a packet that matches it
(3) we read the filter's statistics to find a hit count of 1
(4) we add a second filter B with a higher preference than A, and A
    moves one position to the right to make room in the TCAM for it
(5) we send another packet, and this matches the second filter B
(6) we read the filter statistics again.

When this happens, the hit count of filter A is 2 and of filter B is 1,
despite a single packet having matched each filter.

Furthermore, in an alternate history, reading the filter stats a second
time between steps (3) and (4) makes the hit count of filter A remain at
1 after step (6), as expected.

The reason why this happens has to do with the filter->stats.pkts field,
which is written to hardware through the call path below:

               vcap_entry_set
               /      |      \
              /       |       \
             /        |        \
            /         |         \
es0_entry_set   is1_entry_set   is2_entry_set
            \         |         /
             \        |        /
              \       |       /
        vcap_data_set(data.counter, ...)

The primary role of filter->stats.pkts is to transport the filter hit
counters from the last readout all the way from vcap_entry_get() ->
ocelot_vcap_filter_stats_update() -> ocelot_cls_flower_stats().
The reason why vcap_entry_set() writes it to hardware is so that the
counters (saturating and having a limited bit width) are cleared
after each user space readout.

The writing of filter->stats.pkts to hardware during the TCAM entry
movement procedure is an unintentional consequence of the code design,
because the hit count isn't up to date at this point.

So at step (4), when filter A is moved by ocelot_vcap_filter_add() to
make room for filter B, the hardware hit count is 0 (no packet matched
on it in the meantime), but filter->stats.pkts is 1, because the last
readout saw the earlier packet. The movement procedure programs the old
hit count back to hardware, so this creates the impression to user space
that more packets have been matched than they really were.

The bug can be seen when running the gact_drop_and_ok_test() from the
tc_actions.sh selftest.

Fix the issue by reading back the hit count to tmp->stats.pkts before
migrating the VCAP filter. Sure, this is a best-effort technique, since
the packets that hit the rule between vcap_entry_get() and
vcap_entry_set() won't be counted, but at least it allows the counters
to be reliably used for selftests where the traffic is under control.

The vcap_entry_get() name is a bit unintuitive, but it only reads back
the counter portion of the TCAM entry, not the entire entry.

The index from which we retrieve the counter is also a bit unintuitive
(i - 1 during add, i + 1 during del), but this is the way in which TCAM
entry movement works. The "entry index" isn't a stored integer for a
TCAM filter, instead it is dynamically computed by
ocelot_vcap_block_get_filter_index() based on the entry's position in
the &block->rules list. That position (as well as block->count) is
automatically updated by ocelot_vcap_filter_add_to_block() on add, and
by ocelot_vcap_block_remove_filter() on del. So "i" is the new filter
index, and "i - 1" or "i + 1" respectively are the old addresses of that
TCAM entry (we only support installing/deleting one filter at a time).

Fixes: b596229448 ("net: mscc: ocelot: Add support for tcam")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:15:17 -07:00
Vladimir Oltean
477d2b9162 net: mscc: ocelot: restrict tc-trap actions to VCAP IS2 lookup 0
Once the CPU port was added to the destination port mask of a packet, it
can never be cleared, so even packets marked as dropped by the MASK_MODE
of a VCAP IS2 filter will still reach it. This is why we need the
OCELOT_POLICER_DISCARD to "kill dropped packets dead" and make software
stop seeing them.

We disallow policer rules from being put on any other chain than the one
for the first lookup, but we don't do this for "drop" rules, although we
should. This change is merely ascertaining that the rules dont't
(completely) work and letting the user know.

The blamed commit is the one that introduced the multi-chain architecture
in ocelot. Prior to that, we should have always offloaded the filters to
VCAP IS2 lookup 0, where they did work.

Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:15:16 -07:00
Vladimir Oltean
6741e11880 net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups
The VCAP IS2 TCAM is looked up twice per packet, and each filter can be
configured to only match during the first, second lookup, or both, or
none.

The blamed commit wrote the code for making VCAP IS2 filters match only
on the given lookup. But right below that code, there was another line
that explicitly made the lookup a "don't care", and this is overwriting
the lookup we've selected. So the code had no effect.

Some of the more noticeable effects of having filters match on both
lookups:

- in "tc -s filter show dev swp0 ingress", we see each packet matching a
  VCAP IS2 filter counted twice. This throws off scripts such as
  tools/testing/selftests/net/forwarding/tc_actions.sh and makes them
  fail.

- a "tc-drop" action offloaded to VCAP IS2 needs a policer as well,
  because once the CPU port becomes a member of the destination port
  mask of a packet, nothing removes it, not even a PERMIT/DENY mask mode
  with a port mask of 0. But VCAP IS2 rules with the POLICE_ENA bit in
  the action vector can only appear in the first lookup. What happens
  when a filter matches both lookups is that the action vector is
  combined, and this makes the POLICE_ENA bit ineffective, since the
  last lookup in which it has appeared is the second one. In other
  words, "tc-drop" actions do not drop packets for the CPU port, dropped
  packets are still seen by software unless there was an FDB entry that
  directed those packets to some other place different from the CPU.

The last bit used to work, because in the initial commit b596229448
("net: mscc: ocelot: Add support for tcam"), we were writing the FIRST
field of the VCAP IS2 half key with a 1, not with a "don't care".
The change to "don't care" was made inadvertently by me in commit
c1c3993edb ("net: mscc: ocelot: generalize existing code for VCAP"),
which I just realized, and which needs a separate fix from this one,
for "stable" kernels that lack the commit blamed below.

Fixes: 226e9cd82a ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:15:16 -07:00
Vladimir Oltean
16bbebd356 net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
ocelot_vcap_filter_del() works by moving the next filters over the
current one, and then deleting the last filter by calling vcap_entry_set()
with a del_filter which was specially created by memsetting its memory
to zeroes. vcap_entry_set() then programs this to the TCAM and action
RAM via the cache registers.

The problem is that vcap_entry_set() is a dispatch function which looks
at del_filter->block_id. But since del_filter is zeroized memory, the
block_id is 0, or otherwise said, VCAP_ES0. So practically, what we do
is delete the entry at the same TCAM index from VCAP ES0 instead of IS1
or IS2.

The code was not always like this. vcap_entry_set() used to simply be
is2_entry_set(), and then, the logic used to work.

Restore the functionality by populating the block_id of the del_filter
based on the VCAP block of the filter that we're deleting. This makes
vcap_entry_set() know what to do.

Fixes: 1397a2eb52 ("net: mscc: ocelot: create TCAM skeleton from tc filter chains")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:15:15 -07:00
Vladimir Oltean
e1846cff2f net: mscc: ocelot: mark traps with a bool instead of keeping them in a list
Since the blamed commit, VCAP filters can appear on more than one list.
If their action is "trap", they are chained on ocelot->traps via
filter->trap_list. This is in addition to their normal placement on the
VCAP block->rules list head.

Therefore, when we free a VCAP filter, we must remove it from all lists
it is a member of, including ocelot->traps.

There are at least 2 bugs which are direct consequences of this design
decision.

First is the incorrect usage of list_empty(), meant to denote whether
"filter" is chained into ocelot->traps via filter->trap_list.
This does not do the correct thing, because list_empty() checks whether
"head->next == head", but in our case, head->next == head->prev == NULL.
So we dereference NULL pointers and die when we call list_del().

Second is the fact that not all places that should remove the filter
from ocelot->traps do so. One example is ocelot_vcap_block_remove_filter(),
which is where we have the main kfree(filter). By keeping freed filters
in ocelot->traps we end up in a use-after-free in
felix_update_trapping_destinations().

Attempting to fix all the buggy patterns is a whack-a-mole game which
makes the driver unmaintainable. Actually this is what the previous
patch version attempted to do:
https://patchwork.kernel.org/project/netdevbpf/patch/20220503115728.834457-3-vladimir.oltean@nxp.com/

but it introduced another set of bugs, because there are other places in
which create VCAP filters, not just ocelot_vcap_filter_create():

- ocelot_trap_add()
- felix_tag_8021q_vlan_add_rx()
- felix_tag_8021q_vlan_add_tx()

Relying on the convention that all those code paths must call
INIT_LIST_HEAD(&filter->trap_list) is not going to scale.

So let's do what should have been done in the first place and keep a
bool in struct ocelot_vcap_filter which denotes whether we are looking
at a trapping rule or not. Iterating now happens over the main VCAP IS2
block->rules. The advantage is that we no longer risk having stale
references to a freed filter, since it is only present in that list.

Fixes: e42bd4ed09 ("net: mscc: ocelot: keep traps in a list")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:15:14 -07:00
Haowen Bai
f1c8781ac9 s390/dasd: Use kzalloc instead of kmalloc/memset
Use kzalloc rather than duplicating its implementation, which
makes code simple and easy to understand.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-6-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 20:08:27 -06:00
Jan Höppner
b9c10f68e2 s390/dasd: Fix read inconsistency for ESE DASD devices
Read requests that return with NRF error are partially completed in
dasd_eckd_ese_read(). The function keeps track of the amount of
processed bytes and the driver will eventually return this information
back to the block layer for further processing via __dasd_cleanup_cqr()
when the request is in the final stage of processing (from the driver's
perspective).

For this, blk_update_request() is used which requires the number of
bytes to complete the request. As per documentation the nr_bytes
parameter is described as follows:
   "number of bytes to complete for @req".

This was mistakenly interpreted as "number of bytes _left_ for @req"
leading to new requests with incorrect data length. The consequence are
inconsistent and completely wrong read requests as data from random
memory areas are read back.

Fix this by correctly specifying the amount of bytes that should be used
to complete the request.

Fixes: 5e6bdd37c5 ("s390/dasd: fix data corruption for thin provisioned devices")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-5-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 20:08:27 -06:00
Jan Höppner
cd68c48ea1 s390/dasd: Fix read for ESE with blksize < 4k
When reading unformatted tracks on ESE devices, the corresponding memory
areas are simply set to zero for each segment. This is done incorrectly
for blocksizes < 4096.

There are two problems. First, the increment of dst is done using the
counter of the loop (off), which is increased by blksize every
iteration. This leads to a much bigger increment for dst as actually
intended. Second, the increment of dst is done before the memory area
is set to 0, skipping a significant amount of bytes of memory.

This leads to illegal overwriting of memory and ultimately to a kernel
panic.

This is not a problem with 4k blocksize because
blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a
single iteration for the inner segment loop (bv.bv_len == blksize). The
incorrectly used 'off' value to increment dst is 0 and the correct
memory area is used.

In order to fix this for blksize < 4k, increment dst correctly using the
blksize and only do it at the end of the loop.

Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 20:08:27 -06:00
Stefan Haberland
71f3871657 s390/dasd: prevent double format of tracks for ESE devices
For ESE devices we get an error for write operations on an unformatted
track. Afterwards the track will be formatted and the IO operation
restarted.
When using alias devices a track might be accessed by multiple requests
simultaneously and there is a race window that a track gets formatted
twice resulting in data loss.

Prevent this by remembering the amount of formatted tracks when starting
a request and comparing this number before actually formatting a track
on the fly. If the number has changed there is a chance that the current
track was finally formatted in between. As a result do not format the
track and restart the current IO to check.

The number of formatted tracks does not match the overall number of
formatted tracks on the device and it might wrap around but this is no
problem. It is only needed to recognize that a track has been formatted at
all in between.

Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 20:08:27 -06:00
Stefan Haberland
5b53a405e4 s390/dasd: fix data corruption for ESE devices
For ESE devices we get an error when accessing an unformatted track.
The handling of this error will return zero data for read requests and
format the track on demand before writing to it. To do this the code needs
to distinguish between read and write requests. This is done with data from
the blocklayer request. A pointer to the blocklayer request is stored in
the CQR.

If there is an error on the device an ERP request is built to do error
recovery. While the ERP request is mostly a copy of the original CQR the
pointer to the blocklayer request is not copied to not accidentally pass
it back to the blocklayer without cleanup.

This leads to the error that during ESE handling after an ERP request was
built it is not possible to determine the IO direction. This leads to the
formatting of a track for read requests which might in turn lead to data
corruption.

Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 20:08:27 -06:00
Jakub Kicinski
949dfdcf34 Merge branch 'mptcp-improve-mptcp-level-window-tracking'
Mat Martineau says:

====================
mptcp: Improve MPTCP-level window tracking

This series improves MPTCP receive window compliance with RFC 8684 and
helps increase throughput on high-speed links. Note that patch 3 makes a
change in tcp_output.c

For the details, Paolo says:

I've been chasing bad/unstable performance with multiple subflows
on very high speed links.

It looks like the root cause is due to the current mptcp-level
congestion window handling. There are apparently a few different
sub-issues:

- the rcv_wnd is not effectively shared on the tx side, as each
  subflow takes in account only the value received by the underlaying
  TCP connection. This is addressed in patch 1/5

- The mptcp-level offered wnd right edge is currently allowed to shrink.
  Reading section 3.3.4.:

"""
   The receive window is relative to the DATA_ACK.  As in TCP, a
   receiver MUST NOT shrink the right edge of the receive window (i.e.,
   DATA_ACK + receive window).  The receiver will use the data sequence
   number to tell if a packet should be accepted at the connection
   level.
"""

I read the above as we need to reflect window right-edge tracking
on the wire, see patch 4/5.

- The offered window right edge tracking can happen concurrently on
  multiple subflows, but there is no mutex protection. We need an
  additional atomic operation - still patch 4/5

This series additionally bumps a few new MIBs to track all the above
(ensure/observe that the suspected races actually take place).

I could not access again the host where the issue was so
noticeable, still in the current setup the tput changes from
[6-18] Gbps to 19Gbps very stable.
====================

Link: https://lore.kernel.org/r/20220504215408.349318-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:20 -07:00
Paolo Abeni
38acb6260f mptcp: add more offered MIBs counter
Track the exceptional handling of MPTCP-level offered window
with a few more counters for observability.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:16 -07:00
Paolo Abeni
f3589be0c4 mptcp: never shrink offered window
As per RFC, the offered MPTCP-level window should never shrink.
While we currently track the right edge, we don't enforce the
above constraint on the wire.
Additionally, concurrent xmit on different subflows can end-up in
erroneous right edge update.
Address the above explicitly updating the announced window and
protecting the update with an additional atomic operation (sic)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:15 -07:00
Paolo Abeni
ea66758c17 tcp: allow MPTCP to update the announced window
The MPTCP RFC requires that the MPTCP-level receive window's
right edge never moves backward. Currently the MPTCP code
enforces such constraint while tracking the right edge, but it
does not reflects it on the wire, as MPTCP lacks a suitable hook
to update accordingly the TCP header.

This change modifies the existing mptcp_write_options() hook,
providing the current packet's TCP header to the MPTCP protocol,
so that the next patch could implement the above mentioned
constraint.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:15 -07:00
Paolo Abeni
92be2f5227 mptcp: add mib for xmit window sharing
Bump a counter for counter when snd_wnd is shared among subflow,
for observability's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:15 -07:00
Paolo Abeni
b713d00675 mptcp: really share subflow snd_wnd
As per RFC, mptcp subflows use a "shared" snd_wnd: the effective
window is the maximum among the current values received on all
subflows. Without such feature a data transfer using multiple
subflows could block.

Window sharing is currently implemented in the RX side:
__tcp_select_window uses the mptcp-level receive buffer to compute
the announced window.

That is not enough: the TCP stack will stick to the window size
received on the given subflow; we need to propagate the msk window
value on each subflow at xmit time.

Change the packet scheduler to ignore the subflow level window
and use instead the msk level one

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 19:00:14 -07:00
Jonathan Toppins
4e707344e1 MAINTAINERS: add missing files for bonding definition
The bonding entry did not include additional include files that have
been added nor did it reference the documentation. Add these references
for completeness.

Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Link: https://lore.kernel.org/r/903ed2906b93628b38a2015664a20d2802042863.1651690748.git.jtoppins@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:41:18 -07:00
Tariq Toukan
85db6352fc net: Fix features skip in for_each_netdev_feature()
The find_next_netdev_feature() macro gets the "remaining length",
not bit index.
Passing "bit - 1" for the following iteration is wrong as it skips
the adjacent bit. Pass "bit" instead.

Fixes: 3b89ea9c59 ("net: Fix for_each_netdev_feature on Big endian")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:25:57 -07:00
Dave Airlie
5727375215 Merge tag 'drm-msm-fixes-2022-04-30' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
single lockdep fix.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGtkzqzxDLp82OaKXVrWd7nWZtkxKsuOK1wOGCDz7qF-dA@mail.gmail.com
2022-05-06 11:22:03 +10:00
Andy Shevchenko
10b4a11fe7 firmware: tee_bnxt: Use UUID API for exporting the UUID
There is export_uuid() function which exports uuid_t to the u8 array.
Use it instead of open coding variant.

This allows to hide the uuid_t internals.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220504091407.70661-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:14:29 -07:00
Jakub Kicinski
690447a22c Merge branch 'vrf-fix-address-binding-with-icmp-socket'
Nicolas Dichtel says:

====================
vrf: fix address binding with icmp socket

The first patch fixes the issue.
The second patch adds related tests in selftests.
====================

Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:12:48 -07:00
Nicolas Dichtel
e71b7f1f44 selftests: add ping test with ping_group_range tuned
The 'ping' utility is able to manage two kind of sockets (raw or icmp),
depending on the sysctl ping_group_range. By default, ping_group_range is
set to '1 0', which forces ping to use an ip raw socket.

Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
After the previous patch, ipv4 tests results are the same with both kinds
of socket. For ipv6, there are a lot a new failures (the previous patch
fixes only two cases).

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:12:44 -07:00
Nicolas Dichtel
e1a7ac6f3b ping: fix address binding wrt vrf
When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.

Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0  2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address

CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 18:12:44 -07:00
Dave Airlie
ca5e2f4d6b drm-misc-fixes for v5.18-rc6:
- Small fix for hot-unplugging fb devices.
 - Kconfig fix for it6505.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmJz8hUACgkQ/lWMcqZw
 E8P2gQ//aIwrdXWEceTuWDRo4k42/eHa0P9HJJH7wXXZmWXAr/sdsDZcS/fbobEO
 iOV3XxHQx/QgRV1u2/0VwIOtDcSpdpFmoYlctNt5o5oycg4H9geq/uCWfSEhJ0sg
 eilVH4it6RgEDtU5CDIuewSu1MTu2IGCQZGlVRJmYubkcATUypni9qoPqODwU6jM
 g2nIIH3Z9FxzNbgwazUFjSR2ig8JK0qnuu577pSPiUc8Dw03PXs7r4zVjIVemUri
 zmbz/FfAdDLzaGqp2qu2e4jmMWF5QG4Um+gkRKJhtw5vg9kpjmQnsl8isxHMGLxC
 KqzJr/8E9PK5C8VoCEto9pwWTFO12aDe3BopuzG6mq7XIy4lXFCVtT5GK8i0qkD6
 mR9Y0ig3rbZbO3mgLGbbbrBuDdPpec57YUOhWCBERtaZGvey2/m1J9eihRdCFc2p
 k0bGjaGqsxoWOOpagHfoBmVYCU7PS9SqUEsbNKqXuh4pizZlXrHsJXC/sWtS8Zx4
 N3lgILYMUah0ZKRwx1E+1CDJv2NStJvL5p4ZkWzF9Txcw5jSHBUnFa004TPW2i58
 zF1ljrTwpHGAtQg37EASP8SPpteORt5eMg+sQwvq56jlFWS/px+43xITwpjixBnj
 DHiJ031mhk1GP6lmuAnrO40L85Xx0+92HPHnnqh6g8FgKv2m/uc=
 =cQJQ
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2022-05-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

drm-misc-fixes for v5.18-rc6:
- Small fix for hot-unplugging fb devices.
- Kconfig fix for it6505.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/69e51773-8c6f-4ff7-9a06-5c2922a43999@linux.intel.com
2022-05-06 10:56:38 +10:00
David Ahern
c67b627e99 net: Make msg_zerocopy_alloc static
msg_zerocopy_alloc is only used by msg_zerocopy_realloc; remove the
export and make static in skbuff.c

Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/r/20220504170947.18773-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 17:02:50 -07:00
Fabio Estevam
15f03ffe4b net: phy: micrel: Pass .probe for KS8737
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the kszphy_suspend/
resume hooks are used.

These functions require the probe function to be called so that
priv can be allocated.

Otherwise, a NULL pointer dereference happens inside
kszphy_config_reset().

Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-2-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 17:02:02 -07:00
Fabio Estevam
e333eed63a net: phy: micrel: Do not use kszphy_suspend/resume for KSZ8061
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the following
NULL pointer dereference is observed on a board with KSZ8061:

 # udhcpc -i eth0
udhcpc: started, v1.35.0
8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = f73cef4e
[00000008] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty #94
Hardware name: Freescale i.MX6 SoloX (Device Tree)
PC is at kszphy_config_reset+0x10/0x114
LR is at kszphy_resume+0x24/0x64
...

The KSZ8061 phy_driver structure does not have the .probe/..driver_data
fields, which means that priv is not allocated.

This causes the NULL pointer dereference inside kszphy_config_reset().

Fix the problem by using the generic suspend/resume functions as before.

Another alternative would be to provide the .probe and .driver_data
information into the structure, but to be on the safe side, let's
just restore Ethernet functionality by using the generic suspend/resume.

Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 17:02:02 -07:00
Dave Airlie
ebbc04bdb1 Merge tag 'amd-drm-fixes-5.18-2022-05-04' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
amd-drm-fixes-5.18-2022-05-04:

amdgpu:
- Fix a xen dom0 regression on APUs
- Fix a potential array overflow if a receiver were to
  send an erroneous audio channel count

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504190439.5723-1-alexander.deucher@amd.com
2022-05-06 09:59:48 +10:00
Linus Torvalds
fe27d189e3 Two folio fixes for 5.18:
- Fix a race when we were calling folio_next() in the BIO folio iter
    without holding a reference, meaning the folio could be split or freed,
    and we'd jump to the next page instead of the intended next folio.
 
  - Fix readahead creating single-page folios instead of the intended
    large folios when doing reads that are not a power of two in size.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmJ0Xu4ACgkQDpNsjXcp
 gj4rTAf/Rp2P9jwnOCN9X78YBiydkHq9dtIYbEz1jhOr2pnbz/ZWOeWvVvTBgG5I
 GSIeaK3dhCBqi6G28QrQR1j1+gOWOJOs/rmJtkkOgBfoGsCL8HLFzcbXR10zeF2K
 8bhivsq5tshn2DiVu8WK1W2n25mg4k7ORrBVcuUtW4Am8EPsyJpzoSWBTlZJvClt
 Re9mIkbWNWktEyRiMl8wA4WRKqysaIWBuf9jugaOrv0Y0Db2TqiqYiAG6xm3VSZy
 ABf8ZSOyNuxF6ZrW2tUjwdnJ6oDXjVB3Dykw4EQMFQ6uINJPBArj8AkDUe4FJa2w
 9FmDLDxR1T4k9+8cEC6ZkkVb6KyvdQ==
 =KKsJ
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache

Pull folio fixes from Matthew Wilcox:
 "Two folio fixes for 5.18.

  Darrick and Brian have done amazing work debugging the race I created
  in the folio BIO iterator. The readahead problem was deterministic, so
  easy to fix.

   - Fix a race when we were calling folio_next() in the BIO folio iter
     without holding a reference, meaning the folio could be split or
     freed, and we'd jump to the next page instead of the intended next
     folio.

   - Fix readahead creating single-page folios instead of the intended
     large folios when doing reads that are not a power of two in size"

* tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
  mm/readahead: Fix readahead with large folios
  block: Do not call folio_next() on an unreferenced folio
2022-05-05 16:52:15 -07:00
Tetsuo Handa
6997fbd7a3 net: rds: use maybe_get_net() when acquiring refcount on TCP sockets
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
delayed works queued in rds_wq might be invoked after a net namespace's
refcount already reached 0.

Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
it is guaranteed that we can instead use maybe_get_net() from delayed work
functions until rds_tcp_exit_net() returns.

Note that I'm not convinced that all works which might access a net
namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
will fail to wait for work functions, and kmem_cache_free() could be
called from net_free() before maybe_get_net() is called from
rds_tcp_tune().

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 3a58f13a88 ("net: rds: acquire refcount on TCP sockets")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 16:44:49 -07:00
Jens Axboe
9396ed850f io_uring: kill io_recv_buffer_select() wrapper
It's just a thin wrapper around io_buffer_select(), get rid of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:15:15 -06:00
Jens Axboe
0a352aaa94 io_uring: use 'sr' vs 'req->sr_msg' consistently
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:14:33 -06:00
Jens Axboe
0455d4ccec io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:09:31 -06:00
Jens Axboe
73911426aa io_uring: check IOPOLL/ioprio support upfront
Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.

This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.

Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-05 17:09:15 -06:00
Jakub Kicinski
8d602e1a13 net: move snowflake callers to netif_napi_add_tx_weight()
Make the drivers with custom tx napi weight call netif_napi_add_tx_weight().

Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220504163725.550782-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:54:18 -07:00
Jakub Kicinski
16d083e28f net: switch to netif_napi_add_tx()
Switch net callers to the new API not requiring
the NAPI_POLL_WEIGHT argument.

Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://lore.kernel.org/r/20220504163725.550782-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:54:12 -07:00
Jakub Kicinski
fd49f8e61c jme: remove an unnecessary indirection
Remove a define which looks like a OS abstraction layer
and makes spatch conversions on this driver problematic.

Link: https://lore.kernel.org/r/20220504163939.551231-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:53:28 -07:00
Christophe Leroy
6bff3ffcf6 net: ethernet: Prepare cleanup of powerpc's asm/prom.h
powerpc's asm/prom.h includes some headers that it doesn't
need itself.

In order to clean powerpc's asm/prom.h up in a further step,
first clean all files that include asm/prom.h

Some files don't need asm/prom.h at all. For those ones,
just remove inclusion of asm/prom.h

Some files don't need any of the items provided by asm/prom.h,
but need some of the headers included by asm/prom.h. For those
ones, add the needed headers that are brought by asm/prom.h at
the moment and remove asm/prom.h

Some files really need asm/prom.h but also need some of the
headers included by asm/prom.h. For those one, leave asm/prom.h
but also add the needed headers so that they can be removed
from asm/prom.h in a later step.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:53:02 -07:00
Christophe Leroy
d9ccf770c7 sungem: Prepare cleanup of powerpc's asm/prom.h
powerpc's <asm/prom.h> includes some headers that it doesn't
need itself.

In order to clean powerpc's <asm/prom.h> up in a further step,
first clean all files that include <asm/prom.h>

sungem_phy.c doesn't use any object provided by <asm/prom.h>.

But removing inclusion of <asm/prom.h> leads to the following
errors:

  CC      drivers/net/sungem_phy.o
drivers/net/sungem_phy.c: In function 'bcm5421_init':
drivers/net/sungem_phy.c:448:42: error: implicit declaration of function 'of_get_parent'; did you mean 'dget_parent'? [-Werror=implicit-function-declaration]
  448 |                 struct device_node *np = of_get_parent(phy->platform_data);
      |                                          ^~~~~~~~~~~~~
      |                                          dget_parent
drivers/net/sungem_phy.c:448:42: warning: initialization of 'struct device_node *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
drivers/net/sungem_phy.c:450:35: error: implicit declaration of function 'of_get_property' [-Werror=implicit-function-declaration]
  450 |                 if (np == NULL || of_get_property(np, "no-autolowpower", NULL))
      |                                   ^~~~~~~~~~~~~~~

Remove <asm/prom.h> from included headers but add <linux/of.h> to
handle the above.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/f7a7fab3ec5edf803d934fca04df22631c2b449d.1651662885.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:52:27 -07:00
Linus Torvalds
f47c960e93 Devicetree fixes for v5.18, part 3:
- Drop unused 'max-link-speed' in Apple PCIe
 
 - More redundant 'maxItems/minItems' schema fixes
 
 - Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
 
 - Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
 
 - Add missing 'power-domains' property to Cadence UFSHC
 -----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmJz6UAQHHJvYmhAa2Vy
 bmVsLm9yZwAKCRD6+121jbxhw1wzEACPQm9LVVSFxeevPMPfWEO3CqYdVYaLt2w9
 AnQHng8OkUJkN5lEdhQdyrYpJpvYlqVJVx7FeXr839nBi+qzdCszeWEu3WH6+Tw3
 HNp43QarhRAi5XyV87jBFQeuFONBBHEvOKuoqE8jQ/aFZ6hEOdrSQ9JVpcC5LPOK
 dvokPcHaKQElWVa1oG4hqJjoEHvTQNo391+L/PmCxLBlIIWkSX/BtEnXKioes0nm
 EXoYoMNQHsH3RtZThGb+NpafOgLE8oRODTJx+Q1zUDw9MEJHQCXEBv4Y+ene1pkg
 KkMaPGXNQbI6XQaVSBXHXHkx1vh6BE400MlHpOpeXau+psjYzEF96ePpe2GYq3GY
 TRFGvIj6W/We5VLdVlJ94CIdZAs1qSYQ0qIgzbg5pj7SK7KNLojkZiozg+qa5xXL
 5Z1VQGmaTMqLukUmLjFpF63Zdxm2AmgifbQTO1eZTFC9xPB3mKoIuIXojr/5g84u
 Jc05xNz/mbCnOwN3WNYLqc4bVS2bDhutfpDJOnn47RHI3vBhYiY1Xohe+Vm+Xu5s
 6hKKxTn/2QW0dubn9/5BLfvNuH9/3jg0Pw0f/orJcBdI1OuLKCuNhAO4PTjFOAAZ
 6wy5W5GOGwE7HgJGdb9BNQY0FHmaxL9lMWQrUuJ6yF7ghNKI2gO1MFWnzjYbsMaC
 V91pihSiCg==
 =fpGC
 -----END PGP SIGNATURE-----

Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

 - Drop unused 'max-link-speed' in Apple PCIe

 - More redundant 'maxItems/minItems' schema fixes

 - Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'

 - Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding

 - Add missing 'power-domains' property to Cadence UFSHC

* tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: pci: apple,pcie: Drop max-link-speed from example
  dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
  dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
  dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
  dt-bindings: ufs: cdns,ufshc: Add power-domains
2022-05-05 15:50:27 -07:00
Eyal Birger
1f86123b97 net: align SO_RCVMARK required privileges with SO_MARK
The commit referenced in the "Fixes" tag added the SO_RCVMARK socket
option for receiving the skb mark in the ancillary data.

Since this is a new capability, and exposes admin configured details
regarding the underlying network setup to sockets, let's align the
needed capabilities with those of SO_MARK.

Fixes: 6fd1d51cfa ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20220504095459.2663513-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:48:17 -07:00
Jakub Kicinski
c4a67a21a6 Revert "Merge branch 'mlxsw-line-card-model'"
This reverts commit 5e927a9f4b, reversing
changes made to cfc1d91a7d.

The discussion is still ongoing so let's remove the uAPI
until the discussion settles.

Link: https://lore.kernel.org/all/20220425090021.32e9a98f@kernel.org/
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220504154037.539442-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05 15:47:23 -07:00
Stanislav Jakubek
fa2024c315 dt-bindings: timer: Convert rda,8810pl-timer to YAML
Convert RDA Micro Timer bindings to DT schema format.

Signed-off-by: Stanislav Jakubek <stano.jakubek@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220504175502.GA2573@standask-GA-A55M-S2HP
2022-05-05 17:12:52 -05:00
Mimi Zohar
398c42e2c4 ima: support fs-verity file digest based version 3 signatures
IMA may verify a file's integrity against a "good" value stored in the
'security.ima' xattr or as an appended signature, based on policy.  When
the "good value" is stored in the xattr, the xattr may contain a file
hash or signature.  In either case, the "good" value is preceded by a
header.  The first byte of the xattr header indicates the type of data
- hash, signature - stored in the xattr.  To support storing fs-verity
signatures in the 'security.ima' xattr requires further differentiating
the fs-verity signature from the existing IMA signature.

In addition the signatures stored in 'security.ima' xattr, need to be
disambiguated.  Instead of directly signing the fs-verity digest, a new
signature format version 3 is defined as the hash of the ima_file_id
structure, which identifies the type of signature and the digest.

The IMA policy defines "which" files are to be measured, verified, and/or
audited.  For those files being verified, the policy rules indicate "how"
the file should be verified.  For example to require a file be signed,
the appraise policy rule must include the 'appraise_type' option.

	appraise_type:= [imasig] | [imasig|modsig] | [sigv3]
           where 'imasig' is the original or signature format v2 (default),
           where 'modsig' is an appended signature,
           where 'sigv3' is the signature format v3.

The policy rule must also indicate the type of digest, if not the IMA
default, by first specifying the digest type:

	digest_type:= [verity]

The following policy rule requires fsverity signatures.  The rule may be
constrained, for example based on a fsuuid or LSM label.

      appraise func=BPRM_CHECK digest_type=verity appraise_type=sigv3

Acked-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2022-05-05 17:41:51 -04:00
Mike Snitzer
4edadf6dcb dm: improve abnormal bio processing
Read/write/flush are the most common operations, optimize switch in
is_abnormal_io() for those cases. Follows same pattern established in
block perf-wip commit ("block: optimise blk_may_split for normal rw")

Also, push is_abnormal_io() check and blk_queue_split() down from
dm_submit_bio() to dm_split_and_process_bio() and set new
'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio
and __process_abnormal_io by leveraging ci.is_abnormal_io flag.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:36 -04:00
Mike Snitzer
9d20653fe8 dm: simplify bio-based IO accounting further
Now that io splitting is recorded prior to, or during, ->map IO
accounting can happen immediately rather than defer until after
bio splitting in dm_split_and_process_bio().

Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task
member because there is no longer any need to wait for splitting to
occur before accounting.

Also move dm_io struct's 'flags' member to consolidate struct holes.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:36 -04:00
Ming Lei
ec211631ae dm: put all polled dm_io instances into a single list
Now that bio_split() isn't used by DM's bio splitting, it is a bit
overkill to link dm_io into an hlist given there is only single dm_io
in the list.

Convert to using a single list for holding all dm_io instances
associated with this bio.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:36 -04:00
Ming Lei
0f14d60a02 dm: improve dm_io reference counting
Currently each dm_io's reference counter is grabbed before calling
__map_bio(), this way isn't efficient since we can move this grabbing
to initialization time inside alloc_io().

Meantime it becomes typical async io reference counter model: one is
for submission side, the other is for completion side, and the io won't
be completed until both sides are done.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:36 -04:00
Ming Lei
2e803cd99b dm: don't grab target io reference in dm_zone_map_bio
dm_zone_map_bio() is only called from __map_bio in which the io's
reference is grabbed already, and the reference won't be released
until the bio is submitted, so not necessary to do it dm_zone_map_bio
any more.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:36 -04:00
Ming Lei
7dd76d1fee dm: improve bio splitting and associated IO accounting
The current DM code (ab)uses late assignment of dm_io->orig_bio (after
__map_bio() returns and any bio splitting is complete) to indicate the
FS bio has been processed and can be accounted. This results in
awkward waiting until ->orig_bio is set in dm_submit_bio_remap().

Also the bio splitting was implemented using bio_split()+bio_chain()
-- a well-worn pattern but it requires bio cloning purely for the
benefit of more natural IO accounting.  The bio_split() result was
stored in ->orig_bio to represent the mapped part of the original FS
bio.

DM has switched to the bdev based IO accounting interface.  DM's IO
accounting can be implemented in terms of the original FS bio (now
stored early in ->orig_bio) via access to its sectors/bio_op. And
if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use
new dm_io fields of .sector_offset & .sectors to allow IO accounting
for split bios _without_ needing to clone a new bio to store in
->orig_bio.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-05 17:31:35 -04:00