Track the exceptional handling of MPTCP-level offered window
with a few more counters for observability.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per RFC, the offered MPTCP-level window should never shrink.
While we currently track the right edge, we don't enforce the
above constraint on the wire.
Additionally, concurrent xmit on different subflows can end-up in
erroneous right edge update.
Address the above explicitly updating the announced window and
protecting the update with an additional atomic operation (sic)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MPTCP RFC requires that the MPTCP-level receive window's
right edge never moves backward. Currently the MPTCP code
enforces such constraint while tracking the right edge, but it
does not reflects it on the wire, as MPTCP lacks a suitable hook
to update accordingly the TCP header.
This change modifies the existing mptcp_write_options() hook,
providing the current packet's TCP header to the MPTCP protocol,
so that the next patch could implement the above mentioned
constraint.
No functional changes intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Bump a counter for counter when snd_wnd is shared among subflow,
for observability's sake.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As per RFC, mptcp subflows use a "shared" snd_wnd: the effective
window is the maximum among the current values received on all
subflows. Without such feature a data transfer using multiple
subflows could block.
Window sharing is currently implemented in the RX side:
__tcp_select_window uses the mptcp-level receive buffer to compute
the announced window.
That is not enough: the TCP stack will stick to the window size
received on the given subflow; we need to propagate the msk window
value on each subflow at xmit time.
Change the packet scheduler to ignore the subflow level window
and use instead the msk level one
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The find_next_netdev_feature() macro gets the "remaining length",
not bit index.
Passing "bit - 1" for the following iteration is wrong as it skips
the adjacent bit. Pass "bit" instead.
Fixes: 3b89ea9c59 ("net: Fix for_each_netdev_feature on Big endian")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20220504080914.1918-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is export_uuid() function which exports uuid_t to the u8 array.
Use it instead of open coding variant.
This allows to hide the uuid_t internals.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220504091407.70661-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nicolas Dichtel says:
====================
vrf: fix address binding with icmp socket
The first patch fixes the issue.
The second patch adds related tests in selftests.
====================
Link: https://lore.kernel.org/r/20220504090739.21821-1-nicolas.dichtel@6wind.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'ping' utility is able to manage two kind of sockets (raw or icmp),
depending on the sysctl ping_group_range. By default, ping_group_range is
set to '1 0', which forces ping to use an ip raw socket.
Let's replay the ping tests by allowing 'ping' to use the ip icmp socket.
After the previous patch, ipv4 tests results are the same with both kinds
of socket. For ipv6, there are a lot a new failures (the previous patch
fixes only two cases).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When ping_group_range is updated, 'ping' uses the DGRAM ICMP socket,
instead of an IP raw socket. In this case, 'ping' is unable to bind its
socket to a local address owned by a vrflite.
Before the patch:
$ sysctl -w net.ipv4.ping_group_range='0 2147483647'
$ ip link add blue type vrf table 10
$ ip link add foo type dummy
$ ip link set foo master blue
$ ip link set foo up
$ ip addr add 192.168.1.1/24 dev foo
$ ip addr add 2001::1/64 dev foo
$ ip vrf exec blue ping -c1 -I 192.168.1.1 192.168.1.2
ping: bind: Cannot assign requested address
$ ip vrf exec blue ping6 -c1 -I 2001::1 2001::2
ping6: bind icmp socket: Cannot assign requested address
CC: stable@vger.kernel.org
Fixes: 1b69c6d0ae ("net: Introduce L3 Master device abstraction")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
msg_zerocopy_alloc is only used by msg_zerocopy_realloc; remove the
export and make static in skbuff.c
Signed-off-by: David Ahern <dsahern@kernel.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/r/20220504170947.18773-1-dsahern@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the kszphy_suspend/
resume hooks are used.
These functions require the probe function to be called so that
priv can be allocated.
Otherwise, a NULL pointer dereference happens inside
kszphy_config_reset().
Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-2-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit f1131b9c23 ("net: phy: micrel: use
kszphy_suspend()/kszphy_resume for irq aware devices") the following
NULL pointer dereference is observed on a board with KSZ8061:
# udhcpc -i eth0
udhcpc: started, v1.35.0
8<--- cut here ---
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = f73cef4e
[00000008] *pgd=00000000
Internal error: Oops: 5 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 196 Comm: ifconfig Not tainted 5.15.37-dirty #94
Hardware name: Freescale i.MX6 SoloX (Device Tree)
PC is at kszphy_config_reset+0x10/0x114
LR is at kszphy_resume+0x24/0x64
...
The KSZ8061 phy_driver structure does not have the .probe/..driver_data
fields, which means that priv is not allocated.
This causes the NULL pointer dereference inside kszphy_config_reset().
Fix the problem by using the generic suspend/resume functions as before.
Another alternative would be to provide the .probe and .driver_data
information into the structure, but to be on the safe side, let's
just restore Ethernet functionality by using the generic suspend/resume.
Cc: stable@vger.kernel.org
Fixes: f1131b9c23 ("net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices")
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220504143104.1286960-1-festevam@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Fix a race when we were calling folio_next() in the BIO folio iter
without holding a reference, meaning the folio could be split or freed,
and we'd jump to the next page instead of the intended next folio.
- Fix readahead creating single-page folios instead of the intended
large folios when doing reads that are not a power of two in size.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmJ0Xu4ACgkQDpNsjXcp
gj4rTAf/Rp2P9jwnOCN9X78YBiydkHq9dtIYbEz1jhOr2pnbz/ZWOeWvVvTBgG5I
GSIeaK3dhCBqi6G28QrQR1j1+gOWOJOs/rmJtkkOgBfoGsCL8HLFzcbXR10zeF2K
8bhivsq5tshn2DiVu8WK1W2n25mg4k7ORrBVcuUtW4Am8EPsyJpzoSWBTlZJvClt
Re9mIkbWNWktEyRiMl8wA4WRKqysaIWBuf9jugaOrv0Y0Db2TqiqYiAG6xm3VSZy
ABf8ZSOyNuxF6ZrW2tUjwdnJ6oDXjVB3Dykw4EQMFQ6uINJPBArj8AkDUe4FJa2w
9FmDLDxR1T4k9+8cEC6ZkkVb6KyvdQ==
=KKsJ
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache
Pull folio fixes from Matthew Wilcox:
"Two folio fixes for 5.18.
Darrick and Brian have done amazing work debugging the race I created
in the folio BIO iterator. The readahead problem was deterministic, so
easy to fix.
- Fix a race when we were calling folio_next() in the BIO folio iter
without holding a reference, meaning the folio could be split or
freed, and we'd jump to the next page instead of the intended next
folio.
- Fix readahead creating single-page folios instead of the intended
large folios when doing reads that are not a power of two in size"
* tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
mm/readahead: Fix readahead with large folios
block: Do not call folio_next() on an unreferenced folio
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for
delayed works queued in rds_wq might be invoked after a net namespace's
refcount already reached 0.
Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq),
it is guaranteed that we can instead use maybe_get_net() from delayed work
functions until rds_tcp_exit_net() returns.
Note that I'm not convinced that all works which might access a net
namespace are already queued in rds_wq by the moment rds_tcp_exit_net()
calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net()
will fail to wait for work functions, and kmem_cache_free() could be
called from net_free() before maybe_get_net() is called from
rds_tcp_tune().
Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: 3a58f13a88 ("net: rds: acquire refcount on TCP sockets")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable,
yet some cases still use req->sr_msg which sr points to. Use 'sr'
consistently.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg,
then we arm poll first rather than attempt a receive or send upfront.
This can be useful if we expect there to be no data (or space) available
for the request, as we can then avoid wasting time on the initial
issue attempt.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't punt this check to the op prep handlers, add the support to
io_op_defs and we can check them while setting up the request.
This reduces the text size by 500 bytes on aarch64, and makes this less
fragile by having the check in one spot and needing opcodes to opt in
to IOPOLL or ioprio support.
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch net callers to the new API not requiring
the NAPI_POLL_WEIGHT argument.
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Link: https://lore.kernel.org/r/20220504163725.550782-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
powerpc's asm/prom.h includes some headers that it doesn't
need itself.
In order to clean powerpc's asm/prom.h up in a further step,
first clean all files that include asm/prom.h
Some files don't need asm/prom.h at all. For those ones,
just remove inclusion of asm/prom.h
Some files don't need any of the items provided by asm/prom.h,
but need some of the headers included by asm/prom.h. For those
ones, add the needed headers that are brought by asm/prom.h at
the moment and remove asm/prom.h
Some files really need asm/prom.h but also need some of the
headers included by asm/prom.h. For those one, leave asm/prom.h
but also add the needed headers so that they can be removed
from asm/prom.h in a later step.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
powerpc's <asm/prom.h> includes some headers that it doesn't
need itself.
In order to clean powerpc's <asm/prom.h> up in a further step,
first clean all files that include <asm/prom.h>
sungem_phy.c doesn't use any object provided by <asm/prom.h>.
But removing inclusion of <asm/prom.h> leads to the following
errors:
CC drivers/net/sungem_phy.o
drivers/net/sungem_phy.c: In function 'bcm5421_init':
drivers/net/sungem_phy.c:448:42: error: implicit declaration of function 'of_get_parent'; did you mean 'dget_parent'? [-Werror=implicit-function-declaration]
448 | struct device_node *np = of_get_parent(phy->platform_data);
| ^~~~~~~~~~~~~
| dget_parent
drivers/net/sungem_phy.c:448:42: warning: initialization of 'struct device_node *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
drivers/net/sungem_phy.c:450:35: error: implicit declaration of function 'of_get_property' [-Werror=implicit-function-declaration]
450 | if (np == NULL || of_get_property(np, "no-autolowpower", NULL))
| ^~~~~~~~~~~~~~~
Remove <asm/prom.h> from included headers but add <linux/of.h> to
handle the above.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/r/f7a7fab3ec5edf803d934fca04df22631c2b449d.1651662885.git.christophe.leroy@csgroup.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Drop unused 'max-link-speed' in Apple PCIe
- More redundant 'maxItems/minItems' schema fixes
- Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
- Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
- Add missing 'power-domains' property to Cadence UFSHC
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmJz6UAQHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw1wzEACPQm9LVVSFxeevPMPfWEO3CqYdVYaLt2w9
AnQHng8OkUJkN5lEdhQdyrYpJpvYlqVJVx7FeXr839nBi+qzdCszeWEu3WH6+Tw3
HNp43QarhRAi5XyV87jBFQeuFONBBHEvOKuoqE8jQ/aFZ6hEOdrSQ9JVpcC5LPOK
dvokPcHaKQElWVa1oG4hqJjoEHvTQNo391+L/PmCxLBlIIWkSX/BtEnXKioes0nm
EXoYoMNQHsH3RtZThGb+NpafOgLE8oRODTJx+Q1zUDw9MEJHQCXEBv4Y+ene1pkg
KkMaPGXNQbI6XQaVSBXHXHkx1vh6BE400MlHpOpeXau+psjYzEF96ePpe2GYq3GY
TRFGvIj6W/We5VLdVlJ94CIdZAs1qSYQ0qIgzbg5pj7SK7KNLojkZiozg+qa5xXL
5Z1VQGmaTMqLukUmLjFpF63Zdxm2AmgifbQTO1eZTFC9xPB3mKoIuIXojr/5g84u
Jc05xNz/mbCnOwN3WNYLqc4bVS2bDhutfpDJOnn47RHI3vBhYiY1Xohe+Vm+Xu5s
6hKKxTn/2QW0dubn9/5BLfvNuH9/3jg0Pw0f/orJcBdI1OuLKCuNhAO4PTjFOAAZ
6wy5W5GOGwE7HgJGdb9BNQY0FHmaxL9lMWQrUuJ6yF7ghNKI2gO1MFWnzjYbsMaC
V91pihSiCg==
=fpGC
-----END PGP SIGNATURE-----
Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Drop unused 'max-link-speed' in Apple PCIe
- More redundant 'maxItems/minItems' schema fixes
- Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
- Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
- Add missing 'power-domains' property to Cadence UFSHC
* tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: pci: apple,pcie: Drop max-link-speed from example
dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
dt-bindings: ufs: cdns,ufshc: Add power-domains
The commit referenced in the "Fixes" tag added the SO_RCVMARK socket
option for receiving the skb mark in the ancillary data.
Since this is a new capability, and exposes admin configured details
regarding the underlying network setup to sockets, let's align the
needed capabilities with those of SO_MARK.
Fixes: 6fd1d51cfa ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Link: https://lore.kernel.org/r/20220504095459.2663513-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous patch introduced new failure cases in the HuC init flow
that can be hit by simply changing the config, so we want to avoid
failing the probe in those scenarios. HuC load failure is already
considered a non-fatal error and we have a way to report to userspace
if the HuC is not available via a dedicated getparam, so no changes
in expectation there.
The error message in the HuC init code has also been lowered to info to
avoid throwing error message for an expected behavior.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504204816.2082588-5-daniele.ceraolospurio@intel.com
HuC loading via GSC is performed via a PXP command sent through the mei
modules, so we need both MEI_GSC and MEI_PXP to be available. Given that
the GSC will do both the transfer and the authentication, the legacy HuC
loading paths can be safely skipped.
Also note that the GSC-loaded HuC survives GT reset.
v2: move the huc_is_authenticated() function to this patch.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504204816.2082588-4-daniele.ceraolospurio@intel.com
On newer platforms (starting DG2 G10 B-step and G11 A-step), ownership of
HuC loading has been moved from the GuC to the GSC. As part of the
change, the header format of the HuC binary has been updated and does not
match the GuC anymore. The GSC will perform all the required checks on
the binary size, so we only need to check that the version matches.
Note that since we still haven't added any gsc-loaded FWs, the
loaded_via_gsc variable will always be kept to its initialization value
of zero.
v2: Add a note about loaded_via_gsc being zero (Alan)
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504204816.2082588-3-daniele.ceraolospurio@intel.com
The function name is confusing, because it doesn't check the actual auth
status in HW but the SW status. Given that there is only one user (the
huc_auth function itself), just get rid of it and use the FW status
checker directly.
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504204816.2082588-2-daniele.ceraolospurio@intel.com
In GuC submission mode the EU priority must be updated by the GuC rather
than the driver as the GuC owns the programming of the context descriptor.
Given that the GuC code uses the GuC priorities, we can't use a generic
function using i915 priorities for both execlists and GuC submission.
The existing function has therefore been pushed to the execlists
back-end while a new one has been added for GuC.
v2: correctly use the GuC prio.
Cc: John Harrison <john.c.harrison@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220504234636.2119794-1-daniele.ceraolospurio@intel.com
IMA may verify a file's integrity against a "good" value stored in the
'security.ima' xattr or as an appended signature, based on policy. When
the "good value" is stored in the xattr, the xattr may contain a file
hash or signature. In either case, the "good" value is preceded by a
header. The first byte of the xattr header indicates the type of data
- hash, signature - stored in the xattr. To support storing fs-verity
signatures in the 'security.ima' xattr requires further differentiating
the fs-verity signature from the existing IMA signature.
In addition the signatures stored in 'security.ima' xattr, need to be
disambiguated. Instead of directly signing the fs-verity digest, a new
signature format version 3 is defined as the hash of the ima_file_id
structure, which identifies the type of signature and the digest.
The IMA policy defines "which" files are to be measured, verified, and/or
audited. For those files being verified, the policy rules indicate "how"
the file should be verified. For example to require a file be signed,
the appraise policy rule must include the 'appraise_type' option.
appraise_type:= [imasig] | [imasig|modsig] | [sigv3]
where 'imasig' is the original or signature format v2 (default),
where 'modsig' is an appended signature,
where 'sigv3' is the signature format v3.
The policy rule must also indicate the type of digest, if not the IMA
default, by first specifying the digest type:
digest_type:= [verity]
The following policy rule requires fsverity signatures. The rule may be
constrained, for example based on a fsuuid or LSM label.
appraise func=BPRM_CHECK digest_type=verity appraise_type=sigv3
Acked-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Read/write/flush are the most common operations, optimize switch in
is_abnormal_io() for those cases. Follows same pattern established in
block perf-wip commit ("block: optimise blk_may_split for normal rw")
Also, push is_abnormal_io() check and blk_queue_split() down from
dm_submit_bio() to dm_split_and_process_bio() and set new
'is_abnormal_io' flag in clone_info. Optimize __split_and_process_bio
and __process_abnormal_io by leveraging ci.is_abnormal_io flag.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Now that io splitting is recorded prior to, or during, ->map IO
accounting can happen immediately rather than defer until after
bio splitting in dm_split_and_process_bio().
Remove the DM_IO_START_ACCT flag and also remove dm_io's map_task
member because there is no longer any need to wait for splitting to
occur before accounting.
Also move dm_io struct's 'flags' member to consolidate struct holes.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Now that bio_split() isn't used by DM's bio splitting, it is a bit
overkill to link dm_io into an hlist given there is only single dm_io
in the list.
Convert to using a single list for holding all dm_io instances
associated with this bio.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Currently each dm_io's reference counter is grabbed before calling
__map_bio(), this way isn't efficient since we can move this grabbing
to initialization time inside alloc_io().
Meantime it becomes typical async io reference counter model: one is
for submission side, the other is for completion side, and the io won't
be completed until both sides are done.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm_zone_map_bio() is only called from __map_bio in which the io's
reference is grabbed already, and the reference won't be released
until the bio is submitted, so not necessary to do it dm_zone_map_bio
any more.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
The current DM code (ab)uses late assignment of dm_io->orig_bio (after
__map_bio() returns and any bio splitting is complete) to indicate the
FS bio has been processed and can be accounted. This results in
awkward waiting until ->orig_bio is set in dm_submit_bio_remap().
Also the bio splitting was implemented using bio_split()+bio_chain()
-- a well-worn pattern but it requires bio cloning purely for the
benefit of more natural IO accounting. The bio_split() result was
stored in ->orig_bio to represent the mapped part of the original FS
bio.
DM has switched to the bdev based IO accounting interface. DM's IO
accounting can be implemented in terms of the original FS bio (now
stored early in ->orig_bio) via access to its sectors/bio_op. And
if/when splitting is needed, set a new DM_IO_WAS_SPLIT flag and use
new dm_io fields of .sector_offset & .sectors to allow IO accounting
for split bios _without_ needing to clone a new bio to store in
->orig_bio.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
DM splits flush with data into empty flush followed by bio with data
payload, switch dm_io_acct() to use bdev_{start,end}_io_acct() to do
this accoiunting more naturally (rather than temporarily changing the
bio's bi_size).
This will allow DM to more easily account bios that are split (in
following commit).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
All the other 4 parameters are retrieved from the 'dm_io' instance, so
it's not necessary to pass all four to dm_io_acct().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
dm->orig_bio is always passed to __dm_start_io_acct and dm_end_io_acct,
so it isn't necessary to take one bio parameter for the two helpers.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Rename 'bi_size' to 'bio_sectors' given bi_size is being stored in
sectors. Also, use bio_sectors() rather than open-coding it.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Use jump_labels to further reduce cost of unlikely branches for zoned
block devices, dm-stats and swap_bios throttling.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>