TCP packets will be dropped if the segments number in the tx skb
exceeds limitation when sending iperf3 traffic with --zerocopy option.
we make the following changes:
Get nr_frags in nfp_nfdk_tx_maybe_close_block instead of passing from
outside because it will be changed after skb_linearize operation.
Fill maximum dma_len in first tx descriptor to make sure the whole
head is included in the first descriptor.
Fixes: c10d12e3dc ("nfp: add support for NFDK data path")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau says:
====================
mptcp: Self test improvements and a header tweak
Patch 1 moves a definition to a header so it can be used in a struct
declaration.
Patch 2 adjusts a time threshold for a selftest that runs much slower on
debug kernels (and even more on slow CI infrastructure), to reduce
spurious failures.
Patches 3 & 4 improve userspace PM test coverage.
Patches 5 & 6 clean up output from a test script and selftest helper
tool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The usage header of pm_nl_ctl command doesn't match with the context. So
this patch adds the missing userspace PM keywords 'ann', 'rem', 'csf',
'dsf', 'events' and 'listen' in it.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There're some 'Terminated' messages in the output of userspace pm tests
script after killing './pm_nl_ctl events' processes:
Created network namespaces ns1, ns2 [OK]
./userspace_pm.sh: line 166: 13735 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13737 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv4 MPTCP Connection ns2 => ns1 [OK]
./userspace_pm.sh: line 166: 13753 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13755 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv6 MPTCP Connection ns2 => ns1 [OK]
ADD_ADDR 10.0.2.2 (ns2) => ns1, invalid token [OK]
This patch adds a helper kill_wait(), in it using 'wait $pid 2>/dev/null'
commands after 'kill $pid' to avoid printing out these Terminated messages.
Use this helper instead of using 'kill $pid'.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace pm subflow tests support for mptcp_join.sh
script. Add userspace pm create subflow and destroy test cases in
userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace pm tests support for mptcp_join.sh script. Add
userspace pm add_addr and rm_addr test cases in userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The mentioned test measures the transfer run-time to verify
that the user-space program is able to use the full aggregate B/W.
Even on (virtual) link-speed-bound tests, debug kernel can slow
down the transfer enough to cause sporadic test failures.
Instead of unconditionally raising the maximum allowed run-time,
tweak when the running kernel is a debug one, and use some simple/
rough heuristic to guess such scenarios.
Note: this intentionally avoids looking for /boot/config-<version> as
the latter file is not always available in our reference CI
environments.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move macro MPTCPOPT_HMAC_LEN definition from net/mptcp/protocol.h to
include/net/mptcp.h.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some Intel processors may use alternate predictors for RETs on
RSB-underflow. This condition may be vulnerable to Branch History
Injection (BHI) and intramode-BTI.
Kernel earlier added spectre_v2 mitigation modes (eIBRS+Retpolines,
eIBRS+LFENCE, Retpolines) which protect indirect CALLs and JMPs against
such attacks. However, on RSB-underflow, RET target prediction may
fallback to alternate predictors. As a result, RET's predicted target
may get influenced by branch history.
A new MSR_IA32_SPEC_CTRL bit (RRSBA_DIS_S) controls this fallback
behavior when in kernel mode. When set, RETs will not take predictions
from alternate predictors, hence mitigating RETs as well. Support for
this is enumerated by CPUID.7.2.EDX[RRSBA_CTRL] (bit2).
For spectre v2 mitigation, when a user selects a mitigation that
protects indirect CALLs and JMPs against BHI and intramode-BTI, set
RRSBA_DIS_S also to protect RETs for RSB-underflow case.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
All the invocations unroll to __x86_return_thunk and this file
must be PIC independent.
This fixes kexec on 64-bit AMD boxes.
[ bp: Fix 32-bit build. ]
Reported-by: Edward Tran <edward.tran@oracle.com>
Reported-by: Awais Tanveer <awais.tanveer@oracle.com>
Suggested-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
NFPROTO_ARP is expecting to find the ARP header at the network offset.
In the particular case of ARP, HTYPE= field shows the initial bytes of
the ethernet header destination MAC address.
netdev out: IN= OUT=bridge0 MACSRC=c2:76:e5:71:e1:de MACDST=36:b0:4a:e2:72:ea MACPROTO=0806 ARP HTYPE=14000 PTYPE=0x4ae2 OPCODE=49782
NFPROTO_NETDEV egress hook is also expecting to find the IP headers at
the network offset.
Fixes: 35b9395104 ("netfilter: add generic ARP packet logger")
Reported-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Martin Blumenstingl says:
====================
selftests: forwarding: Install two missing tests
For some distributions (e.g. OpenWrt) we don't want to rely on rsync
to copy the tests to the target as some extra dependencies need to be
installed. The Makefile in tools/testing/selftests/net/forwarding
already installs most of the tests.
This series adds the two missing tests to the list of installed tests.
That way a downstream distribution can build a package using this
Makefile (and add dependencies there as needed).
====================
Link: https://lore.kernel.org/r/20220707135532.1783925-1-martin.blumenstingl@googlemail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When using the Makefile from tools/testing/selftests/net/forwarding/
all tests should be installed. Add no_forwarding.sh to the list of
"to be installed tests" where it has been missing so far.
Fixes: 476a4f05d9 ("selftests: forwarding: add a no_forwarding.sh test")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When using the Makefile from tools/testing/selftests/net/forwarding/
all tests should be installed. Add local_termination.sh to the list of
"to be installed tests" where it has been missing so far.
Fixes: 90b9566aa5 ("selftests: forwarding: add a test for local_termination.sh")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When we are operating in SGMII inband mode, it implies that there is a
PHY connected, and the ethtool advertisement for autoneg applies to
the PHY, not the SGMII link. When in 1000base-X mode, then this applies
to the 802.3z link and needs to be applied to the PCS.
Fix this.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/E1o9Ng2-005Qbe-3H@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A description is missing for the net.core.high_order_alloc_disable
option in admin-guide/sysctl/net.rst ; add it. The above sysctl option
was introduced by commit ce27ec6064 ("net: add high_order_alloc_disable
sysctl/static key").
Thanks to Eric for running again the benchmark cited in the above
commit, showing this knob is now mostly of historical importance.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220707080245.180525-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building with Clang we encounter this warning:
| net/rxrpc/rxkad.c:434:33: error: format specifies type 'unsigned short'
| but the argument has type 'u32' (aka 'unsigned int') [-Werror,-Wformat]
| _leave(" = %d [set %hx]", ret, y);
y is a u32 but the format specifier is `%hx`. Going from unsigned int to
short int results in a loss of data. This is surely not intended
behavior. If it is intended, the warning should be suppressed through
other means.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
Link: https://github.com/ClangBuiltLinux/linux/issues/378
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220707182052.769989-1-justinstitt@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
tls: pad strparser, internal header, decrypt_ctx etc.
A grab bag of non-functional refactoring to make the series
which will let us decrypt into a fresh skb smaller.
Patches in this series are not strictly required to get the
decryption into a fresh skb going, they are more in the "things
which had been annoying me for a while" category.
====================
Link: https://lore.kernel.org/r/20220708010314.1451462-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tls_wait_data() sets the return code as an output parameter
and always returns ctx->recv_pkt on success.
Return the error code directly and let the caller read the skb
from the context. Use positive return code to indicate ctx->recv_pkt
is ready.
While touching the definition of the function rename it.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
include/net/tls.h is getting a little long, and is probably hard
for driver authors to navigate. Split out the internals into a
header which will live under net/tls/. While at it move some
static inlines with a single user into the source files, add
a few tls_ prefixes and fix spelling of 'proccess'.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The max size of iv + aad + tail is 22B. That's smaller
than a single sg entry (32B). Don't bother with the
memory packing, just create a struct which holds the
max size of those members.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
AAD size is either 5 or 13. Really no point complicating
the code for the 8B of difference. This will also let us
turn the chunked up buffer into a sane struct.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk_skb_cb lives within skb->cb[]. skb->cb[] straddles
2 cache lines, each containing 24B of data.
The first cache line does not contain much interesting
information for users of strparser, so pad things a little.
Previously strp_msg->full_len would live in the first cache
line and strp_msg->offset in the second.
We need to reorder the 8 byte temp_reg with struct tls_msg
to prevent a 4B hole which would push the struct over 48B.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLInYwACgkQ+7dXa6fL
C2tbmxAAlBkMvoIeyOlkMeAB3l1aK5CAYmgddjWOxY6UHNIGgXqWlGJCoxYngTmy
j+5Uz1mMNh7TamSJpCTXUBbiLWxzoHI34oie+NT9VUXTTVw0/GQ//jQPlvE+HeJG
Hmi0HR4lby1O6X9r63obzvDctJPhsVLdgJFtZ3UidwzgY2zYnScxwyCuwSfLUVnF
YJACZWPFj1oWxOiS2bQZgnvCJpfJ6J0I5asmO0lI1tQDS0o4RoiRvDPTx8Z6K3KO
8CSm5tvu+vYrbZ5aNCN9f0MZfNl+YZiNQyxEiIP9Phu9iZFRe9UgXvDrE9346uQM
/Vxa5nvyj+D0Hg+l1jBFXvwoKxLHTZ/BaK74h2/QzRKgxKrHTAyYOgxqt7U/YCsa
S489eHOGNwbqIKACljBcMX9YZeCdeixt4aK/VIRr+d1f2Ui30Z05ExuYOfGn4wBq
VcSsBgAALwcyqrpi61URYvKduHTa2D8ATKNKXNM/i1Wv6KQDeom5jyKj3mE7UNsI
WAgWE2xy74kjYs1Lc0izsfieyZrt1MqZiyYFjS86Cnt0wVmdAQYrQ3wciN5D9IJN
MiUYuienGRf1MNM2tmFK+N7e+pa/waKcl15/d3Zal94rlATpR2bFOI1zJ9qJ8rpi
k8B/eydAZx+oUthxWvfIdLwIREJzA49jzBHgfAy/CMjqPCj7gZ8=
=FDkz
-----END PGP SIGNATURE-----
Merge tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache fixes from David Howells:
- Fix a check in fscache_wait_on_volume_collision() in which the
polarity is reversed. It should complain if a volume is still marked
acquisition-pending after 20s, but instead complains if the mark has
been cleared (ie. the condition has cleared).
Also switch an open-coded test of the ACQUIRE_PENDING volume flag to
use the helper function for consistency.
- Not a fix per se, but neaten the code by using a helper to check for
the DROPPED state.
- Fix cachefiles's support for erofs to only flush requests associated
with a released control file, not all requests.
- Fix a race between one process invalidating an object in the cache
and another process trying to look it up.
* tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Fix invalidation/lookup race
cachefiles: narrow the scope of flushed requests when releasing fd
fscache: Introduce fscache_cookie_is_dropped()
fscache: Fix if condition in fscache_wait_on_volume_collision()
When CONFIG_NF_CONNTRACK=m, struct bpf_ct_opts and enum member
BPF_F_CURRENT_NETNS are not exposed. This commit allows building the
xdp_synproxy selftest in such cases. Note that nf_conntrack must be
loaded before running the test if it's compiled as a module.
This commit also allows this selftest to be successfully compiled when
CONFIG_NF_CONNTRACK is disabled.
One unused local variable of type struct bpf_ct_opts is also removed.
Fixes: fb5cd0ce70 ("selftests/bpf: Add selftests for raw syncookie helpers")
Reported-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220708130319.1016294-1-maximmi@nvidia.com
This change addresses a comment made earlier [0] about a missing return
of an error when __bpf_core_types_match is invoked from
bpf_core_composites_match, which could have let to us erroneously
ignoring errors.
Regarding the typedef name check pointed out in the same context, it is
not actually an issue, because callers of the function perform a name
check for the root type anyway. To make that more obvious, let's add
comments to the function (similar to what we have for
bpf_core_types_are_compat, which is called in pretty much the same
context).
[0]: https://lore.kernel.org/bpf/165708121449.4919.13204634393477172905.git-patchwork-notify@kernel.org/T/#m55141e8f8cfd2e8d97e65328fa04852870d01af6
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220707211931.3415440-1-deso@posteo.net
Daniel Borkmann says:
====================
bpf 2022-07-08
We've added 3 non-merge commits during the last 2 day(s) which contain
a total of 7 files changed, 40 insertions(+), 24 deletions(-).
The main changes are:
1) Fix cBPF splat triggered by skb not having a mac header, from Eric Dumazet.
2) Fix spurious packet loss in generic XDP when pushing packets out (note
that native XDP is not affected by the issue), from Johan Almbladh.
3) Fix bpf_dynptr_{read,write}() helper signatures with flag argument before
its set in stone as UAPI, from Joanne Koong.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs
bpf: Make sure mac_header was set before using it
xdp: Fix spurious packet loss in generic XDP TX path
====================
Link: https://lore.kernel.org/r/20220708213418.19626-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CI reported the following splat while running the strace testsuite:
[ 3976.640309] WARNING: CPU: 1 PID: 3570031 at kernel/ptrace.c:272 ptrace_check_attach+0x12e/0x178
[ 3976.640391] CPU: 1 PID: 3570031 Comm: strace Tainted: G OE 5.19.0-20220624.rc3.git0.ee819a77d4e7.300.fc36.s390x #1
[ 3976.640410] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[ 3976.640452] Call Trace:
[ 3976.640454] [<00000000ab4b645a>] ptrace_check_attach+0x132/0x178
[ 3976.640457] ([<00000000ab4b6450>] ptrace_check_attach+0x128/0x178)
[ 3976.640460] [<00000000ab4b6cde>] __s390x_sys_ptrace+0x86/0x160
[ 3976.640463] [<00000000ac03fcec>] __do_syscall+0x1d4/0x200
[ 3976.640468] [<00000000ac04e312>] system_call+0x82/0xb0
[ 3976.640470] Last Breaking-Event-Address:
[ 3976.640471] [<00000000ab4ea3c8>] wait_task_inactive+0x98/0x190
This is because JOBCTL_TRACED is set, but the task is not in TASK_TRACED
state. Caused by ptrace_unfreeze_traced() which does:
task->jobctl &= ~TASK_TRACED
but it should be:
task->jobctl &= ~JOBCTL_TRACED
Fixes: 31cae1eaae ("sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lkml.kernel.org/r/20220706101625.2100298-1-svens@linux.ibm.com
Link: https://lkml.kernel.org/r/YrHA5UkJLornOdCz@li-4a3a4a4c-28e5-11b2-a85c-a8d192c6f089.ibm.com
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2101641
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Tested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Syzkaller reports the following crash:
RIP: 0010:check_return_code kernel/bpf/verifier.c:10575 [inline]
RIP: 0010:do_check kernel/bpf/verifier.c:12346 [inline]
RIP: 0010:do_check_common+0xb3d2/0xd250 kernel/bpf/verifier.c:14610
With the following reproducer:
bpf$PROG_LOAD_XDP(0x5, &(0x7f00000004c0)={0xd, 0x3, &(0x7f0000000000)=ANY=[@ANYBLOB="1800000000000019000000000000000095"], &(0x7f0000000300)='GPL\x00', 0x0, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, 0x2b, 0xffffffffffffffff, 0x8, 0x0, 0x0, 0x10, 0x0}, 0x80)
Because we don't enforce expected_attach_type for XDP programs,
we end up in hitting 'if (prog->expected_attach_type == BPF_LSM_CGROUP'
part in check_return_code and follow up with testing
`prog->aux->attach_func_proto->type`, but `prog->aux->attach_func_proto`
is NULL.
Add explicit prog_type check for the "Note, BPF_LSM_CGROUP that
attach ..." condition. Also, don't skip return code check for
LSM/STRUCT_OPS.
The above actually brings an issue with existing selftest which
tries to return EPERM from void inet_csk_clone. Fix the
test (and move called_socket_clone to make sure it's not
incremented in case of an error) and add a new one to explicitly
verify this condition.
Fixes: 69fd337a97 ("bpf: per-cgroup lsm flavor")
Reported-by: syzbot+5cc0730bd4b4d2c5f152@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220708175000.2603078-1-sdf@google.com
- Prevent _CPC from being used if the platform firmware does not
confirm CPPC v2 support via _OSC (Mario Limonciello).
- Allow systems with X86_FEATURE_CPPC set to use _CPC even if CPPC
support cannot be agreed on via _OSC (Mario Limonciello).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmLIgTwSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxmDAP/1OEgChbdJNvhg7no4BHEwSyuArAVcRD
+D5bCiPq8x5dSNke3BrN4n789fhsHphXucqI5yTq7lX1AonyO1EHkc+oG+EGrZ+8
dyw+6Nx4WdJkOLKrEwQm3QQYuBK/k4pCUwhWBBg5FEsn4QldBE2pzEC8aP7gNgN6
gAxs+6nC7r/9fMoRo8+SG9Xrr1FH1mddru4WTp6croLNrlLVXTaoAHwXMB4qk8in
JC8bvfnWSWXn05fvJ+nYf8wb93tjBsuyx/Om7kmP7V1WT30fidv+BdP50x6kFq9/
nE0iTAAgqiO1SpRpUHfYc63GUvBC9d9ZUsbSyT7fMEqZCqLRsKukUA5xFxgQORBi
fKWVVSjOaQUFk0eeivdqWTgpaIRkjxVCmLFxGN/vY3T1pqAmJIdoY5x7ZzxuOrBW
6h1zOT0Qm9LURrlnXhcFHXALXFAGVpYCoP4X0PhNi9fZCSrjraG77dar0fj5KgHv
pUnLTL1qHxrzX/Ax7LJ/Usc+mKO8s/Wd790CvtFElHIsH5p7Bx/l3BDdQHbGK2l2
ToO9WQQCumccXs6SHCVYQJIH1pr0UIB2X15FNrmLVem8WkEBFJ7DwrZ/nitenOV3
c9BsQr+rJ3s9S9aSYvIAadurPzC6x5lKu0GKMUfM2GUl/vVcFrzkzlAATD467/IC
Z7gsfL1UtOwx
=yWsx
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix two recent regressions related to CPPC support.
Specifics:
- Prevent _CPC from being used if the platform firmware does not
confirm CPPC v2 support via _OSC (Mario Limonciello)
- Allow systems with X86_FEATURE_CPPC set to use _CPC even if CPPC
support cannot be agreed on via _OSC (Mario Limonciello)"
* tag 'acpi-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: CPPC: Don't require _OSC if X86_FEATURE_CPPC is supported
ACPI: CPPC: Only probe for _CPC if CPPC v2 is acked
- Fix NULL pointer dereference related to printing a diagnostic
message in the exynos-bus devfreq driver (Christian Marangi).
- Fix race condition in the runtime PM framework which in some
cases may cause a supplier device to be suspended when its
consumer is still active (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmLIgKMSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxNgAP/R4t7jx+LyeQbX9JzdY/bA6NTGIdkMbJ
RhA9Nc3Sbsb+KXulsccq108O4CZwjiCEngvZHkOJJtzgRzLL7jxV6J63tNfLLS60
oJS4EAOAZ0c1JiO/Fh9G/c6stKdPAajs4PNSHpGcsq8m1e8CktmWPqW40OPD6kDk
YbkdMkO/7KJtOBX4IwzF67lYMQsulUGi2TAsq6sxsC00KA2A1CoXTWzoamDKtOD3
8y5JKKRCg7nYhZ7/qxy1zlnhHFJGzVppWbSv98DdZdvsKyizAkk7agBOP4Ld3iPT
ZtA1L0Pgjx6qUoQ7D7rUgNrmnHGCk77aluAkg/gtS2GI61bvbtT3apyYusJVC0jg
IKI3MHZZTwqdsEyoBxxWXuRciv1HXXrcwSMB3S9pBAwZEES+GwDFO9MwStkAv8zi
acceNirUiawODs6qNZjS8Ycsaz56DHCjcsBBCvmP5NnNA8kFdqxB4muz1SRmqubE
N9BPNDLFCoZArUkwg35QL9EZZLYjLjf5YtF9Xe3lPCaQdbPaIc+4MfXkEp4HHYC3
V1vvbyZbkq/xSNNCjXRoF4n1x2IZvz7SpsdPjxqADpVNxpL/d/dET8/NBXGxAwJE
KLh/iElyuIL17mgqH/fXpYNI6bqphxhe+sasH/nFpKOewdEQDtnYqmz6GQQGH469
/42eiBC4saNB
=iyaY
-----END PGP SIGNATURE-----
Merge tag 'pm-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a NULL pointer dereference in a devfreq driver and a runtime
PM framework issue that may cause a supplier device to be suspended
before its consumer.
Specifics:
- Fix NULL pointer dereference related to printing a diagnostic
message in the exynos-bus devfreq driver (Christian Marangi)
- Fix race condition in the runtime PM framework which in some cases
may cause a supplier device to be suspended when its consumer is
still active (Rafael Wysocki)"
* tag 'pm-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / devfreq: exynos-bus: Fix NULL pointer dereference
PM: runtime: Fix supplier device management during consumer probe
PM: runtime: Redefine pm_runtime_release_supplier()
Including:
- Patch for the Intel VT-d driver to fix device setup failures
when the PASID table is shared
- Fix for Intel VT-d device hot-add failure due to wrong device
notifier order
- Remove the old IOMMU mailing list from the MAINTAINERS file
now that it has been retired
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmLIUbYACgkQK/BELZcB
GuPSYBAAo5HfhNcdBaecojN/bPxvOpZBm00Tta1XQ6KnEKGiD+qgE3EhZZgmeaEq
yrpomHhGzb6Der+eq0mdKSYGTY4RTPFbwTpmZh/VtCm5aLnhjeyjgk5y3fFDlsiU
wZfKr1BrBZvJ+yAgpAdvIKsZEjsNyK+2cwvLz36WlMHqzMtNtqaU0OF3oxF+DC08
CzFOlRSqOiyp2Ct4FP9yEXU8sVwvqnHQbuekn8L+loQB5zm5SgIOVy7ttb4idPEn
Q2Kg6imxRVSA8MXvMR8OleQrnOTZrMZVr5OuMbsFPQ2H9RxX7Bq92AMPSReYGRhi
ep6WOVaAq14PxlL4y4HyEd6RYkiNYyz0hdnoXgezJF3k0hoG6hZLdG/SSK5l7zGY
+hrMmKgnF7uPz1M7YFenoy5p5EudzRBafIRgXrOZNZoXUdwNhh8hqJfjRev4op/y
t8Mt1j6goX9icUonPK5SN8MrDo8EaQj0BgC0d2ihScA3uOhVGSfCzLfhNDkGg5Jy
b936bgBzwv8Q7kBORPMmzK1+sNu28/NJnSmURGeSGGwZnsSOEd/GRjZkf7iGc9u7
RrRpW4dH3RlwcUY/ug6UcGqq7xaQPSS//PIVQ/dmjCjX32Xk95jurt1v/hLf0Cm8
eXSp1Q69kQAiNsu4uvZMfYRsVzh9Z2+UgdpoVXQH20ct94M92vA=
=KPue
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- fix device setup failures in the Intel VT-d driver when the PASID
table is shared
- fix Intel VT-d device hot-add failure due to wrong device notifier
order
- remove the old IOMMU mailing list from the MAINTAINERS file now that
it has been retired
* tag 'iommu-fixes-v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
MAINTAINERS: Remove iommu@lists.linux-foundation.org
iommu/vt-d: Fix RID2PASID setup/teardown failure
iommu/vt-d: Fix PCI bus rescan device hot add
- fix a build error in gpio-vf610
- fix a null-pointer dereference in the GPIO character device code
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmLIQ2YACgkQEacuoBRx
13KmMRAAqP0dMz9xocZWIif2feUYy4EAHGidIh0AN2ehTW0urmb6e8KAzOL4t0h6
O7ODeCHXvtM7xGqmeFPveyY+yygSfB1oitQQWawePpUHtG0jHicG6gtV70S7k0BA
MZbBhWHmJ0iC0s3x5DPnXlJ2/jqBjaF7UZi4bHVMMMfjB049YVTKQJ3mS/KTEFv8
Wn5GBwSACwSlYYA9opjJNK7fxxU4z7tbiycORLFKoiG66AdOGz7aZeI9Rr06EZ3T
ftkawIfxOtBivvcRdQmqKF3pvYju/aWnSTfCErmuifHP+sSsna8EZke+qvyx2YJp
+Q95QObc/7sghKWjSNz1vvGRdT27kzyJdquopzVDdV5+vM6X7OGS/leSmGTo5m0f
WAt8ZCmyGcTAsK3eL814vhntYls9aTiCeHLYD3i5k+cVym8qHxzGLe8vls8+Mmvy
U/S0obfFHbe2dur4ET2f71vjF4niEJmDwNYTe5tCzTNP/9fWcw8LLMOKfdjLldu7
44ennv+FAKE9eLRrRjQ8CkpXxUl++IV/TEtbTTY780ocSG4Olr951Ns7YQQC4aV3
ro3SzmgBaeO7UDvelCUsxhQT1G3faBmhwaJh57CegGFSKQYN1waCmT/oispz4U0p
08Xa7f2iNnWjBqI4nLWe2vad7+2Lq0SwH9jhp9oIw2qmVGGnz0I=
=Fq0O
-----END PGP SIGNATURE-----
Merge tag 'gpio-fixes-for-v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix a build error in gpio-vf610
- fix a null-pointer dereference in the GPIO character device code
* tag 'gpio-fixes-for-v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: cdev: fix null pointer dereference in linereq_free()
gpio: vf610: fix compilation error
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLIJFEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgIPD/9InvUKGh8y27bZT+y5F7aucRm1wAZGa016
EESTOapGUb0lWo3X3JGDjwfnpUhCWwxsHs5R13ZEdrwEWh1q55TJiCc9MGt2X+NB
f5fJHeoa9hWTlG/iFi7+ElkuTER1TcKqxysL3Gci6JWlQ1ObYVBOZAi9/9v5r5t8
3Cv7KfkhhclVOFpEMio00YTSl54axKToqFoNr0jE10lVXZtZtIsPlRAt8QvG66qA
K10rbznRrKmtKJvzHKx+3PrTj77QIaifcH9bqRu4KWqs2SsoEUIjlVKUkbz8F16g
+YmhfktYaatrPI/G3/tACKkdCi537HdUVyJFRFNw1439+DSETJNi2nQjeVdatsTe
kEL+5RlrUfjUnQpEWUBOmW4T5mNB0sryxSiOjSNQx0X8Mulp6SI7CaNovioI0SWS
7lydZZpb3difQjdrA1iGU8dpLtyDPRPeQZOJV9jPodhMYzr5UB4qTX2NkWTFSnr3
tiIflRdEdbGCVc5E41OVMpIAzdyLQWQDRn3r/3T/hScN99mKKAEv31HoPP4etmRw
M8mvwrJwNRrbOd+1QiAgblh/3ozqafmFEAI3LZBtCbupb1fWG2qaqXer5KPWpCXv
Kor+Q6dGWK0pQFSxoMaFwjDUo7gZiL9RxkmZEpkr+w1rKA3gqlx4b+NlZV7RbJez
2tijGLQtQA==
=UQiB
-----END PGP SIGNATURE-----
Merge tag 'block-5.19-2022-07-08' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"NVMe pull request with another id quirk addition, and a tracing fix"
* tag 'block-5.19-2022-07-08' of git://git.kernel.dk/linux-block:
nvme: use struct group for generic command dwords
nvme-pci: phison e16 has bogus namespace ids
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLIJGcQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgS1D/9E2MxaWaHR+A35AXignJLaXxiLWlzTAwxr
/ioT6omgogbiaoTrxVZMbW+6vAPbUamXA9yqv2fC/4RERBfz3Z6UBLa8Hp6SFCMQ
vb/LSwRwnUuMr/HyF3hLjwISOi4oYs/uo3E7IpOPbpC/e0nDpJnGWx1UfQn0tJSH
WVwZsLbUe8T9fhHA0uSHbMoMSvLQGqfwwY+MET2+j+ZDwMoet194yka22jJwfDbF
l3cnUe2TQh6orRdzuagWX9WmdnWWyQM3DTqW2cSA0hepyxGMWkCMhMuyV5yqUXhD
noHshcyL76h8WQi/BwYDAGYqGy1+FkOkuV3DmVnjHVQory17Ze8ijtImMoEWpkgl
TwTTd2+o0ivcEd0JHLeqLHkTXKUENeUMTpJVuLotLMMdupIvF0jrdNWTWCM7uBto
Q9JxIkEs+16bRqT+yzC4cNuzSQRL6+qQ5jVO5BsNmJoNvs15KN7vgAQ+uR4NCCIv
GqHbTiBVsi7DYipS6jNi/bWnxDtIsNsn48WCdx52OYpd1NbkY3oEHMqBZ6rnsPJH
/uek3VLajRZfG61EMUGlazitQdW3/z31wM0iP8Y8xPyvSNCbsJkFtrNO13jpuy9p
0YRP3peXYb2eEzYpq355RSCCpyActDYp77hjcyYQP3gJcnoViZ6rBUHr5Rw5W6lN
siwWz7aNXg==
=w3SQ
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block
Pull io_uring tweak from Jens Axboe:
"Just a minor tweak to an addition made in this release cycle: padding
a 32-bit value that's in a 64-bit union to avoid any potential
funkiness from that"
* tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block:
io_uring: explicit sqe padding for ioctl commands
fbcon now prevents switching to screen resolutions which are smaller
than the font size, and prevents enabling a font which is bigger than
the current screen resolution. This fixes vmalloc-out-of-bounds accesses
found by KASAN.
Guiling Deng fixed a bug where the centered fbdev logo wasn't displayed
correctly if the screen size matched the logo size.
Hsin-Yi Wang provided a patch to include errno.h to fix build when
CONFIG_OF isn't enabled.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCYsfdCgAKCRD3ErUQojoP
XxhvAP9HYH0XcfHGEIQ3YSyFRY4JpyLb0TcTD7mnrYwPgAw+KAD7Bw9EE7WhGZLC
7iuDn30/mdCFqiz3IuTjoRKYuG2ceg8=
=yrqq
-----END PGP SIGNATURE-----
Merge tag 'for-5.19/fbdev-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes from Helge Deller:
- fbcon now prevents switching to screen resolutions which are smaller
than the font size, and prevents enabling a font which is bigger than
the current screen resolution. This fixes vmalloc-out-of-bounds
accesses found by KASAN.
- Guiling Deng fixed a bug where the centered fbdev logo wasn't
displayed correctly if the screen size matched the logo size.
- Hsin-Yi Wang provided a patch to include errno.h to fix build when
CONFIG_OF isn't enabled.
* tag 'for-5.19/fbdev-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbcon: Use fbcon_info_from_console() in fbcon_modechange_possible()
fbmem: Check virtual screen sizes in fb_set_var()
fbcon: Prevent that screen size is smaller than font size
fbcon: Disallow setting font bigger than screen size
video: of_display_timing.h: include errno.h
fbdev: fbmem: Fix logo center image dx issue
We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
wrote fully into the zone.
The assumption is determined by "alloc_offset == capacity". This condition
won't work if the last ordered extent is canceled due to some errors. In
that case, we consider the zone is deactivated without sending the finish
command while it's still active.
This inconstancy results in activating another block group while we cannot
really activate the underlying zone, which causes the active zone exceeds
errors like below.
BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
Fix the issue by removing the optimization for now.
Fixes: 8376d9e1ed ("btrfs: zoned: finish superblock zone once no space left for new SB")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bioc would leak on the normal completion path and also on the RAID56
check (but that one won't happen in practice due to the invalid
combination with zoned mode).
Fixes: 7db1c5d14d ("btrfs: zoned: support dev-replace in zoned filesystems")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.
This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).
The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.
So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
there is unexpected word "the" in comments need to remove
Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>
Link: https://lore.kernel.org/r/20220621080240.42198-1-jiangjian@cdjrlc.com
This cycle we added support for mounting overlayfs on top of idmapped
mounts. Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.
I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.
I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.
Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.
The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:
setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:
mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
mkdir -pv /vol/contpool/ctrover/{over,work}
chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
The user now creates an idmapped mount for the rootfs:
mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
/var/lib/lxc/c2/rootfs \
/vol/contpool/lowermap
This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.
Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.
mount -t overlay overlay \
-o lowerdir=/vol/contpool/lowermap, \
upperdir=/vol/contpool/overmap/over, \
workdir=/vol/contpool/overmap/work \
/vol/contpool/merge
The user can do this in two ways:
(1) Mount overlayfs in the initial user namespace and expose it to the
container.
(2) Mount overlayfs on top of the idmapped mounts inside of the container's
user namespace.
Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.
Now the user tries to retrieve the POSIX ACLs using the getfacl command
getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
and to their surprise they see:
# file: vol/contpool/merge/home/ubuntu/.bashrc
# owner: 1000
# group: 1000
user::rw-
user:4294967295:rwx
group::r--
mask::rwx
other::r--
indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}
If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get == ovl_posix_acl_xattr_get()
| -> ovl_xattr_get()
| -> vfs_getxattr()
| -> __vfs_getxattr()
| -> handler->get() /* lower filesystem callback */
|> posix_acl_fix_xattr_to_user()
{
4 = make_kuid(&init_user_ns, 4);
4 = mapped_kuid_fs(&init_user_ns, 4);
/* FAILURE */
-1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
}
As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.
This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:
setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container. IOW, we want to see 0 as the result of the idmapping
translations.
If the user chooses option (1) we get the following callchain for fscaps:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() ________________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */
And if the user chooses option (2) we get:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
-> vfs_getxattr()
-> xattr_getsecurity()
-> security_inode_getsecurity() _______________________________
-> cap_inode_getsecurity() | |
{ V |
10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0); |
10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000); |
/* Expected result is 0 and thus that we own the fscap. */ |
0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000); |
} |
-> vfs_getxattr_alloc() |
-> handler->get == ovl_other_xattr_get() |
|-> vfs_getxattr() |
-> xattr_getsecurity() |
-> security_inode_getsecurity() |
-> cap_inode_getsecurity() |
{ |
0 = make_kuid(0:0:4k /* lower s_user_ns */, 0); |
10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
|____________________________________________________________________|
}
-> vfs_getxattr_alloc()
-> handler->get == /* lower filesystem callback */
We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.
For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.
The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().
To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().
To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { |
| V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(&init_user_ns, 10000004);
| } |_________________________________________________
| |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
}
And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:
idmapped mount /vol/contpool/merge: 0:10000000:65536
caller's idmapping: 0:10000000:65536
overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
sys_getxattr()
-> path_getxattr()
-> getxattr()
-> do_getxattr()
|> vfs_getxattr()
| |> __vfs_getxattr()
| | -> handler->get == ovl_posix_acl_xattr_get()
| | -> ovl_xattr_get()
| | -> vfs_getxattr()
| | |> __vfs_getxattr()
| | | -> handler->get() /* lower filesystem callback */
| | |> posix_acl_getxattr_idmapped_mnt()
| | {
| | 4 = make_kuid(&init_user_ns, 4);
| | 10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
| | 10000004 = from_kuid(&init_user_ns, 10000004);
| | |_______________________
| | } |
| | |
| |> posix_acl_getxattr_idmapped_mnt() |
| { V
| 10000004 = make_kuid(&init_user_ns, 10000004);
| 10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
| 10000004 = from_kuid(0(&init_user_ns, 10000004);
| |_________________________________________________
| } |
| |
|> posix_acl_fix_xattr_to_user() |
{ V
10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
/* SUCCESS */
4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
}
The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:
ovl_permission()
-> generic_permission()
-> acl_permission_check()
-> check_acl()
-> get_acl()
-> inode->i_op->get_acl() == ovl_get_acl()
> get_acl() /* on the underlying filesystem)
->inode->i_op->get_acl() == /*lower filesystem callback */
-> posix_acl_permission()
passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.
Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a49 ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
napi_build_skb() reuses NAPI skbuff_head cache in order to save some
cycles on freeing/allocating skbuff_heads on every new Rx or completed
Tx.
Use napi_consume_skb() to feed the cache with skbuff_heads of completed
Tx, so it's never empty. The budget parameter is added to indicate NAPI
context, as a value of zero can be passed in the case of netpoll.
Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>