This is an enhancement for the SG-style page handling in vmalloc
buffer handler to calculate the continuous pages.
When snd_sgbuf_get_chunk_size() is called for a vmalloc buffer,
currently we return only the size that fits into a single page.
However, this API call is rather supposed for obtaining the continuous
pages and most of vmalloc or noncontig buffers do have lots of
continuous pages indeed. So, in this patch, the callback now
calculates the possibly continuous pages up to the given size limit.
Note that the end address in the function is calculated from the last
byte, hence it's one byte shorter. This is because ofs + size can be
above the actual buffer size boundary.
Until now, this feature isn't really used, but it'll become useful in
a later patch that adds the non-contiguous buffer type that shares the
same callback function as vmalloc.
Link: https://lore.kernel.org/r/20210812113818.6479-1-tiwai@suse.de
Link: https://lore.kernel.org/r/20210813081645.4680-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
On PREEMPT_RT enabled kernels it is not possible to refill the object pool
from atomic context (preemption or interrupts disabled) as the allocator
might acquire 'sleeping' spinlocks.
Guard the invocation of fill_pool() accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/87sfzehdnl.ffs@tglx
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memcpy(), memmove(), and memset(), avoid
intentionally writing across neighboring fields.
The it_present member of struct ieee80211_radiotap_header is treated as a
flexible array (multiple u32s can be conditionally present). In order for
memcpy() to reason (or really, not reason) about the size of operations
against this struct, use of bytes beyond it_present need to be treated
as part of the flexible array. Add a trailing flexible array and
initialize its initial index via pointer arithmetic.
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210806215305.2875621-1-keescook@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The variable ret and label just used as return, so we delete it and
use the return statement instead of the goto statement.
Signed-off-by: dingsenjie <dingsenjie@yulong.com>
Link: https://lore.kernel.org/r/20210805064349.202148-1-dingsenjie@163.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The maximum MTU was set to 2304, which is the maximum MSDU size. While
this is valid for normal WLAN interfaces, it is too low for monitor
interfaces. A monitor interface may receive and inject MPDU frames, and
the maximum MPDU frame size is larger than 2304. The MPDU may also
contain an A-MSDU frame, in which case the size may be much larger than
the MTU limit. Since the maximum size of an A-MSDU depends on the PHY
mode of the transmitting STA, it is not possible to set an exact MTU
limit for a monitor interface. Now the maximum MTU for a monitor
interface is unrestricted.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Link: https://lore.kernel.org/r/20210628123246.2070558-1-johan.almbladh@anyfinetworks.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The address "&sband->iftype_data[i]" points to an array at the end of
struct. It can't be NULL and so the check can be removed.
Fixes: bac2fd3d75 ("mac80211: remove use of ieee80211_get_he_sta_cap()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YNmgHi7Rh3SISdog@mwanda
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
As commit 52dba8d7d5 ("mac80211: reject zero MAC address in add station")
said, we don't consider all-zeroes to be a valid MAC address in most places,
so also reject it here.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20210626130334.13624-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
iwlmei allows to integrate with the CSME firmware. There are
flows that are prioprietary for this purpose:
* Get the information of the AP the CSME firmware is connected
to. This is useful when we need to speed up the connection
process in case the CSME firmware has a TCP connection
that must be kept alive across the ownership transition.
* Forbid roaming, which will happen when the CSME firmware
wants to tell the user space not disrupt the connection.
* Request ownership, upon driver boot when the CSME firmware
owns the device. This is a notification sent by the kernel.
All those commands are expected to be used by any software
managing the connection (mainly NetworkManager). Those commands
are expected to be used only in case the CSME firmware owns
the device and doesn't want to release the device unless the
host made sure that it can keep the connectivity.
Here are the steps of the expected flow:
1) The machine boots while AMT has an active TCP connection
2) iwlwifi starts and tries to access the device
3) The device is not available because of the active TCP
connection. (If there are no active connections, the CSME
firmware would have allowed iwlwifi to use the device)
Note that all the steps up to here don't involve iwlmei. All
this happens in iwlwifi (in iwl_pcie_prepare_card_hw).
4) iwlmei establishes a connection to the CSME firmware (through
SAP)
Here iwlwifi uses iwlmei to access the device's capabilities
(since it can't touch the device), but this is not relevant
for the vendor commands.
5) The CSME firmware tells iwlmei that it uses the NIC and
that there is an acitve TCP connection, and hence, the
host needs to think twice before asking the CSME firmware
to release the device
6) iwlmei tells iwlwifi to report HW RFKILL with a special
reason
Up to here, there was no user space involved.
7) The user space (NetworkManager) boots and sees that the
device is in RFKILL because the host doesn't own the
device
8) The user space asks the kernel what AP the CSME firmware
is connected to (with the first vendor command mentionned
above)
9) The user space checks if it has a profile that matches the
reply from the CSME firmware
10) The user space installs a network to the wpa_supplicant
with a specific BSSID and a specific frequency
11) The user space prevents any type of full scan
12) The user space asks iwlmei to request ownership on the
device (with the third vendor command)
13) iwlmei request ownership from the CSME firmware
14) The CSME firmware grants ownership
15) iwlmei tells iwlwifi to lift the RFKILL
16) RFKILL OFF is reported to userspace
17) The host boots the device, loads the firwmare, and
connect to a specific BSSID without scanning including IP
in less than 600ms (this is what I measured, of course
it depends on many factors)
18) The host reports to the CSME firmware that there is a
connection
19) The TCP connection is preserved and the host has now
connectivity
20) Later, the TCP connection to the CSME firmware is
terminated
21) The CSME firmware tells iwlmei that it is now free to
do whatever it likes
22) iwlwifi sends the second vendor command to tell the
user space that it can remove the special network
configuration and pick any SSID / BSSID it likes.
Co-Developed-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20210625081717.7680-4-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add Samsung Exynos850 SoC specific data to enable pinctrl support for
all platforms based on Exynos850.
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Link: https://lore.kernel.org/r/20210811114827.27322-3-semen.protsenko@linaro.org
[krzysztof: lower-case the hex-numbers]
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
* Rework DesignWare driver to use software nodes instead of platform data
* Drop duplication of forward declaration for ACPI in consumer.h
* Get rid of legacy PCI PM code in ML IOH driver
The following is an automated git shortlog grouped by driver:
dwapb:
- Get rid of legacy platform data
- Read GPIO base from gpio-base property
- Unify ACPI enumeration checks in get_irq() and configure_irqs()
gpiolib:
- Deduplicate forward declaration in the consumer.h header
mfd:
- intel_quark_i2c_gpio: Convert GPIO to use software nodes
ml-ioh:
- Convert to dev_pm_ops
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqaflIX74DDDzMJJtb7wzTHR8rCgFAmEVJWkACgkQb7wzTHR8
rCg5ZRAAhLifQXJJBR9YCcqsccDA/DezLPUOJoM2VZNu7PdpeRKY8VPomGxbr0mD
DqjEC5OZyCyeToiXWOJmImtacYUrCw6WmOoHouzwOM/L9wEOWHMJnuGPt5207xOW
OVv5Onv2fpZUuhamsmB96CSs0jqgVmTh7SCtaWxMbroNlU2y2LZDz5FIft7+xQx/
vHi42BSBt4I6VzYC0AYJAEnqtM5LMy8GcTm3ZgzikHVyQfufhDxb+NIiWqCmji3w
+tVQpPIYdNAnDm9ApbabM3fYdDJB7JXvzHSk6Mby9HjXULfnfWPVKHoI188QjJ/F
DRyBnjK//a4tWGfJjlR6HgxQfSu99O7uYCGQHxG7emaXvSNnJ2pG094sqErhtP3A
G+bbygVAvgouzpzpb7yyxLqgagIm28o7Q7c2M291KhBFU3nQQ83cPGY7E/6ByAup
HmBSXVVVWHa6cgi0OHVKr9Vxdjo7iexG6z5JX6kZ3rUIcggbHGvmcelGfKAq5H2t
ynSWAL5+nV09bR69vJwQpQ4YMfBYCzbK+taxZ9CjIl/QgQUKFEqOHLp0F7JkLASf
abC+6DrwXHTpqAvQpzuj8DhUndG5pTCsNGRibwLqFcFt/LXfJr5MW+gn8Hu1+bAw
H5rOKMnDZ1yYsoXoLcsbcjVtW2sJo1F+kRATwUwedRn1lzL7l3k=
=L82Z
-----END PGP SIGNATURE-----
Merge tag 'intel-gpio-v5.15-1' of gitolite.kernel.org:pub/scm/linux/kernel/git/andy/linux-gpio-intel into gpio/for-next
intel-gpio for v5.15-1
* Rework DesignWare driver to use software nodes instead of platform data
* Drop duplication of forward declaration for ACPI in consumer.h
* Get rid of legacy PCI PM code in ML IOH driver
The following is an automated git shortlog grouped by driver:
dwapb:
- Get rid of legacy platform data
- Read GPIO base from gpio-base property
- Unify ACPI enumeration checks in get_irq() and configure_irqs()
gpiolib:
- Deduplicate forward declaration in the consumer.h header
mfd:
- intel_quark_i2c_gpio: Convert GPIO to use software nodes
ml-ioh:
- Convert to dev_pm_ops
Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory
besides the stats fields. The new interface kvm_arch_create_vm_debugfs() is
defined but not yet used. It's called after kvm->debugfs_dentry is created, so
it can be referenced directly in kvm_arch_create_vm_debugfs(). Arch should
define their own versions when they want to create extra debugfs nodes.
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210730220455.26054-2-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Clear nested.pi_pending on nested VM-Enter even if L2 will run without
posted interrupts enabled. If nested.pi_pending is left set from a
previous L2, vmx_complete_nested_posted_interrupt() will pick up the
stale flag and exit to userspace with an "internal emulation error" due
the new L2 not having a valid nested.pi_desc.
Arguably, vmx_complete_nested_posted_interrupt() should first check for
posted interrupts being enabled, but it's also completely reasonable that
KVM wouldn't screw up a fundamental flag. Not to mention that the mere
existence of nested.pi_pending is a long-standing bug as KVM shouldn't
move the posted interrupt out of the IRR until it's actually processed,
e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
with a "pending" posted interrupt. Fixing the mess is a future problem.
Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
descriptor as an error, this was a benign bug as the null PI descriptor
effectively served as a check on PI not being enabled. Even then, the
new flow did not become problematic until KVM started checking the result
of kvm_check_nested_events().
Fixes: 705699a139 ("KVM: nVMX: Enable nested posted interrupt processing")
Fixes: 966eefb896 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810144526.2662272-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The ROL16(val, n) macro is repeatedly defined in several vmcs-related
files, and it has never been used outside the KVM context.
Let's move it to vmcs.h without any intended functional changes.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20210809093410.59304-4-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h,
it should never be called by anything other than low level KVM code.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
VMX and SVM instructions use asm goto to handle the fault (or in the
case of VMREAD, completely custom logic). Drop kvm_spurious_fault()'s
asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
flow that invoked it from assembly code.
Cc: Uros Bizjak <ubizjak@gmail.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210809173955.1710866-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that nested VMX pulls KVM's desired VMCS controls from vmcs01 instead
of re-calculating on the fly, bury the helpers that do the calcluations
in vmx.c.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the secondary execution controls cache now that it's effectively
dead code; it is only read immediately after it is written.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When preparing controls for vmcs02, grab KVM's desired controls from
vmcs01's shadow state instead of recalculating the controls from scratch,
or in the secondary execution controls, instead of using the dedicated
cache. Calculating secondary exec controls is eye-poppingly expensive
due to the guest CPUID checks, hence the dedicated cache, but the other
calculations aren't exactly free either.
Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on
exit) as appropriate as they may be set in vmcs01, whereas the previous
implementation relied on dynamic bits being cleared in the calculator.
Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from
vmcs01 to vmcs02. Whether or not PERF_GLOBAL_CTRL is loaded depends on
whether or not perf itself is active, so unless perf stops between the
exit from L1 and entry to L2, vmcs01 will hold the desired value. This
is purely an optimization as atomic_switch_perf_msrs() will set/clear
the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in
the case where perf is active (versus starting with the bits clear in
vmcs02, which was the previous behavior).
Cc: Zeng Guang <guang.zeng@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210810171952.2758100-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These stores are copied and pasted from the "if" statements above.
They are dead and while they are not really a bug, they can be
confusing to anyone reading the code as well. Remove them.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The commit efdab99281 ("KVM: x86: fix escape of guest dr6 to the host")
fixed a bug by resetting DR6 unconditionally when the vcpu being scheduled out.
But writing to debug registers is slow, and it can be visible in perf results
sometimes, even if neither the host nor the guest activate breakpoints.
Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
where DR6 gets the guest value, and it never happens at all on SVM,
the register can be cleared in vmx.c right after reading it.
Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit c77fb5fe6f ("KVM: x86: Allow the guest to run with dirty debug
registers") allows the guest accessing to DRs without exiting when
KVM_DEBUGREG_WONT_EXIT and we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before the commit.
But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
but also when KVM_DEBUGREG_BP_ENABLED. The second case is unnecessary
and just leads to a more case which leaks stale DR6 to the host which has
to be resolved by unconditionally reseting DR6 in kvm_arch_vcpu_put().
Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
on VMX because SVM always uses the DR6 value from the VMCB. So move this
line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.
Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit ae561edeb4 ("KVM: x86: DR0-DR3 are not clear on reset") added code to
ensure eff_db are updated when they're modified through non-standard paths.
But there is no reason to also update hardware DRs unless hardware breakpoints
are active or DR exiting is disabled, and in those cases updating hardware is
handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.
KVM_DEBUGREG_RELOAD just causes unnecesarry load of hardware DRs and is better
to be removed.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add yet another spinlock for the TDP MMU and take it when marking indirect
shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
nested TDP, KVM may encounter shadow pages for the TDP entries managed by
L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
which runs with mmu_lock held for read, not write.
Lack of a critical section manifests most visibly as an underflow of
unsync_children in clear_unsync_child_bit() due to unsync_children being
corrupted when multiple CPUs write it without a critical section and
without atomic operations. But underflow is the best case scenario. The
worst case scenario is that unsync_children prematurely hits '0' and
leads to guest memory corruption due to KVM neglecting to properly sync
shadow pages.
Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
would functionally be ok. Usurping the lock could degrade performance when
building upper level page tables on different vCPUs, especially since the
unsync flow could hold the lock for a comparatively long time depending on
the number of indirect shadow pages and the depth of the paging tree.
For simplicity, take the lock for all MMUs, even though KVM could fairly
easily know that mmu_lock is held for write. If mmu_lock is held for
write, there cannot be contention for the inner spinlock, and marking
shadow pages unsync across multiple vCPUs will be slow enough that
bouncing the kvm_arch cacheline should be in the noise.
Note, even though L2 could theoretically be given access to its own EPT
entries, a nested MMU must hold mmu_lock for write and thus cannot race
against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
that is running with the TDP MMU enabled. Holding mmu_lock for read also
prevents the indirect shadow page from being freed. But as above, keep
it simple and always take the lock.
Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
effectively disable unsync behavior for nested TDP. Write protecting leaf
shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
VMMs typically don't modify TDP entries, but the same may not hold true for
non-standard use cases and/or VMMs that are migrating physical pages (from
L1's perspective).
Alternative #2, the unsync logic could be made thread safe. In theory,
simply converting all relevant kvm_mmu_page fields to atomics and using
atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
would be required, (b) the code churn would be substantial, and (c) legacy
shadow paging would incur additional atomic operations in performance
sensitive paths for no benefit (to legacy shadow paging).
Fixes: a2855afc7e ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181815.3378104-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the min_level for the TDP iterator at the root level when zapping all
SPTEs to optimize the iterator's try_step_down(). Zapping a non-leaf
SPTE will recursively zap all its children, thus there is no need for the
iterator to attempt to step down. This avoids rereading the top-level
SPTEs after they are zapped by causing try_step_down() to short-circuit.
In most cases, optimizing try_step_down() will be in the noise as the cost
of zapping SPTEs completely dominates the overall time. The optimization
is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
is zapping in response to multiple memslot updates when userspace is adding
and removing read-only memslots for option ROMs. In that case, the task
doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
for read and thus can be a noisy neighbor of sorts.
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
really zap all SPTEs in this case. As is, zap_gfn_range() skips non-leaf
SPTEs whose range exceeds the range to be zapped. If shadow_phys_bits is
not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
paging, the "zap all" flows will skip top-level SPTEs whose range extends
beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
Use the current upper bound (based on host.MAXPHYADDR) to detect that the
caller wants to zap all SPTEs, e.g. instead of using the max theoretical
gfn, 1 << (52 - 12). The more precise upper bound allows the TDP iterator
to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
Add a WARN on kmv->arch.tdp_mmu_pages when the TDP MMU is destroyed to
help future debuggers should KVM decide to leak SPTEs again.
The bug is most easily reproduced by running (and unloading!) KVM in a
VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
=============================================================================
BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack_lvl+0x45/0x59
slab_err+0x95/0xc9
__kmem_cache_shutdown.cold+0x3c/0x158
kmem_cache_destroy+0x3d/0xf0
kvm_mmu_module_exit+0xa/0x30 [kvm]
kvm_arch_exit+0x5d/0x90 [kvm]
kvm_exit+0x78/0x90 [kvm]
vmx_exit+0x1a/0x50 [kvm_intel]
__x64_sys_delete_module+0x13f/0x220
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812181414.3376143-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The "mct_tick" is a per-cpu clockevents device. Set the
CLOCK_EVT_FEAT_PERCPU feature to prevent e.g. mct_tick0 being unsafely
designated as the global broadcast timer and instead treat the device as
a per-cpu wakeup timer.
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Will Deacon <will@kernel.org>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20210608154341.10794-3-will@kernel.org
All arm64 CPUs feature an architected timer, which offers a relatively
low-latency interface to a per-cpu clocksource and timer. For the most
part, using this interface is a no-brainer, with the exception of SoCs
where it cannot be used to wake up from deep idle state (i.e.
CLOCK_EVT_FEAT_C3STOP is set).
On the contrary, the Exynos MCT is extremely slow to access yet can be
used as a wakeup source. In preparation for using the Exynos MCT as a
potential wakeup timer for the Arm architected timer, reduce its ratings
so that the architected timer is preferred.
This effectively reverts the decision made in 6282edb72b
("clocksource/drivers/exynos_mct: Increase priority over ARM arch timer")
for arm64, as the reasoning for the original change was to work around
a 32-bit SoC design.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Will Deacon <will@kernel.org>
Tested-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> # exynos-5422
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/r/20210608154341.10794-2-will@kernel.org
- Plug race between enabling MTE and creating vcpus
- Fix off-by-one bug when checking whether an address range is RAM
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmEWEsoPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpD1IIQAIbZdNAIy68j2/H8sgaYT4GuYICLOvz3WhTI
Li/yRP2b0th4wT4LaKlATKJKQgliPxXZ0KCJMZxFr7aiKEyY1LZe+ddJBzetzgy2
S12v5V3cp/0DHQ6CEflUy0x8gM/BeudeYyZcHxSbLZcVB4bzFx9pBJeJ1WkLG+GC
Bx4zxdARNas+9zOUuHLCQbWfihMSrbj3CI6WIafpNeFOs3lLldT8WcRofgQfAsAx
V3FKETIOb5NUU6LKUHkYgyM3n1MZwAukaCsepDhayeeT5iEyIGXb1HkjcYOx6bfn
BhDvA7PH9oXBOFFL2sxlJKamXWZP3Bz7xyZ40MXDqC1lSMAUEh8TXJFptncEDxPb
OgXewTgCulKVSjT8YXnoTe1UNQ2dLqjw1TsqV5jXhVXIjeBcR8S4gM0hcqwvgWlO
BHaDt8BPd39rBzfC0gUkE5BHE04QuboK/Vz/+Qc6Slc3EUIdnuCtjefdRLvSxxgB
bEBW+s3zcZ7RhoSLvXgvTe3an11Os8BH921VCxgMyEnIvSDEbw3KypmPYuNCkSLc
t9GLAbPU139w7Gk7vp0oqhI8xIV7QoFk+b94JIHMvtS13yVaqBrZF33RrFzmAwVN
lXDiOdoR8mqbX2EPQVIn+BhSlebfvnJANm46tzgY1/u2mUgH//fu/cH3kpjgohco
kY+Ztnb9
=hL2s
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-fixes-5.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 5.14, take #2
- Plug race between enabling MTE and creating vcpus
- Fix off-by-one bug when checking whether an address range is RAM
Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
in L2 or if the VM-Exit should be forwarded to L1. The current logic fails
to account for the case where #PF is intercepted to handle
guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
L1. At best, L1 will complain and inject the #PF back into L2. At
worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
page faults.
Note, while the bug was technically introduced by the commit that added
support for the MAXPHYADDR madness, the shame is all on commit
a0c134347b ("KVM: VMX: introduce vmx_need_pf_intercept").
Fixes: 1dbf5d68af ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
Cc: stable@vger.kernel.org
Cc: Peter Shier <pshier@google.com>
Cc: Oliver Upton <oupton@google.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210812045615.3167686-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a nested EPT violation/misconfig is injected into the guest,
the shadow EPT PTEs associated with that address need to be synced.
This is done by kvm_inject_emulated_page_fault() before it calls
nested_ept_inject_page_fault(). However, that will only sync the
shadow EPT PTE associated with the current L1 EPTP. Since the ASID
is based on EP4TA rather than the full EPTP, so syncing the current
EPTP is not enough. The SPTEs associated with any other L1 EPTPs
in the prev_roots cache with the same EP4TA also need to be synced.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20210806222229.1645356-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
hv_vcpu is initialized again a dozen lines below, and at this
point vcpu->arch.hyperv is not valid. Remove the initializer.
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove an ancient restriction that disallowed exposing EFER.NX to the
guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
The motivation of the check, added by commit 2cc51560ae ("KVM: VMX:
Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
the guest with the host's EFER.NX and thus avoid context switching EFER
if the only divergence was the NX bit.
Fast forward to today, and KVM has long since stopped running the guest
with the host's EFER.NX. Not only does KVM context switch EFER if
host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
guest.EFER.NX=1 when using shadow paging (to emulate SMEP). Furthermore,
the entire motivation for the restriction was made obsolete over a decade
ago when Intel added dedicated host and guest EFER fields in the VMCS
(Nehalem timeframe), which reduced the overhead of context switching EFER
from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
In practice, the removed restriction only affects non-PAE 32-bit kernels,
as EFER.NX is set during boot if NX is supported and the kernel will use
PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
will actually use NX itself (mark PTEs non-executable).
Alternatively and/or complementarily, startup_32_smp() in head_32.S could
be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
the scenario where NX is supported but not enabled. However, that runs
the risk of breaking non-KVM non-PAE kernels (though the risk is very,
very low as there are no known EFER.NX errata), and also eliminates an
easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
across nested virtualization transitions.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210805183804.1221554-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add serial driver data for Exynos850 SoC. This driver data is basically
reusing EXYNOS_COMMON_SERIAL_DRV_DATA, which is common for all Exynos
chips, but also enables USI init, which was added in previous commit:
"tty: serial: samsung: Init USI to keep clocks running".
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Link: https://lore.kernel.org/r/20210811114827.27322-7-semen.protsenko@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Make checkpatch happy by fixing this error:
ERROR: Macros with complex values should be enclosed in parentheses
Although this change is made to keep macros consistent with consequent
patches (adding driver data for new SoC), it's intentionally added as a
separate patch to ease possible porting efforts in future.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Link: https://lore.kernel.org/r/20210811114827.27322-6-semen.protsenko@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
UART block is a part of USI (Universal Serial Interface) IP-core in
Samsung SoCs since Exynos9810 (e.g. in Exynos850). USI allows one to
enable one of three types of serial interface: UART, SPI or I2C. That's
possible because USI shares almost all internal circuits within each
protocol. USI also provides some additional registers so it's possible
to configure it.
One USI register called USI_OPTION has reset value of 0x0. Because of
this the clock gating behavior is controlled by hardware (HWACG =
Hardware Auto Clock Gating), which simply means the serial won't work
after reset as is. In order to make it work, USI_OPTION[2:1] bits must
be set to 0b01, so that HWACG is controlled manually (by software).
Bits meaning:
- CLKREQ_ON = 1: clock is continuously provided to IP
- CLKSTOP_ON = 0: drive IP_CLKREQ to High (needs to be set along with
CLKREQ_ON = 1)
USI is not present on older chips, like s3c2410, s3c2412, s3c2440,
s3c6400, s5pv210, exynos5433, exynos4210. So the new boolean field
'.has_usi' was added to struct s3c24xx_uart_info. USI registers will be
only actually accessed when '.has_usi' field is set to "1".
This feature is needed for further serial enablement on Exynos850, but
some other new Exynos chips (like Exynos9810) may benefit from this
feature as well.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Link: https://lore.kernel.org/r/20210811114827.27322-5-semen.protsenko@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
serdev provides a generic helper to get UART Serial Bus resources.
Use it instead of an open coded variant.
Reviewed-by: Maximilian Luz <luzmaximilian@gmail.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20210806111736.66591-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The same as for I²C Serial Bus resource split and export
serdev_acpi_get_uart_resource(). We have already a few users
one of which is converted here.
Rationale of this is to consolidate parsing UART Serial Bus
resource in one place as it's done, e.g., for I²C Serial Bus.
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20210806111736.66591-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the function stm32_usart_init_port, intent of the code maybe when
irq returns a value of zero, the return should be '-ENODEV'. But the
conditional expression '? :' maybe clerical error, it should be
'?:' to make '-ENODEV' work.
But in fact, as the example in platform.c is
* int irq = platform_get_irq(pdev, 0);
* if (irq < 0)
* return irq;
So the return value of zero is unnecessary to check, at last remove
the unnecessary '?: -ENODEV'.
Co-developed-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: Tang Bin <tangbin@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20210811105136.25392-1-tangbin@cmss.chinamobile.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver neglects to check the result of platform_get_irq()'s call and
blithely passes the negative error codes to request_irq() (which takes
*unsigned* IRQ #), causing it to fail with -EINVAL, overriding an original
error code. Stop calling request_irq() with the invalid IRQ #s.
Fixes: 188db4435a ("usb: gadget: s3c: use platform resources")
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Acked-by: Felipe Balbi <balbi@kernel.org>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/bd69b22c-b484-5a1f-c798-78d4b78405f2@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver neglects to check the result of platform_get_irq()'s call and
blithely passes the negative error codes to devm_request_irq() (which takes
*unsigned* IRQ #), causing it to fail with -EINVAL, overriding an original
error code. Stop calling devm_request_irq() with the invalid IRQ #s.
Fixes: 8b2e76687b ("USB: AT91 UDC updates, mostly power management")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Acked-by: Felipe Balbi <balbi@kernel.org>
Link: https://lore.kernel.org/r/6654a224-739a-1a80-12f0-76d920f87b6c@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In dwc3_qcom_acpi_register_core(), the driver neglects to check the result
of platform_get_irq()'s call and blithely assigns the negative error codes
to the allocated child device's IRQ resource and then passing this resource
to platform_device_add_resources() and later causing dwc3_otg_get_irq() to
fail anyway. Stop calling platform_device_add_resources() with the invalid
IRQ #s, so that there's less complexity in the IRQ error checking.
Fixes: 2bc02355f8 ("usb: dwc3: qcom: Add support for booting with ACPI")
Acked-by: Felipe Balbi <balbi@kernel.org>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/45fec3da-1679-5bfe-5d74-219ca3fb28e7@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver neglects to check the result of platform_get_irq()'s call and
blithely passes the negative error codes to devm_request_threaded_irq()
(which takes *unsigned* IRQ #), causing it to fail with -EINVAL, overriding
an original error code. Stop calling devm_request_threaded_irq() with the
invalid IRQ #s.
Fixes: f90db10779 ("usb: dwc3: meson-g12a: Add support for IRQ based OTG switching")
Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Acked-by: Felipe Balbi <balbi@kernel.org>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/96106462-5538-0b2f-f2ab-ee56e4853912@omp.ru
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>