linux-xiaomi-chiron

Author	SHA1	Message	Date
Ard Biesheuvel	ea6f3af4c5	ACPI: GED: add support for _Exx / _Lxx handler methods Per the ACPI spec, interrupts in the range [0, 255] may be handled in AML using individual methods whose naming is based on the format _Exx or _Lxx, where xx is the hex representation of the interrupt index. Add support for this missing feature to our ACPI GED driver. Cc: v4.9+ <stable@vger.kernel.org> # v4.9+ Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-05-15 18:30:14 +02:00
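The _Exx / _Lxx naming rule described in the commit above can be sketched as follows. This is a minimal illustration, not the GED driver's code: the helper name `ged_method_name` and its parameters are hypothetical, and uppercase hex digits are assumed.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not taken from the driver): build the 4-character
 * AML method name for an interrupt index in [0, 255].
 * edge_triggered != 0 selects _Exx, otherwise _Lxx.
 * Returns 0 on success, -1 if irq is outside the range the spec covers.
 */
static int ged_method_name(unsigned int irq, int edge_triggered, char name[5])
{
	if (irq > 255)
		return -1;

	/* "_" + 'E' or 'L' + two hex digits + terminating NUL = 5 bytes */
	snprintf(name, 5, "_%c%02X", edge_triggered ? 'E' : 'L', irq & 0xFFu);
	return 0;
}
```

Under these assumptions, interrupt 0x1F with an edge trigger maps to `_E1F`, and interrupt 5 with a level trigger maps to `_L05`.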
David Matlack	cb953129bf	kvm: add halt-polling cpu usage stats Two new stats for exposing halt-polling cpu usage: halt_poll_success_ns and halt_poll_fail_ns. Thus the sum of these two stats is the total cpu time spent polling. "success" means the VCPU polled until a virtual interrupt was delivered. "fail" means the VCPU had to schedule out (either because the maximum poll time was reached or it needed to yield the CPU). To avoid touching every arch's kvm_vcpu_stat struct, only update and export halt-polling cpu usage stats if we're on x86. Exporting cpu usage as a u64 and in nanoseconds means we will overflow at ~500 years, which seems reasonably large. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Jon Cargille <jcargill@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200508182240.68440-1-jcargill@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:26 -04:00
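The overflow estimate in the commit above can be checked with a little integer arithmetic. The sketch below is only a back-of-the-envelope check: the function name is hypothetical, a 365-day year is assumed, and the counter is assumed to accumulate continuously.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: whole years a u64 nanosecond counter can
 * accumulate before wrapping, assuming a 365-day year.
 */
static uint64_t u64_ns_counter_years(void)
{
	const uint64_t secs_per_year = 365ULL * 24 * 60 * 60; /* 31,536,000 */

	return UINT64_MAX / NSEC_PER_SEC / secs_per_year;
}
```

With these assumptions the function returns 584, the same order of magnitude as the "~500 years" quoted in the message.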
Jim Mattson	93dff2fed2	KVM: nVMX: Migrate the VMX-preemption timer The hrtimer used to emulate the VMX-preemption timer must be pinned to the same logical processor as the vCPU thread to be interrupted if we want to have any hope of adhering to the architectural specification of the VMX-preemption timer. Even with this change, the emulated VMX-preemption timer VM-exit occasionally arrives too late. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:26 -04:00
Jim Mattson	ada0098df6	KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absolute Prepare for migration of this hrtimer, by changing it from relative to absolute. (I couldn't get migration to work with a relative timer.) Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-3-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:25 -04:00
Jim Mattson	1739f3d56d	KVM: nVMX: Really make emulated nested preemption timer pinned The PINNED bit is ignored by hrtimer_init. It is only considered when starting the timer. When the hrtimer isn't pinned to the same logical processor as the vCPU thread to be interrupted, the emulated VMX-preemption timer often fails to adhere to the architectural specification. Fixes: `f15a75eedc` ("KVM: nVMX: make emulated nested preemption timer pinned") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:24 -04:00
Sean Christopherson	6c1c6e5835	KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup() Remove a 'struct kvm_x86_ops' param that got left behind when the nested ops were moved to their own struct. Fixes: `33b2217245` ("KVM: x86: move nested-related kvm_x86_ops to a separate struct") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:24 -04:00
Suravee Suthikulpanit	de18248162	KVM: SVM: Remove unnecessary V_IRQ unsetting This has already been handled in the prior call to svm_clear_vintr(). Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-5-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:23 -04:00
Suravee Suthikulpanit	e14b7786cb	KVM: SVM: Merge svm_enable_vintr into svm_set_vintr Code clean up and remove unnecessary intercept check for INTERCEPT_VINTR. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <1588771076-73790-4-git-send-email-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:23 -04:00
Wanpeng Li	26efe2fd92	KVM: VMX: Handle preemption timer fastpath This patch implements a fastpath for the preemption timer vmexit. The vmexit can be handled quickly so it can be performed with interrupts off and going back directly to the guest. Testing on SKX Server. cyclictest in guest(w/o mwait exposed, adaptive advance lapic timer is default -1): 5540.5ns -> 4602ns 17% kvm-unit-test/vmexit.flat: w/o advanced timer: tscdeadline_immed: 3028.5 -> 2494.75 17.6% tscdeadline: 5765.7 -> 5285 8.3% w/ adaptive advance timer default -1: tscdeadline_immed: 3123.75 -> 2583 17.3% tscdeadline: 4663.75 -> 4537 2.7% Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-8-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:22 -04:00
Wanpeng Li	ae95f566b3	KVM: X86: TSCDEADLINE MSR emulation fastpath This patch implements a fast path for emulation of writes to the TSCDEADLINE MSR. Besides shortcutting various housekeeping tasks in the vCPU loop, the fast path can also deliver the timer interrupt directly without going through KVM_REQ_PENDING_TIMER because it runs in vCPU context. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-7-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:21 -04:00
Paolo Bonzini	199a8b84c4	KVM: x86: introduce kvm_can_use_hv_timer Replace the ad hoc test in vmx_set_hv_timer with a test in the caller, start_hv_timer. This test is not Intel-specific and would be duplicated when introducing the fast path for the TSC deadline MSR. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:21 -04:00
Wanpeng Li	379a3c8ee4	KVM: VMX: Optimize posted-interrupt delivery for timer fastpath While optimizing posted-interrupt delivery especially for the timer fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt() to introduce substantial latency because the processor has to perform all vmentry tasks, ack the posted interrupt notification vector, read the posted-interrupt descriptor etc. This is not only slow, it is also unnecessary when delivering an interrupt to the current CPU (as is the case for the LAPIC timer) because PIR->IRR and IRR->RVI synchronization is already performed on vmentry Therefore skip kvm_vcpu_trigger_posted_interrupt in this case, and instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST fastpath as well. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:20 -04:00
Wanpeng Li	404d5d7bff	KVM: X86: Introduce more exit_fastpath_completion enum values Add a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED: kvm will still go through its full run loop, but it will skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST: complete fastpath; the guest can be re-entered without invoking the exit handler or going back to vcpu_run. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:19 -04:00
Wanpeng Li	5a9f54435a	KVM: X86: Introduce kvm_vcpu_exit_request() helper Introduce the kvm_vcpu_exit_request() helper. Some conditions must be checked before immediately entering the guest again: if the fastpath completed but something prevents immediate re-entry, skip invoking the exit handler and go through the full run loop. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-5-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:19 -04:00
Sean Christopherson	2c4c413255	KVM: x86: Print symbolic names of VMX VM-Exit flags in traces Use __print_flags() to display the names of VMX flags in VM-Exit traces and strip the flags when printing the basic exit reason, e.g. so that a failed VM-Entry due to invalid guest state gets recorded as "INVALID_STATE FAILED_VMENTRY" instead of "0x80000021". Opportunistically fix misaligned variables in the kvm_exit and kvm_nested_vmexit_inject tracepoints. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200508235348.19427-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:18 -04:00
Wanpeng Li	dcf068da7e	KVM: VMX: Introduce generic fastpath handler Introduce generic fastpath handler to handle MSR fastpath, VMX-preemption timer fastpath etc; move it after vmx_complete_interrupts() in order to catch events delivered to the guest, and abort the fast path in later patches. While at it, move the kvm_exit tracepoint so that it is printed for fastpath vmexits as well. There is no observed performance effect for the IPI fastpath after this patch. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <1588055009-12677-2-git-send-email-wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:17 -04:00
Sean Christopherson	9e826feb8f	KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_* Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit as the vmcs12 fields are updated in vmx_set_msr(), and writes to the corresponding MSRs are always intercepted by KVM when running L2. Dropping the propagation was intended to be done in the same commit that added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was only shuffled around[2][3]. [1] https://patchwork.kernel.org/patch/10933215 [2] https://patchwork.kernel.org/patch/10933215/#22682289 [3] https://lore.kernel.org/patchwork/patch/1088643 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:17 -04:00
Sean Christopherson	2408500dfc	KVM: nVMX: Truncate writes to vmcs.SYSENTER_EIP/ESP for 32-bit vCPU Explicitly truncate the data written to vmcs.SYSENTER_EIP/ESP on WRMSR if the virtual CPU doesn't support 64-bit mode. The SYSENTER address fields in the VMCS are natural width, i.e. bits 63:32 are dropped if the CPU doesn't support Intel 64 architectures. This behavior is visible to the guest after a VM-Exit/VM-Enter roundtrip, e.g. if the guest sets bits 63:32 in the actual MSR. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:16 -04:00
Uros Bizjak	551896e0e0	KVM: VMX: Improve handle_external_interrupt_irqoff inline assembly Improve handle_external_interrupt_irqoff inline assembly in several ways: - remove unneeded %c operand modifiers and "$" prefixes - use %rsp instead of _ASM_SP, since we are in CONFIG_X86_64 part - use $-16 immediate to align %rsp - remove unneeded use of __ASM_SIZE macro - define "ss" named operand only for X86_64 The patch introduces no functional changes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200504155706.2516956-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:16 -04:00
Peter Xu	62315b6393	KVM: Documentation: Fix up cpuid page 0x4b564d00 and 0x4b564d01 belong to KVM_FEATURE_CLOCKSOURCE2. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155913.267562-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:15 -04:00
Peter Xu	0fd4604469	KVM: X86: Sanity check on gfn before removal The index returned by kvm_async_pf_gfn_slot() will be removed when an async pf gfn is going to be removed. However kvm_async_pf_gfn_slot() is not reliable in that it can return the last key it loops over even if the gfn is not found in the async gfn array. It should never happen, but it's still better to sanity check against that to make sure no unexpected gfn will be removed. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155910.267514-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:15 -04:00
Peter Xu	5b494aea13	KVM: No need to retry for hva_to_pfn_remapped() hva_to_pfn_remapped() calls fixup_user_fault(), which has already handled the retry gracefully. Even if "unlocked" is set to true, it means that we've got a VM_FAULT_RETRY inside fixup_user_fault(), however the page fault has already retried and we should have the pfn set correctly. No need to do that again. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155906.267462-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:14 -04:00
Peter Xu	dd03bcaad0	KVM: X86: Force ASYNC_PF_PER_VCPU to be power of two Forcing ASYNC_PF_PER_VCPU to be a power of two is simpler than calling roundup_pow_of_two() from time to time. Do this by adding a BUILD_BUG_ON() inside the hash function. Another point is that async pf generally does not allow concurrency over ASYNC_PF_PER_VCPU anyway (see kvm_setup_async_pf()), so it makes little sense for it not to be a power of two, since some of the entries would definitely be wasted. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200416155859.267366-1-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:13 -04:00
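The power-of-two requirement in the commit above can be sketched in plain C. This is an illustration only: the value 64, the helper names, and the mixing constant are assumptions, and C11 `static_assert` stands in for the kernel's BUILD_BUG_ON().

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

#define ASYNC_PF_PER_VCPU 64   /* value assumed for illustration */

/* Compile-time check, playing the role BUILD_BUG_ON() plays in the kernel. */
static_assert(ASYNC_PF_PER_VCPU > 0 &&
	      (ASYNC_PF_PER_VCPU & (ASYNC_PF_PER_VCPU - 1)) == 0,
	      "ASYNC_PF_PER_VCPU must be a power of two");

/* Hypothetical hash: mix the gfn, then mask instead of using modulo. */
static uint32_t slot_for_gfn(uint64_t gfn)
{
	uint32_t h = (uint32_t)(gfn ^ (gfn >> 32)) * 0x9E3779B1u;

	return h & (ASYNC_PF_PER_VCPU - 1);
}

/* Linear-probe step; the mask wraps the index back to slot 0. */
static uint32_t next_slot(uint32_t slot)
{
	return (slot + 1) & (ASYNC_PF_PER_VCPU - 1);
}
```

Because the table size is a power of two, the mask always yields an in-range index and every slot is reachable, so no rounding helper is needed at the call sites.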
Uros Bizjak	c16312f4fa	KVM: VMX: Remove unneeded __ASM_SIZE usage with POP instruction POP [mem] defaults to the word size, and the only legal non-default size is 16 bits, e.g. a 32-bit POP will #UD in 64-bit mode and vice versa, no need to use __ASM_SIZE macro to force operating mode. Changes since v1: - Fix commit message. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200427205035.1594232-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:13 -04:00
Sean Christopherson	8123f26524	KVM: x86/mmu: Add a helper to consolidate root sp allocation Add a helper, mmu_alloc_root(), to consolidate the allocation of a root shadow page, which has the same basic mechanics for all flavors of TDP and shadow paging. Note, __pa(sp->spt) doesn't need to be protected by mmu_lock, sp->spt points at a kernel page. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428023714.31923-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:12 -04:00
Sean Christopherson	3bae0459bc	KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enums Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	e662ec3e07	KVM: x86/mmu: Move max hugepage level to a separate #define Rename PT_MAX_HUGEPAGE_LEVEL to KVM_MAX_HUGEPAGE_LEVEL and make it a separate define in anticipation of dropping KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	b2f432f872	KVM: x86/mmu: Tweak PSE hugepage handling to avoid 2M vs 4M conundrum Change the PSE hugepage handling in walk_addr_generic() to fire on any page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE paging only has two levels, so "== 2" and "> 1" are functionally the same, i.e. this is a nop. A future patch will drop KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums, at which point "walker->level == PG_LEVEL_2M" is semantically incorrect (though still functionally ok). No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:10 -04:00
Xiaoyao Li	a71936ab46	kvm: x86: Cleanup vcpu->arch.guest_xstate_size vcpu->arch.guest_xstate_size lost its only user since commit `df1daba7d1` ("KVM: x86: support XSAVES usage in the host"), so clean it up. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200429154312.1411-1-xiaoyao.li@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:10 -04:00
Gustavo A. R. Silva	9361797c76	PNPBIOS: Replace zero-length array with flexible-array The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2020-05-15 18:20:49 +02:00
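The flexible-array pattern from the commit above can be sketched in user-space C. `struct foo` and `struct boo` follow the message's own example; the contents of `struct boo` and the allocator `foo_alloc` are hypothetical, and `malloc` stands in for the kernel allocator.

```c
#include <stddef.h>
#include <stdlib.h>

struct boo {
	int value;              /* member assumed for illustration */
};

struct foo {
	int stuff;
	struct boo array[];     /* flexible array member: must come last */
};

/*
 * Hypothetical allocator: reserve room for the header plus n trailing
 * elements. sizeof(*f) covers only the fixed part of the struct.
 */
static struct foo *foo_alloc(size_t n)
{
	struct foo *f = malloc(sizeof(*f) + n * sizeof(f->array[0]));

	if (f)
		f->stuff = (int)n;
	return f;
}
```

This sketch does not guard the size computation against overflow; in kernel code the struct_size() helper is typically used for that purpose.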
Sean Christopherson	68cda40d9f	KVM: nVMX: Tweak handling of failure code for nested VM-Enter failure Use an enum for passing around the failure code for a failed VM-Enter that results in VM-Exit to provide a level of indirection from the final resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit qualification field is an unsigned long, e.g. passing around 'u32 exit_qual' throws up red flags as it suggests KVM may be dropping bits when reporting errors to L1. This is a red herring because the only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely close to overflowing a u32. Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated by the MSR load list, which returns the (1-based) entry that failed, and the number of MSRs to load is a 32-bit VMCS field. At first blush, it would appear that overflowing a u32 is possible, but the number of MSRs that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC). In other words, there are two completely disparate types of data that eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is an 'unsigned long' in nature. This was presumably the reasoning for switching to 'u32' when the related code was refactored in commit `ca0bde28f2` ("kvm: nVMX: Split VMCS checks from nested_vmx_run()"). Using an enum for the failure code addresses the technically-possible-but-will-never-happen scenario where Intel defines a failure code that doesn't fit in a 32-bit integer. The enum variables and values will either be automatically sized (gcc 5.4 behavior) or be subjected to some combination of truncation. The former case will simply work, while the latter will trigger a compile-time warning unless the compiler is being particularly unhelpful. 
Separating the failure code from the failed MSR entry allows for disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the conundrum where KVM has to choose between 'u32 exit_qual' and tracking values as 'unsigned long' that have no business being tracked as such. To cement the split, set vmcs12->exit_qualification directly from the entry error code or failed MSR index instead of bouncing through a local variable. Opportunistically rename the variables in load_vmcs12_host_state() and vmx_set_nested_state() to call out that they're ignored, set exit_reason on demand on nested VM-Enter failure, and add a comment in nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap. No functional change intended. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:07:31 -04:00
Codrin Ciubotariu	c5a2838025	ARM: dts: at91: Configure I2C SCL gpio as open drain The SCL gpio pin used by I2C bus for recovery needs to be configured as open drain. Fixes: `455fec938b` ("ARM: dts: at91: sama5d2: add i2c gpio pinctrl") Fixes: `a4bd8da893` ("ARM: dts: at91: sama5d3: add i2c gpio pinctrl") Fixes: `8fb82f050c` ("ARM: dts: at91: sama5d4: add i2c gpio pinctrl") Signed-off-by: Codrin Ciubotariu <codrin.ciubotariu@microchip.com> Link: https://lore.kernel.org/r/20200515140001.287932-1-codrin.ciubotariu@microchip.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>	2020-05-15 17:38:54 +02:00
Michael S. Tsirkin	1b0be99f1a	vhost: missing __user tags sparse warns about converting void * to void __user *. This is not new but only got noticed now that vhost is built on more systems. This is just a question of __user tags missing in a couple of places, so fix it up. Fixes: `f889491380` ("vhost: introduce O(1) vq metadata cache") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2020-05-15 11:36:31 -04:00
Sami Tolvanen	cc49c71d2a	efi/libstub: Disable Shadow Call Stack Shadow stacks are not available in the EFI stub, filter out SCS flags. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	439dc2a117	arm64: scs: Add shadow stacks for SDEI This change adds per-CPU shadow call stacks for the SDEI handler. Similarly to how the kernel stacks are handled, we add separate shadow stacks for normal and critical events. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	5287569a79	arm64: Implement Shadow Call Stack This change implements shadow stack switching, initial SCS set-up, and interrupt shadow stacks for arm64. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	9654736891	arm64: Disable SCS for hypervisor code Disable SCS for code that runs at a different exception level by adding __noscs to __hyp_text. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	cde5dec89e	arm64: vdso: Disable Shadow Call Stack Shadow stacks are only available in the kernel, so disable SCS instrumentation for the vDSO. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	e73f02c6eb	arm64: efi: Restore register x18 if it was corrupted If we detect a corrupted x18, restore the register before jumping back to potentially SCS instrumented code. This is safe, because the wrapper is called with preemption disabled and a separate shadow stack is used for interrupt handling. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	6d37d81f44	arm64: Preserve register x18 when CPU is suspended Don't lose the current task's shadow stack when the CPU is suspended. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:50 +01:00
Sami Tolvanen	da64e9d1f8	arm64: Reserve register x18 from general allocation with SCS Reserve the x18 register from general allocation when SCS is enabled, because the compiler uses the register to store the current task's shadow stack pointer. Note that all external kernel modules must also be compiled with -ffixed-x18 if the kernel has SCS enabled. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:49 +01:00
Sami Tolvanen	ddc9863e9e	scs: Disable when function graph tracing is enabled The graph tracer hooks returns by modifying frame records on the (regular) stack, but with SCS the return address is taken from the shadow stack, and the value in the frame record has no effect. As we don't currently have a mechanism to determine the corresponding slot on the shadow stack (and to pass this through the ftrace infrastructure), for now let's disable SCS when the graph tracer is enabled. With SCS the return address is taken from the shadow stack and the value in the frame record has no effect. The mcount based graph tracer hooks returns by modifying frame records on the (regular) stack, and thus is not compatible. The patchable-function-entry graph tracer used for DYNAMIC_FTRACE_WITH_REGS modifies the LR before it is saved to the shadow stack, and is compatible. Modifying the mcount based graph tracer to work with SCS would require a mechanism to determine the corresponding slot on the shadow stack (and to pass this through the ftrace infrastructure), and we expect that everyone will eventually move to the patchable-function-entry based graph tracer anyway, so for now let's disable SCS when the mcount-based graph tracer is enabled. SCS and patchable-function-entry are both supported from LLVM 10.x. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:49 +01:00
Sami Tolvanen	5bbaf9d1fc	scs: Add support for stack usage debugging Implements CONFIG_DEBUG_STACK_USAGE for shadow stacks. When enabled, also prints out the highest shadow stack usage per process. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> [will: rewrote most of scs_check_usage()] Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:49 +01:00
Sami Tolvanen	628d06a48f	scs: Add page accounting for shadow call stack allocations This change adds accounting for the memory allocated for shadow stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:49 +01:00
Sami Tolvanen	d08b9f0ca6	scs: Add support for Clang's Shadow Call Stack (SCS) This change adds generic support for Clang's Shadow Call Stack, which uses a shadow stack to protect return addresses from being overwritten by an attacker. Details are available here: https://clang.llvm.org/docs/ShadowCallStack.html Note that security guarantees in the kernel differ from the ones documented for user space. The kernel must store addresses of shadow stacks in memory, which means an attacker capable of reading and writing arbitrary memory may be able to locate them and hijack control flow by modifying the stacks. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [will: Numerous cosmetic changes] Signed-off-by: Will Deacon <will@kernel.org>	2020-05-15 16:35:45 +01:00
Michael Kao	26af2884e4	arm64: dts: mt8173: fix cooling device range When thermal reaches target temperature, it would be pinned to state 0 (max frequency and power). Fix the throttling range to no limit. Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org> Signed-off-by: Michael Kao <michael.kao@mediatek.com> Link: https://lore.kernel.org/r/20200424082340.4127-1-michael.kao@mediatek.com Signed-off-by: Matthias Brugger <matthias.bgg@gmail.com>	2020-05-15 17:32:24 +02:00
Daniel Borkmann	ed24a7a852	Merge branch 'bpf-cap' Alexei Starovoitov says: ==================== v6->v7: - permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski. It's equivalent to SOCKET_FILTER which is unpriv. v5->v6: - split allow_ptr_leaks into four flags. - retain bpf_jit_limit under cap_sys_admin. - fixed few other issues spotted by Daniel. v4->v5: Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN. The user process has to have - CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs. Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN. That could be relaxed in the future. - CAP_BPF and CAP_PERFMON to load tracing programs. - CAP_BPF and CAP_NET_ADMIN to load networking programs. (or CAP_SYS_ADMIN for backward compatibility). CAP_BPF solves three main goals: 1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF. More on this below. This is the major difference vs v4 set back from Sep 2019. 2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN prevents pointer leaks and arbitrary kernel memory access. 3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs and making BPF infra more secure. Currently fuzzers run in unpriv. They will be able to run with CAP_BPF. The patchset is long overdue follow-up from the last plumbers conference. Comparing to what was discussed at LPC the CAP* checks at attach time are gone. For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was no check at attach time. 
For networking and cgroup progs CAP_SYS_ADMIN was required at load time and CAP_NET_ADMIN at attach time, but there are several ways to bypass CAP_NET_ADMIN: - if networking prog is using tail_call writing FD into prog_array will effectively attach it, but bpf_map_update_elem is an unprivileged operation. - freplace prog with CAP_SYS_ADMIN can replace networking prog Consolidating all CAP checks at load time makes security model similar to open() syscall. Once the user got an FD it can do everything with it. read/write/poll don't check permissions. The same way when bpf_prog_load command returns an FD the user can do everything (including attaching, detaching, and bpf_test_run). The important design decision is to allow ID->FD transition for CAP_SYS_ADMIN only. What it means that user processes can run with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each other unless they pass FDs via scm_rights or via pinning in bpffs. ID->FD is a mechanism for human override and introspection. An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps, links) that are owned by such processes from each other. Another significant change from LPC is that the verifier checks are split into four flags. The allow_ptr_leaks flag allows pointer manipulations. The bpf_capable flag enables all modern verifier features like bpf-to-bpf calls, BTF, bounded loops, dead code elimination, etc. All the goodness. The bypass_spec_v1 flag enables indirect stack access from bpf programs and disables speculative analysis and bpf array mitigations. The bypass_spec_v4 flag disables store sanitation. That allows networking progs with CAP_BPF + CAP_NET_ADMIN enjoy modern verifier features while being more secure. 
Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON, since subtracting pointers (like skb->data_end - skb->data) is a pointer leak, but the verifier may get smarter in the future. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>	2020-05-15 17:29:46 +02:00
Alexei Starovoitov	8162600118	selftests/bpf: Use CAP_BPF and CAP_PERFMON in tests Make all test_verifier test exercise CAP_BPF and CAP_PERFMON Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-4-alexei.starovoitov@gmail.com	2020-05-15 17:29:41 +02:00
Alexei Starovoitov	2c78ee898d	bpf: Implement CAP_BPF Implement permissions as stated in uapi/linux/capability.h In order to do that the verifier allow_ptr_leaks flag is split into four flags and they are set as: env->allow_ptr_leaks = bpf_allow_ptr_leaks(); env->bypass_spec_v1 = bpf_bypass_spec_v1(); env->bypass_spec_v4 = bpf_bypass_spec_v4(); env->bpf_capable = bpf_capable(); The first three are currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon. 'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enables ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier. That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigation applied. Such networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com	2020-05-15 17:29:41 +02:00
Alexei Starovoitov	a17b53c4a4	bpf, capability: Introduce CAP_BPF Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN. For backward compatibility include them in CAP_SYS_ADMIN as well. The end result provides simple safety model for applications that use BPF: - to load tracing program types BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc} use CAP_BPF and CAP_PERFMON - to load networking program types BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc} use CAP_BPF and CAP_NET_ADMIN There are few exceptions from this rule: - bpf_trace_printk() is allowed in networking programs, but it's using tracing mechanism, hence this helper needs additional CAP_PERFMON if networking program is using this helper. - BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only to discourage production use. - BPF HW offload is allowed under CAP_SYS_ADMIN. - bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only. CAPs are not checked at attach/detach time with two exceptions: - loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users, hence CAP_NET_ADMIN is required at attach time. - flow_dissector detach doesn't check prog FD at detach, hence CAP_NET_ADMIN is required at detach time. CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id command and convert them to file descriptor via GET_FD_BY_ID command. This restriction guarantees that multiple tasks with CAP_BPF are not able to affect each other. That leads to clean isolation of tasks. For example: task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link. task B with the same capabilities cannot detach that firewall unless task A explicitly passed link FD to task B via scm_rights or bpffs. CAP_SYS_ADMIN can still detach/unload everything. Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can accidentally mess with each other's programs and maps. 
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other. CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data. Such networking progs cannot access arbitrary kernel memory or leak pointers. bpftool, bpftrace, bcc tools binaries should NOT be installed with CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets. But users with these two permissions will be able to use these tracing tools. CAP_PERFMON is least secure, since it allows kprobes and kernel memory access. CAP_NET_ADMIN can stop network traffic via iproute2. CAP_BPF is the safest from security point of view and harmless on its own. Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map and if that map is used by firewall-like bpf prog. CAP_BPF allows many bpf prog_load commands in parallel. The verifier may consume large amount of memory and significantly slow down the system. Existing unprivileged BPF operations are not affected. In particular unprivileged users are allowed to load socket_filter and cg_skb program types and to create array, hash, prog_array, map-in-map map types. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com	2020-05-15 17:29:41 +02:00
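
The permission model described in the CAP_BPF commits above can be illustrated with a small user-space sketch. It is a simplified model under stated assumptions, not the kernel's code: `struct caps`, `struct verifier_flags`, and every `model_*` and `can_*` function are hypothetical names introduced here, and the exceptions listed in the commit message (bpf_trace_printk(), BPF_F_ZERO_SEED, HW offload, attach-time checks) are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical model of a task's capability set; not a kernel type. */
struct caps {
    bool sys_admin;  /* CAP_SYS_ADMIN */
    bool bpf;        /* CAP_BPF */
    bool perfmon;    /* CAP_PERFMON */
    bool net_admin;  /* CAP_NET_ADMIN */
};

/* Hypothetical mirror of the four verifier flags named in the commit. */
struct verifier_flags {
    bool bpf_capable;
    bool allow_ptr_leaks;
    bool bypass_spec_v1;
    bool bypass_spec_v4;
};

/* CAP_SYS_ADMIN still grants everything, for backward compatibility. */
static bool model_bpf_capable(const struct caps *c)
{
    return c->bpf || c->sys_admin;
}

static bool model_perfmon_capable(const struct caps *c)
{
    return c->perfmon || c->sys_admin;
}

static bool model_net_admin_capable(const struct caps *c)
{
    return c->net_admin || c->sys_admin;
}

/* Per the commit text, the first three flags currently follow CAP_PERFMON. */
static struct verifier_flags model_verifier_flags(const struct caps *c)
{
    struct verifier_flags f;

    f.bpf_capable = model_bpf_capable(c);
    f.allow_ptr_leaks = model_perfmon_capable(c);
    f.bypass_spec_v1 = model_perfmon_capable(c);
    f.bypass_spec_v4 = model_perfmon_capable(c);
    return f;
}

/* Tracing program types: CAP_BPF and CAP_PERFMON. */
static bool can_load_tracing(const struct caps *c)
{
    return model_bpf_capable(c) && model_perfmon_capable(c);
}

/* Networking program types: CAP_BPF and CAP_NET_ADMIN. */
static bool can_load_networking(const struct caps *c)
{
    return model_bpf_capable(c) && model_net_admin_capable(c);
}

/* ID->FD conversion is reserved for CAP_SYS_ADMIN. */
static bool can_get_fd_by_id(const struct caps *c)
{
    return c->sys_admin;
}
```

In this model a task holding only CAP_BPF and CAP_NET_ADMIN can load networking programs and gets `bpf_capable`, but `allow_ptr_leaks` and both `bypass_spec_*` flags stay off, which matches the commit's claim that such programs keep the speculative-execution mitigations and cannot leak kernel pointers.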

932869 commits