From 4bf173072cd69cb2e059d710429dc6c46c896548 Mon Sep 17 00:00:00 2001 From: Chris Down Date: Mon, 6 Apr 2020 20:03:30 -0700 Subject: [PATCH 001/158] mm, memcg: bypass high reclaim iteration for cgroup hierarchy root The root of the hierarchy cannot have high set, so we will never reclaim based on it. This makes that clearer and avoids another entry. Signed-off-by: Chris Down Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Cc: Tejun Heo Cc: Roman Gushchin Cc: Michal Hocko Link: http://lkml.kernel.org/r/20200312164137.GA1753625@chrisdown.name Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ca194864d802..9965a77d53e5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2254,7 +2254,8 @@ static void reclaim_high(struct mem_cgroup *memcg, continue; memcg_memory_event(memcg, MEMCG_HIGH); try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true); - } while ((memcg = parent_mem_cgroup(memcg))); + } while ((memcg = parent_mem_cgroup(memcg)) && + !mem_cgroup_is_root(memcg)); } static void high_work_func(struct work_struct *work) From 93949bb21b52aae924fd26f9775c8655dd9e885e Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Mon, 6 Apr 2020 20:03:33 -0700 Subject: [PATCH 002/158] mm: don't prepare anon_vma if vma has VM_WIPEONFORK Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path". This patchset fixes the misuse of the parent anon_vma, which is mainly caused by the child vma's vm_next and vm_prev being left the same as its parent's after duplicating a vma. As a result, code reached the parent vma's neighbor through the child vma's pointers and executed the wrong logic. The first two patches fix the relevant issues, and the third patch sets vm_next and vm_prev to NULL when duplicating a vma to prevent potential misuse in the future. The effect of the first bug is that it causes rmap code to check both the parent's and the child's page tables, although a page couldn't be mapped by both parent and child, because the child vma has WIPEONFORK so all pages mapped by the child are 'new' and not relevant to the parent. The effect of the second bug is that the anon_vma relationship between parent and child becomes totally convoluted. It would cause 'son', 'grandson', ..., etc, to share the 'parent' anon_vma, which disobeys the design rule for reusing anon_vma (the rule is that reuse should only happen among vmas of the same process, and the vma should not have gone through fork). So, both issues cause unnecessary rmap walking and unexpected complexity. These two issues are not directly visible; I used debugging code to check the anon_vma pointers of parent and child while inspecting the suspicious implementation of issue #2, and then found the problem. This patch (of 3): In dup_mmap(), anon_vma_prepare() is called for a vma that has VM_WIPEONFORK, and the parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next and ->vm_prev as its parent vma. That allows the anon_vma used by the parent to be mistakenly shared by the child (find_mergeable_anon_vma() will do this reuse work). Besides this issue, calling anon_vma_prepare() should be avoided because we don't copy pages for this vma; preparing the anon_vma will be handled during fault. Fixes: d2cd9ede6e19 ("mm,fork: introduce MADV_WIPEONFORK") Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Rik van Riel Cc: Kirill A.
Shutemov Cc: Johannes Weiner Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- kernel/fork.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index d2a967bf85d5..d3a5915d3cfe 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -553,10 +553,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, if (retval) goto fail_nomem_anon_vma_fork; if (tmp->vm_flags & VM_WIPEONFORK) { - /* VM_WIPEONFORK gets a clean slate in the child. */ + /* + * VM_WIPEONFORK gets a clean slate in the child. + * Don't prepare anon_vma until fault since we don't + * copy page for current vma. + */ tmp->anon_vma = NULL; - if (anon_vma_prepare(tmp)) - goto fail_nomem_anon_vma_fork; } else if (anon_vma_fork(tmp, mpnt)) goto fail_nomem_anon_vma_fork; tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); From 23ab76bf90a66d45e557f238f8806d147fc3be5d Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Mon, 6 Apr 2020 20:03:36 -0700 Subject: [PATCH 003/158] Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork" This reverts commit 4e4a9eb921332b9d1 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork"). In dup_mmap(), anon_vma_fork() is called to attach the anon_vma, and the parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next and ->vm_prev as its parent vma. That causes the anon_vma used by the parent to be mistakenly shared by the child (in anon_vma_clone(), the code added by that commit will do this reuse work). Besides this issue, reusing the anon_vma from a vma which has gone through fork should be avoided by design ([1]). So, this patch reverts that commit and maintains the consistent logic of reusing anon_vma for fork/split/merge of vmas. Reusing anon_vma within the process is fine. But if a vma has gone through fork(), then that vma's anon_vma should not be shared with its neighbor vma. As explained in [1], when a vma has gone through fork(), the check for list_is_singular(vma->anon_vma_chain) will be false, and the anon_vma is not shared. With the current issue, one example clarifies this further. The parent process does two steps: 1. p_vma_1 is created and p_anon_vma_1 is prepared; 2. p_vma_2 is created and shares p_anon_vma_1 (this is allowed, because p_vma_1 didn't go through fork()); then the parent process does fork(): 3. c_vma_1 is duplicated from p_vma_1, and has its own c_anon_vma_1 prepared; at this point, c_vma_1->anon_vma_chain has two items, one for p_anon_vma_1 and one for c_anon_vma_1; 4. c_vma_2 is duplicated from p_vma_2; it is not allowed to share c_anon_vma_1, because c_vma_1->anon_vma_chain has two items. [1] commit d0e9fe1758f2 ("Simplify and comment on anon_vma re-use for anon_vma_prepare()") explains the test of "list_is_singular()". Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork") Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Cc: Kirill A.
Shutemov Cc: Matthew Wilcox Cc: Johannes Weiner Cc: Rik van Riel Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- mm/rmap.c | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 2df75a119c83..68fe0472c803 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -275,19 +275,6 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; struct anon_vma *root = NULL; - struct vm_area_struct *prev = dst->vm_prev, *pprev = src->vm_prev; - - /* - * If parent share anon_vma with its vm_prev, keep this sharing in in - * child. - * - * 1. Parent has vm_prev, which implies we have vm_prev. - * 2. Parent and its vm_prev have the same anon_vma. - */ - if (!dst->anon_vma && src->anon_vma && - pprev && pprev->anon_vma == src->anon_vma) - dst->anon_vma = prev->anon_vma; - list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { struct anon_vma *anon_vma; From e39a4b332df69f6850efda5ce6d932ef14a39042 Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Mon, 6 Apr 2020 20:03:39 -0700 Subject: [PATCH 004/158] mm: set vm_next and vm_prev to NULL in vm_area_dup() Set ->vm_next and ->vm_prev to NULL to prevent potential misuse from the new duplicated vma. Currently, only the fork path misuses these fields when handling anon_vma; no other bugs have been revealed with this patch applied. Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Johannes Weiner Cc: Rik van Riel Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/fork.c b/kernel/fork.c index d3a5915d3cfe..4385f3d639f2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -361,6 +361,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) if (new) { *new = *orig; INIT_LIST_HEAD(&new->anon_vma_chain); + new->vm_next = new->vm_prev = NULL; } return new; } @@ -562,7 +563,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, } else if (anon_vma_fork(tmp, mpnt)) goto fail_nomem_anon_vma_fork; tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); - tmp->vm_next = tmp->vm_prev = NULL; file = tmp->vm_file; if (file) { struct inode *inode = file_inode(file); From 7e96fb5710a806fd41d2eb6b41bac14a26a094e2 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:43 -0700 Subject: [PATCH 005/158] mm/vma: add missing VMA flag readable name for VM_SYNC Patch series "mm/vma: Use all available wrappers when possible", v2. Apart from adding a VMA flag readable name for trace purposes, this series does some open-encoding replacements with available VMA-specific wrappers. This skips the VM_HUGETLB check in vma_migratable() as it's already being done by another patch (https://patchwork.kernel.org/patch/11347831/) which is yet to be merged. This patch (of 4): This just adds the missing readable name for VM_SYNC.
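For context, the readable names in this table are what tracepoint output prints for each set vm_flags bit. A minimal user-space sketch of that flag-to-name decoding follows; the flag values and the two-entry table here are illustrative assumptions for the sketch, not the kernel's actual definitions:

#include <stdio.h>

#define VM_HUGETLB 0x00400000UL	/* illustrative values only */
#define VM_SYNC    0x00800000UL

static const struct { unsigned long flag; const char *name; } vma_flag_names[] = {
	{ VM_HUGETLB, "hugetlb" },
	{ VM_SYNC,    "sync" },		/* the entry this patch adds */
};

static void print_vma_flags(unsigned long vm_flags)
{
	unsigned int i;

	for (i = 0; i < sizeof(vma_flag_names) / sizeof(vma_flag_names[0]); i++)
		if (vm_flags & vma_flag_names[i].flag)
			printf("%s ", vma_flag_names[i].name);
	printf("\n");
}

int main(void)
{
	print_vma_flags(VM_HUGETLB | VM_SYNC);	/* prints: hugetlb sync */
	return 0;
}

Without such an entry, a set VM_SYNC bit is simply invisible in the decoded trace output, which is all this one-line patch addresses.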
Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Cc: Vlastimil Babka Cc: Steven Rostedt Cc: Ingo Molnar Cc: Alexander Viro Cc: Andy Lutomirski Cc: "Aneesh Kumar K.V" Cc: Arnaldo Carvalho de Melo Cc: Arnd Bergmann Cc: Benjamin Herrenschmidt Cc: Dave Hansen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Mel Gorman Cc: Michael Ellerman Cc: Nick Piggin Cc: Paul Burton Cc: Paul Mackerras Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/1582520593-30704-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- include/trace/events/mmflags.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index a1675d43777e..5fb752034386 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -154,6 +154,7 @@ IF_HAVE_PG_IDLE(PG_idle, "idle" ) {VM_ACCOUNT, "account" }, \ {VM_NORESERVE, "noreserve" }, \ {VM_HUGETLB, "hugetlb" }, \ + {VM_SYNC, "sync" }, \ __VM_ARCH_SPECIFIC_1 , \ {VM_WIPEONFORK, "wipeonfork" }, \ {VM_DONTDUMP, "dontdump" }, \ From 3122e80efc0faf4a2accba7a46c7ed795edbfded Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:47 -0700 Subject: [PATCH 006/158] mm/vma: make vma_is_accessible() available for general use Let's move the vma_is_accessible() helper to include/linux/mm.h, which makes it available for general use. While here, this replaces all remaining open encodings of the VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Geert Uytterhoeven Acked-by: Guo Ren Acked-by: Vlastimil Babka Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Ralf Baechle Cc: Paul Burton Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Steven Rostedt Cc: Mel Gorman Cc: Alexander Viro Cc: "Aneesh Kumar K.V" Cc: Arnaldo Carvalho de Melo Cc: Arnd Bergmann Cc: Nick Piggin Cc: Paul Mackerras Cc: Will Deacon Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/csky/mm/fault.c | 2 +- arch/m68k/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 +- arch/powerpc/mm/fault.c | 2 +- arch/sh/mm/fault.c | 2 +- arch/x86/mm/fault.c | 2 +- include/linux/mm.h | 6 ++++++ kernel/sched/fair.c | 2 +- mm/gup.c | 2 +- mm/memory.c | 5 ----- mm/mempolicy.c | 3 +-- mm/mmap.c | 5 ++--- 12 files changed, 17 insertions(+), 18 deletions(-) diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index d3c61b83e195..a6e8230b6fbf 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -141,7 +141,7 @@ good_area: if (!(vma->vm_flags & VM_WRITE)) goto bad_area; } else { - if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) + if (!vma_is_accessible(vma)) goto bad_area; } diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index f7afb9897966..0c4a21a685d5 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -125,7 +125,7 @@ good_area: case 1: /* read, present */ goto acc_err; case 0: /* read, not present */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + if (!vma_is_accessible(vma)) goto acc_err; } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 4a0eafe3d932..fb048ba2b91d 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -142,7 +142,7 @@ good_area: goto bad_area; } } else { - if
(!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) + if (!vma_is_accessible(vma)) goto bad_area; } } diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index d15f0f0ee806..84af6c8eecf7 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -314,7 +314,7 @@ static bool access_error(bool is_write, bool is_exec, return false; } - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return true; /* * We should ideally do the vma pkey access check here. But in the diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index 13ee4d20e622..5f23d7907597 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -355,7 +355,7 @@ static inline int access_error(int error_code, struct vm_area_struct *vma) return 1; /* read, not present: */ - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return 1; return 0; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 859519f5b342..a51df516b87b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1222,7 +1222,7 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) return 1; /* read, not present: */ - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return 1; return 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 7dd5c4ccbf85..be49e371e4b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -629,6 +629,12 @@ static inline bool vma_is_foreign(struct vm_area_struct *vma) return false; } + +static inline bool vma_is_accessible(struct vm_area_struct *vma) +{ + return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC); +} + #ifdef CONFIG_SHMEM /* * The vma_is_shmem is not inline because it is used only by slow diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d7fb20adabeb..1ea3dddafe69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2799,7 +2799,7 @@ static void task_numa_work(struct callback_head *work) * Skip inaccessible VMAs to avoid any confusion between * PROT_NONE and NUMA hinting ptes */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + if (!vma_is_accessible(vma)) continue; do { diff --git a/mm/gup.c b/mm/gup.c index da3e03185144..4d505c994623 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1416,7 +1416,7 @@ long populate_vma_page_range(struct vm_area_struct *vma, * We want mlock to succeed for regions that have any permissions * other than PROT_NONE. 
*/ - if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) + if (vma_is_accessible(vma)) gup_flags |= FOLL_FORCE; /* diff --git a/mm/memory.c b/mm/memory.c index 586271f3efc6..d2a353c345ad 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3964,11 +3964,6 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) return VM_FAULT_FALLBACK; } -static inline bool vma_is_accessible(struct vm_area_struct *vma) -{ - return vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE); -} - static vm_fault_t create_huge_pud(struct vm_fault *vmf) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 5fb427aed612..b36926ba02e2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -678,8 +678,7 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, if (flags & MPOL_MF_LAZY) { /* Similar to task_numa_work, skip inaccessible VMAs */ - if (!is_vm_hugetlb_page(vma) && - (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) && + if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) && !(vma->vm_flags & VM_MIXEDMAP)) change_prot_numa(vma, start, endvma); return 1; diff --git a/mm/mmap.c b/mm/mmap.c index 94ae18398c59..aa09429d5888 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2358,8 +2358,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) gap_addr = TASK_SIZE; next = vma->vm_next; - if (next && next->vm_start < gap_addr && - (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) { + if (next && next->vm_start < gap_addr && vma_is_accessible(next)) { if (!(next->vm_flags & VM_GROWSUP)) return -ENOMEM; /* Check that both stack segments have the same anon_vma? */ @@ -2440,7 +2439,7 @@ int expand_downwards(struct vm_area_struct *vma, prev = vma->vm_prev; /* Check that both stack segments have the same anon_vma? */ if (prev && !(prev->vm_flags & VM_GROWSDOWN) && - (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) { + vma_is_accessible(prev)) { if (address - prev->vm_end < stack_guard_gap) return -ENOMEM; } From 03911132aafd6727e59408e497c049402a5a11fa Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:51 -0700 Subject: [PATCH 007/158] mm/vma: replace all remaining open encodings with is_vm_hugetlb_page() This replaces all remaining open encodings with is_vm_hugetlb_page(). 
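For reference, the wrapper is semantically identical to the open-coded flag test it replaces when hugetlb support is built in, and reduces to a constant false when it is not, so callers' checks vanish entirely. A stand-alone sketch of that equivalence, using a mocked-up vm_area_struct rather than the kernel's (the flag value is an illustrative assumption):

#include <stdbool.h>

#define CONFIG_HUGETLB_PAGE 1		/* pretend hugetlb support is built in */
#define VM_HUGETLB 0x00400000UL		/* illustrative value only */

struct vm_area_struct { unsigned long vm_flags; };	/* mock, not the kernel struct */

#ifdef CONFIG_HUGETLB_PAGE
static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
{
	return !!(vma->vm_flags & VM_HUGETLB);	/* the same test as the open coding */
}
#else
static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
{
	return false;	/* hugetlb compiled out: the check constant-folds away */
}
#endif

int main(void)
{
	struct vm_area_struct vma = { .vm_flags = VM_HUGETLB };
	return is_vm_hugetlb_page(&vma) ? 0 : 1;	/* exits 0 here */
}

That constant-folding is the practical benefit of using the wrapper over the raw flag test in code that must also build without hugetlb support.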
Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Alexander Viro Cc: Will Deacon Cc: "Aneesh Kumar K.V" Cc: Nick Piggin Cc: Peter Zijlstra Cc: Arnd Bergmann Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Andy Lutomirski Cc: Dave Hansen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Mel Gorman Cc: Paul Burton Cc: Paul Mackerras Cc: Ralf Baechle Cc: Rich Felker Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/e500_mmu_host.c | 2 +- fs/binfmt_elf.c | 3 ++- include/asm-generic/tlb.h | 3 ++- kernel/events/core.c | 3 ++- 4 files changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 425d13806645..df9989cf7ba3 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -422,7 +422,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, break; } } else if (vma && hva >= vma->vm_start && - (vma->vm_flags & VM_HUGETLB)) { + is_vm_hugetlb_page(vma)) { unsigned long psize = vma_kernel_pagesize(vma); tsize = (gtlbe->mas1 & MAS1_TSIZE_MASK) >> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index f4713ea76e82..1eb63867e266 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -1317,7 +1318,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma, } /* Hugetlb memory check */ - if (vma->vm_flags & VM_HUGETLB) { + if (is_vm_hugetlb_page(vma)) { if ((vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_SHARED)) goto whole; if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE)) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index f391f6b500b4..3f1649a8cf55 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -13,6 +13,7 @@ #include #include +#include #include #include #include @@ -398,7 +399,7 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) * We rely on tlb_end_vma() to issue a flush, such that when we reset * these values the batch is empty. */ - tlb->vma_huge = !!(vma->vm_flags & VM_HUGETLB); + tlb->vma_huge = is_vm_hugetlb_page(vma); tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); } diff --git a/kernel/events/core.c b/kernel/events/core.c index 81e6d80cb219..55e44417f66d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -7973,7 +7974,7 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event) flags |= MAP_EXECUTABLE; if (vma->vm_flags & VM_LOCKED) flags |= MAP_LOCKED; - if (vma->vm_flags & VM_HUGETLB) + if (is_vm_hugetlb_page(vma)) flags |= MAP_HUGETLB; if (file) { From a0137f16dfe870d63bfb9acdc34b67745270ab4b Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:55 -0700 Subject: [PATCH 008/158] mm/vma: replace all remaining open encodings with vma_is_anonymous() This replaces all remaining open encodings with vma_is_anonymous(). 
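The test being wrapped is simply whether the VMA has no vm_ops, which is what distinguishes anonymous mappings; in the gup.c hunk below, vma_is_anonymous(vma) therefore preserves the old (!vma->vm_ops || !vma->vm_ops->fault) short-circuit. A minimal stand-alone sketch with mocked types (the kernel's helper in include/linux/mm.h reduces to the same one-line test):

#include <stdbool.h>
#include <stddef.h>

struct vm_operations_struct;	/* opaque for this sketch */
struct vm_area_struct {		/* mock, not the kernel struct */
	const struct vm_operations_struct *vm_ops;
};

/* An anonymous VMA has no vm_ops; file-backed and special mappings do. */
static inline bool vma_is_anonymous(struct vm_area_struct *vma)
{
	return !vma->vm_ops;
}

int main(void)
{
	struct vm_area_struct anon = { .vm_ops = NULL };
	return vma_is_anonymous(&anon) ? 0 : 1;	/* exits 0 here */
}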
Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Andy Lutomirski Cc: "Aneesh Kumar K.V" Cc: Arnaldo Carvalho de Melo Cc: Arnd Bergmann Cc: Benjamin Herrenschmidt Cc: Dave Hansen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Ingo Molnar Cc: Mel Gorman Cc: Michael Ellerman Cc: Nick Piggin Cc: Paul Burton Cc: Paul Mackerras Cc: Paul Mackerras Cc: Peter Zijlstra Cc: Ralf Baechle Cc: Rich Felker Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Will Deacon Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/1582520593-30704-5-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- mm/gup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index 4d505c994623..96af7e08db4b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -351,7 +351,8 @@ static struct page *no_page_table(struct vm_area_struct *vma, * But we can only make this optimization where a hole would surely * be zero-filled if handle_mm_fault() actually did handle it. */ - if ((flags & FOLL_DUMP) && (!vma->vm_ops || !vma->vm_ops->fault)) + if ((flags & FOLL_DUMP) && + (vma_is_anonymous(vma) || !vma->vm_ops->fault)) return ERR_PTR(-EFAULT); return NULL; } From 5093c5872be3a041917828b2b53e9f981d754768 Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:59 -0700 Subject: [PATCH 009/158] mm/vma: append unlikely() while testing VMA access permissions It is unlikely that an inaccessible VMA without the required permission flags will get a page fault. Hence let's just append the unlikely() directive to such checks in order to improve performance while also standardizing them across various platforms. Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Ralf Baechle Cc: Paul Burton Cc: Mike Rapoport Link: http://lkml.kernel.org/r/1582525304-32113-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/csky/mm/fault.c | 2 +- arch/m68k/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index a6e8230b6fbf..4e6dc68f3258 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -141,7 +141,7 @@ good_area: if (!(vma->vm_flags & VM_WRITE)) goto bad_area; } else { - if (!vma_is_accessible(vma)) + if (unlikely(!vma_is_accessible(vma))) goto bad_area; } diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 0c4a21a685d5..3bfb5c8ac3c7 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -125,7 +125,7 @@ good_area: case 1: /* read, present */ goto acc_err; case 0: /* read, not present */ - if (!vma_is_accessible(vma)) + if (unlikely(!vma_is_accessible(vma))) goto acc_err; } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index fb048ba2b91d..f8d62cd83b36 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -142,7 +142,7 @@ good_area: goto bad_area; } } else { - if (!vma_is_accessible(vma)) + if (unlikely(!vma_is_accessible(vma))) goto bad_area; } } From d8cc323d951e7903b5efffad292116148815dc4e Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Mon, 6 Apr 2020 20:04:02 -0700 Subject: [PATCH 010/158] mm/vmalloc: fix a typo in comment There is a typo in a comment; fix it.
"exeeds" -> "exceeds" Signed-off-by: Qiujun Huang Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200404060136.10838-1-hqjagain@gmail.com Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6b8eeb0ecee5..399f219544f7 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3368,7 +3368,7 @@ retry: goto overflow; /* - * If required width exeeds current VA block, move + * If required width exceeds current VA block, move * base downwards and then recheck. */ if (base + end > va->va_end) { From 29fd1897070125ab49634524a20f146fb4240a51 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Mon, 6 Apr 2020 20:04:06 -0700 Subject: [PATCH 011/158] mm: make it clear that gfp reclaim modifiers are valid only for sleepable allocations While it might be really clear to MM developers that gfp reclaim modifiers are applicable only to sleepable allocations (those with __GFP_DIRECT_RECLAIM) it seems that actual users of the API are not always sure. Make it explicit that they are not applicable for GFP_NOWAIT or GFP_ATOMIC allocations which are the most commonly used non-sleepable allocation masks. Signed-off-by: Michal Hocko Signed-off-by: Andrew Morton Reviewed-by: Joel Fernandes (Google) Acked-by: Paul E. McKenney Acked-by: David Rientjes Cc: Neil Brown Link: http://lkml.kernel.org/r/20200403083543.11552-3-mhocko@kernel.org Signed-off-by: Linus Torvalds --- include/linux/gfp.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index be2754841369..4aba4c86c626 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -124,6 +124,8 @@ struct vm_area_struct; * * Reclaim modifiers * ~~~~~~~~~~~~~~~~~ + * Please note that all the following flags are only applicable to sleepable + * allocations (e.g. %GFP_NOWAIT and %GFP_ATOMIC will ignore them). * * %__GFP_IO can start physical IO. * From 4afdacec2e313509bf9c25f3370be299a0d60b9a Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 6 Apr 2020 20:04:09 -0700 Subject: [PATCH 012/158] mm/migrate.c: no need to check for i > start in do_pages_move() Patch series "cleanup on do_pages_move()", v5. The logic in do_pages_move() is a little mess for audience to read and has some potential error on handling the return value. Especially there are three calls on do_move_pages_to_node() and store_status() with almost the same form. This patch set tries to make the code a little friendly for audience by consolidate the calls. This patch (of 4): At this point, we always have i >= start. If i == start, store_status() will return 0. So we can drop the check for i > start. 
[david@redhat.com rephrase changelog] Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Link: http://lkml.kernel.org/r/20200214003017.25558-2-richardw.yang@linux.intel.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 7ded07081be9..20bc562f499e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1694,11 +1694,9 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, err += nr_pages - i - 1; goto out; } - if (i > start) { - err = store_status(status, start, current_node, i - start); - if (err) - goto out; - } + err = store_status(status, start, current_node, i - start); + if (err) + goto out; current_node = NUMA_NO_NODE; } out_flush: From 7ca8783ad816972f0c64a189b8ba6615c6b78521 Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 6 Apr 2020 20:04:12 -0700 Subject: [PATCH 013/158] mm/migrate.c: wrap do_move_pages_to_node() and store_status() Usually, do_move_pages_to_node() and store_status() are used in combination. We have three similar call sites. Let's provide a wrapper for both function calls - move_pages_and_store_status - to make the calling code easier to maintain and fix (as noted by Yang Shi, the return value handling of do_move_pages_to_node() has a flaw). [david@redhat.com rephrase changelog] Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Link: http://lkml.kernel.org/r/20200214003017.25558-3-richardw.yang@linux.intel.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 61 +++++++++++++++++++++++++--------------------------- 1 file changed, 29 insertions(+), 32 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 20bc562f499e..6f6317903007 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1602,6 +1602,29 @@ out: return err; } +static int move_pages_and_store_status(struct mm_struct *mm, int node, + struct list_head *pagelist, int __user *status, + int start, int i, unsigned long nr_pages) +{ + int err; + + err = do_move_pages_to_node(mm, pagelist, node); + if (err) { + /* + * Positive err means the number of failed + * pages to migrate. Since we are going to + * abort and return the number of non-migrated + * pages, so need to incude the rest of the + * nr_pages that have not been attempted as + * well. + */ + if (err > 0) + err += nr_pages - i - 1; + return err; + } + return store_status(status, start, node, i - start); +} + /* * Migrate an array of page address onto an array of nodes and fill * the corresponding array of status. @@ -1645,21 +1668,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, current_node = node; start = i; } else if (node != current_node) { - err = do_move_pages_to_node(mm, &pagelist, current_node); - if (err) { - /* - * Positive err means the number of failed - * pages to migrate. Since we are going to - * abort and return the number of non-migrated - * pages, so need to incude the rest of the - * nr_pages that have not been attempted as - * well. 
- */ - if (err > 0) - err += nr_pages - i - 1; - goto out; - } - err = store_status(status, start, current_node, i - start); + err = move_pages_and_store_status(mm, current_node, + &pagelist, status, start, i, nr_pages); if (err) goto out; start = i; @@ -1688,13 +1698,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, if (err) goto out_flush; - err = do_move_pages_to_node(mm, &pagelist, current_node); - if (err) { - if (err > 0) - err += nr_pages - i - 1; - goto out; - } - err = store_status(status, start, current_node, i - start); + err = move_pages_and_store_status(mm, current_node, &pagelist, + status, start, i, nr_pages); if (err) goto out; current_node = NUMA_NO_NODE; @@ -1704,16 +1709,8 @@ out_flush: return err; /* Make sure we do not overwrite the existing error */ - err1 = do_move_pages_to_node(mm, &pagelist, current_node); - /* - * Don't have to report non-attempted pages here since: - * - If the above loop is done gracefully all pages have been - * attempted. - * - If the above loop is aborted it means a fatal error - * happened, should return ret. - */ - if (!err1) - err1 = store_status(status, start, current_node, i - start); + err1 = move_pages_and_store_status(mm, current_node, &pagelist, + status, start, i, nr_pages); if (err >= 0) err = err1; out: From 5d7ae891cdc68f30e8b6db35bbd7f0b1aeee189c Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 6 Apr 2020 20:04:15 -0700 Subject: [PATCH 014/158] mm/migrate.c: check pagelist in move_pages_and_store_status() When the pagelist is empty, it is not necessary to do the move and store. This also consolidates the empty-list check in one place. Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Link: http://lkml.kernel.org/r/20200214003017.25558-4-richardw.yang@linux.intel.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 6f6317903007..80eaa99bed39 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1518,9 +1518,6 @@ static int do_move_pages_to_node(struct mm_struct *mm, { int err; - if (list_empty(pagelist)) - return 0; - err = migrate_pages(pagelist, alloc_new_node_page, NULL, node, MIGRATE_SYNC, MR_SYSCALL); if (err) @@ -1608,6 +1605,9 @@ static int move_pages_and_store_status(struct mm_struct *mm, int node, { int err; + if (list_empty(pagelist)) + return 0; + err = do_move_pages_to_node(mm, pagelist, node); if (err) { /* @@ -1705,9 +1705,6 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, current_node = NUMA_NO_NODE; } out_flush: - if (list_empty(&pagelist)) - return err; - /* Make sure we do not overwrite the existing error */ err1 = move_pages_and_store_status(mm, current_node, &pagelist, status, start, i, nr_pages); From d08221a0807b0489f0081476bcd36da88722560b Mon Sep 17 00:00:00 2001 From: Wei Yang Date: Mon, 6 Apr 2020 20:04:18 -0700 Subject: [PATCH 015/158] mm/migrate.c: unify "not queued for migration" handling in do_pages_move() It can currently happen that we store the status of a page twice: * Once we detect that it is already on the target node * Once we moved a bunch of pages, and a page that's already on the target node is contained in the current interval. Let's simplify the code and always call do_move_pages_to_node() in case we did not queue a page for migration.
Note that pages that are already on the target node are not added to the pagelist and are, therefore, ignored by do_move_pages_to_node() - there is no functional change. The status of such a page is now only stored once. [david@redhat.com rephrase changelog] Signed-off-by: Wei Yang Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Link: http://lkml.kernel.org/r/20200214003017.25558-5-richardw.yang@linux.intel.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 80eaa99bed39..c550230664b1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1683,18 +1683,16 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes, err = add_page_for_migration(mm, addr, current_node, &pagelist, flags & MPOL_MF_MOVE_ALL); - if (!err) { - /* The page is already on the target node */ - err = store_status(status, i, current_node, 1); - if (err) - goto out_flush; - continue; - } else if (err > 0) { + if (err > 0) { /* The page is successfully queued for migration */ continue; } - err = store_status(status, i, err, 1); + /* + * If the page is already on the target node (!err), store the + * node, otherwise, store the err. + */ + err = store_status(status, i, err ? : current_node, 1); if (err) goto out_flush; From 6aeff241fe6c4561a842b344c1ca14a700ec8441 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Mon, 6 Apr 2020 20:04:21 -0700 Subject: [PATCH 016/158] mm/migrate.c: migrate PG_readahead flag Currently the migration code doesn't migrate the PG_readahead flag. Theoretically this would incur a slight performance loss, as the application might have to ramp its readahead back up again. Even when this problem happens, it might be hidden by something else, since migration is typically triggered by compaction or NUMA balancing, either of which should be more noticeable. Migrate the flag after end_page_writeback(), since it may clear the PG_reclaim flag, which is the same bit as PG_readahead, for the new page. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Yang Shi Signed-off-by: Andrew Morton Cc: Matthew Wilcox Cc: Michal Hocko Cc: Mel Gorman Link: http://lkml.kernel.org/r/1581640185-95731-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Linus Torvalds --- mm/migrate.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/migrate.c b/mm/migrate.c index c550230664b1..1a205503be3f 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -647,6 +647,14 @@ void migrate_page_states(struct page *newpage, struct page *page) if (PageWriteback(newpage)) end_page_writeback(newpage); + /* + * PG_readahead shares the same bit with PG_reclaim. The above + * end_page_writeback() may clear PG_readahead mistakenly, so set the + * bit after that. + */ + if (PageReadahead(page)) + SetPageReadahead(newpage); + copy_page_owner(page, newpage); mem_cgroup_migrate(page, newpage); From dcdf11ee144133328664d90836e712d840d047d9 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Mon, 6 Apr 2020 20:04:25 -0700 Subject: [PATCH 017/158] mm, shmem: add vmstat for hugepage fallback The existing thp_fault_fallback indicates when thp attempts to allocate a hugepage but fails, or if the hugepage cannot be charged to the mem cgroup hierarchy. Extend this to shmem as well. Adds a new thp_file_fallback to complement thp_file_alloc that gets incremented when a hugepage is attempted to be allocated but fails, or if it cannot be charged to the mem cgroup hierarchy.
Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from shmem_alloc_hugepage() since it is only called with this configuration option. Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Reviewed-by: Yang Shi Acked-by: Kirill A. Shutemov Cc: Mike Rapoport Cc: Jeremy Cline Cc: Andrea Arcangeli Cc: Mike Kravetz Cc: Michal Hocko Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/transhuge.rst | 4 ++++ include/linux/vm_event_item.h | 2 ++ mm/shmem.c | 10 ++++++---- mm/vmstat.c | 1 + 4 files changed, 13 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index bd5714547cee..f79ebbcd6725 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -319,6 +319,10 @@ thp_file_alloc is incremented every time a file huge page is successfully allocated. +thp_file_fallback + is incremented if a file huge page is attempted to be allocated + but fails and instead falls back to using small pages. + thp_file_mapped is incremented every time a file huge page is mapped into user address space. diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 47a3441cf4c4..41a4c7568748 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -76,6 +76,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, THP_COLLAPSE_ALLOC, THP_COLLAPSE_ALLOC_FAILED, THP_FILE_ALLOC, + THP_FILE_FALLBACK, THP_FILE_MAPPED, THP_SPLIT_PAGE, THP_SPLIT_PAGE_FAILED, @@ -115,6 +116,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifndef CONFIG_TRANSPARENT_HUGEPAGE #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; }) +#define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; }) #define THP_FILE_MAPPED ({ BUILD_BUG(); 0; }) #endif diff --git a/mm/shmem.c b/mm/shmem.c index f47347cb30f6..8160d0762bf5 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1472,9 +1472,6 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp, pgoff_t hindex; struct page *page; - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) - return NULL; - hindex = round_down(index, HPAGE_PMD_NR); if (xa_find(&mapping->i_pages, &hindex, hindex + HPAGE_PMD_NR - 1, XA_PRESENT)) @@ -1486,6 +1483,8 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp, shmem_pseudo_vma_destroy(&pvma); if (page) prep_transhuge_page(page); + else + count_vm_event(THP_FILE_FALLBACK); return page; } @@ -1871,8 +1870,11 @@ alloc_nohuge: error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg, PageTransHuge(page)); - if (error) + if (error) { + if (PageTransHuge(page)) + count_vm_event(THP_FILE_FALLBACK); goto unacct; + } error = shmem_add_to_page_cache(page, mapping, hindex, NULL, gfp & GFP_RECLAIM_MASK); if (error) { diff --git a/mm/vmstat.c b/mm/vmstat.c index c9c0d71f917f..ebc1dcaa0539 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1259,6 +1259,7 @@ const char * const vmstat_text[] = { "thp_collapse_alloc", "thp_collapse_alloc_failed", "thp_file_alloc", + "thp_file_fallback", "thp_file_mapped", "thp_split_page", "thp_split_page_failed", From 85b9f46e8ea451633ccd60a7d8cacbfff9f34047 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Mon, 6 Apr 2020 20:04:28 -0700 Subject: [PATCH 018/158] mm, thp: track fallbacks due to failed memcg charges separately The thp_fault_fallback and thp_file_fallback vmstats are incremented if either the hugepage allocation fails through the page allocator 
or the hugepage charge fails through mem cgroup. This patch leaves those fields untouched but adds two new fields, thp_{fault,file}_fallback_charge, which are incremented only when the mem cgroup charge fails. This distinguishes between attempted hugepage allocations that fail due to fragmentation (or low memory conditions) and those that fail due to mem cgroup limits. That can be used to determine the impact of fragmentation on the system by excluding faults that failed due to memcg usage. Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Reviewed-by: Yang Shi Acked-by: Kirill A. Shutemov Cc: Mike Rapoport Cc: Jeremy Cline Cc: Andrea Arcangeli Cc: Mike Kravetz Cc: Michal Hocko Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/transhuge.rst | 10 ++++++++++ include/linux/vm_event_item.h | 3 +++ mm/huge_memory.c | 2 ++ mm/shmem.c | 4 +++- mm/vmstat.c | 2 ++ 5 files changed, 20 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index f79ebbcd6725..2f31de8f7c74 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -310,6 +310,11 @@ thp_fault_fallback is incremented if a page fault fails to allocate a huge page and instead falls back to using small pages. +thp_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using small pages even though the + allocation was successful. + thp_collapse_alloc_failed is incremented if khugepaged found a range of pages that should be collapsed into one huge page but failed @@ -323,6 +328,11 @@ thp_file_fallback is incremented if a file huge page is attempted to be allocated but fails and instead falls back to using small pages. +thp_file_fallback_charge + is incremented if a file huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + thp_file_mapped is incremented every time a file huge page is mapped into user address space.
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 41a4c7568748..ffef0f279747 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -73,10 +73,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_TRANSPARENT_HUGEPAGE THP_FAULT_ALLOC, THP_FAULT_FALLBACK, + THP_FAULT_FALLBACK_CHARGE, THP_COLLAPSE_ALLOC, THP_COLLAPSE_ALLOC_FAILED, THP_FILE_ALLOC, THP_FILE_FALLBACK, + THP_FILE_FALLBACK_CHARGE, THP_FILE_MAPPED, THP_SPLIT_PAGE, THP_SPLIT_PAGE_FAILED, @@ -117,6 +119,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifndef CONFIG_TRANSPARENT_HUGEPAGE #define THP_FILE_ALLOC ({ BUILD_BUG(); 0; }) #define THP_FILE_FALLBACK ({ BUILD_BUG(); 0; }) +#define THP_FILE_FALLBACK_CHARGE ({ BUILD_BUG(); 0; }) #define THP_FILE_MAPPED ({ BUILD_BUG(); 0; }) #endif diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0f9389f9d1f8..0080e8df18ef 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -597,6 +597,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg, true)) { put_page(page); count_vm_event(THP_FAULT_FALLBACK); + count_vm_event(THP_FAULT_FALLBACK_CHARGE); return VM_FAULT_FALLBACK; } @@ -1446,6 +1447,7 @@ alloc: put_page(page); ret |= VM_FAULT_FALLBACK; count_vm_event(THP_FAULT_FALLBACK); + count_vm_event(THP_FAULT_FALLBACK_CHARGE); goto out; } diff --git a/mm/shmem.c b/mm/shmem.c index 8160d0762bf5..b48ac3806f8f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1871,8 +1871,10 @@ alloc_nohuge: error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg, PageTransHuge(page)); if (error) { - if (PageTransHuge(page)) + if (PageTransHuge(page)) { count_vm_event(THP_FILE_FALLBACK); + count_vm_event(THP_FILE_FALLBACK_CHARGE); + } goto unacct; } error = shmem_add_to_page_cache(page, mapping, hindex, diff --git a/mm/vmstat.c b/mm/vmstat.c index ebc1dcaa0539..96d21a792b57 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1256,10 +1256,12 @@ const char * const vmstat_text[] = { #ifdef CONFIG_TRANSPARENT_HUGEPAGE "thp_fault_alloc", "thp_fault_fallback", + "thp_fault_fallback_charge", "thp_collapse_alloc", "thp_collapse_alloc_failed", "thp_file_alloc", "thp_file_fallback", + "thp_file_fallback_charge", "thp_file_mapped", "thp_split_page", "thp_split_page_failed", From a0650604a707b1926cbc32feb2f006ebb74ef47b Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:04:31 -0700 Subject: [PATCH 019/158] include/linux/pagemap.h: optimise find_subpage for !THP If THP is disabled, find_subpage() can become a no-op by using hpage_nr_pages() instead of compound_nr(). hpage_nr_pages() embeds a check for PageTail, so we can drop the check here. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Acked-by: Kirill A. 
Shutemov Cc: Aneesh Kumar K.V Cc: Pankaj Gupta Link: http://lkml.kernel.org/r/20200318140253.6141-5-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/pagemap.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index f56282491a48..a8f7bd8ea1c6 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -341,9 +341,7 @@ static inline struct page *find_subpage(struct page *head, pgoff_t index) if (PageHuge(head)) return head; - VM_BUG_ON_PAGE(PageTail(head), head); - - return head + (index & (compound_nr(head) - 1)); + return head + (index & (hpage_nr_pages(head) - 1)); } struct page *find_get_entry(struct address_space *mapping, pgoff_t offset); From 396bcc5299c281e9cf1737ad0efcd97be9f83845 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:04:35 -0700 Subject: [PATCH 020/158] mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE") notes that it should be reverted when the PowerPC problem was fixed. The commit fixing the PowerPC problem (953c66c2b22a) did not revert the commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an oversight, so remove the Kconfig symbol and undo the work of commit e496cf3d7821. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Aneesh Kumar K.V Cc: Christoph Hellwig Cc: Pankaj Gupta Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org Signed-off-by: Linus Torvalds --- include/linux/shmem_fs.h | 10 +--------- mm/Kconfig | 6 +----- mm/huge_memory.c | 2 +- mm/khugepaged.c | 12 ++++-------- mm/memory.c | 5 ++--- mm/rmap.c | 2 +- mm/shmem.c | 34 +++++++++++++++++----------------- 7 files changed, 27 insertions(+), 44 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index d56fefef8905..7a35a6901221 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -78,6 +78,7 @@ extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); extern int shmem_unuse(unsigned int type, bool frontswap, unsigned long *fs_pages_to_unuse); +extern bool shmem_huge_enabled(struct vm_area_struct *vma); extern unsigned long shmem_swap_usage(struct vm_area_struct *vma); extern unsigned long shmem_partial_swap_usage(struct address_space *mapping, pgoff_t start, pgoff_t end); @@ -114,15 +115,6 @@ static inline bool shmem_file(struct file *file) extern bool shmem_charge(struct inode *inode, long pages); extern void shmem_uncharge(struct inode *inode, long pages); -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE -extern bool shmem_huge_enabled(struct vm_area_struct *vma); -#else -static inline bool shmem_huge_enabled(struct vm_area_struct *vma) -{ - return false; -} -#endif - #ifdef CONFIG_SHMEM extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, diff --git a/mm/Kconfig b/mm/Kconfig index ab80933be65f..211a70e8d5cf 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -420,10 +420,6 @@ config THP_SWAP For selection by architectures with reasonable THP sizes. 
-config TRANSPARENT_HUGE_PAGECACHE - def_bool y - depends on TRANSPARENT_HUGEPAGE - # # UP and nommu archs use km based percpu allocator # @@ -714,7 +710,7 @@ config GUP_GET_PTE_LOW_HIGH config READ_ONLY_THP_FOR_FS bool "Read-only THP for filesystems (EXPERIMENTAL)" - depends on TRANSPARENT_HUGE_PAGECACHE && SHMEM + depends on TRANSPARENT_HUGEPAGE && SHMEM help Allow khugepaged to put read-only file-backed pages in THP. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0080e8df18ef..c1e7c71db1e6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -326,7 +326,7 @@ static struct attribute *hugepage_attr[] = { &defrag_attr.attr, &use_zero_page_attr.attr, &hpage_pmd_size_attr.attr, -#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE) +#ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif #ifdef CONFIG_DEBUG_VM diff --git a/mm/khugepaged.c b/mm/khugepaged.c index c659c68728bc..b1d9a8e189b8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -414,8 +414,6 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && (vm_flags & VM_DENYWRITE))) { - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) - return false; return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, HPAGE_PMD_NR); } @@ -1258,7 +1256,7 @@ static void collect_mm_slot(struct mm_slot *mm_slot) } } -#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE) +#ifdef CONFIG_SHMEM /* * Notify khugepaged that given addr of the mm is pte-mapped THP. Then * khugepaged should try to collapse the page table. @@ -1973,6 +1971,8 @@ skip: if (khugepaged_scan.address < hstart) khugepaged_scan.address = hstart; VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK); + if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma)) + goto skip; while (khugepaged_scan.address < hend) { int ret; @@ -1984,14 +1984,10 @@ skip: khugepaged_scan.address + HPAGE_PMD_SIZE > hend); if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file) { - struct file *file; + struct file *file = get_file(vma->vm_file); pgoff_t pgoff = linear_page_index(vma, khugepaged_scan.address); - if (shmem_file(vma->vm_file) - && !shmem_huge_enabled(vma)) - goto skip; - file = get_file(vma->vm_file); up_read(&mm->mmap_sem); ret = 1; khugepaged_scan_file(mm, file, pgoff, hpage); diff --git a/mm/memory.c b/mm/memory.c index d2a353c345ad..d527e0ec29c7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3373,7 +3373,7 @@ map_pte: return 0; } -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE static void deposit_prealloc_pte(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -3475,8 +3475,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg, pte_t entry; vm_fault_t ret; - if (pmd_none(*vmf->pmd) && PageTransCompound(page) && - IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) { + if (pmd_none(*vmf->pmd) && PageTransCompound(page)) { /* THP on COW? 
*/ VM_BUG_ON_PAGE(memcg, page); diff --git a/mm/rmap.c b/mm/rmap.c index 68fe0472c803..374a9bfdbffa 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -933,7 +933,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma, set_pte_at(vma->vm_mm, address, pte, entry); ret = 1; } else { -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE pmd_t *pmd = pvmw.pmd; pmd_t entry; diff --git a/mm/shmem.c b/mm/shmem.c index b48ac3806f8f..2c255f383608 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -410,7 +410,7 @@ static bool shmem_confirm_swap(struct address_space *mapping, #define SHMEM_HUGE_DENY (-1) #define SHMEM_HUGE_FORCE (-2) -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE /* ifdef here to avoid bloating shmem.o when not necessary */ static int shmem_huge __read_mostly; @@ -580,7 +580,7 @@ static long shmem_unused_huge_count(struct super_block *sb, struct shmem_sb_info *sbinfo = SHMEM_SB(sb); return READ_ONCE(sbinfo->shrinklist_len); } -#else /* !CONFIG_TRANSPARENT_HUGE_PAGECACHE */ +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ #define shmem_huge SHMEM_HUGE_DENY @@ -589,11 +589,11 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo, { return 0; } -#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo) { - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) && shmem_huge != SHMEM_HUGE_DENY) return true; @@ -1059,7 +1059,7 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr) * Part of the huge page can be beyond i_size: subject * to shrink under memory pressure. */ - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) { + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) { spin_lock(&sbinfo->shrinklist_lock); /* * _careful to defend against unlocked access to @@ -1510,7 +1510,7 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp, int nr; int err = -ENOSPC; - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) huge = false; nr = huge ? 
HPAGE_PMD_NR : 1; @@ -2093,7 +2093,7 @@ unsigned long shmem_get_unmapped_area(struct file *file, get_area = current->mm->get_unmapped_area; addr = get_area(file, uaddr, len, pgoff, flags); - if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE)) + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) return addr; if (IS_ERR_VALUE(addr)) return addr; @@ -2232,7 +2232,7 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) file_accessed(file); vma->vm_ops = &shmem_vm_ops; - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) < (vma->vm_end & HPAGE_PMD_MASK)) { khugepaged_enter(vma, vma->vm_flags); @@ -3459,7 +3459,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) case Opt_huge: ctx->huge = result.uint_32; if (ctx->huge != SHMEM_HUGE_NEVER && - !(IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && + !(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && has_transparent_hugepage())) goto unsupported_parameter; ctx->seen |= SHMEM_SEEN_HUGE; @@ -3605,7 +3605,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, sbinfo->gid)); -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */ if (sbinfo->huge) seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge)); @@ -3850,7 +3850,7 @@ static const struct super_operations shmem_ops = { .evict_inode = shmem_evict_inode, .drop_inode = generic_delete_inode, .put_super = shmem_put_super, -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE .nr_cached_objects = shmem_unused_huge_count, .free_cached_objects = shmem_unused_huge_scan, #endif @@ -3912,7 +3912,7 @@ int __init shmem_init(void) goto out1; } -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE if (has_transparent_hugepage() && shmem_huge > SHMEM_HUGE_DENY) SHMEM_SB(shm_mnt->mnt_sb)->huge = shmem_huge; else @@ -3928,7 +3928,7 @@ out2: return error; } -#if defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && defined(CONFIG_SYSFS) +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSFS) static ssize_t shmem_enabled_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -3980,9 +3980,9 @@ static ssize_t shmem_enabled_store(struct kobject *kobj, struct kobj_attribute shmem_enabled_attr = __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store); -#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE && CONFIG_SYSFS */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ -#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE +#ifdef CONFIG_TRANSPARENT_HUGEPAGE bool shmem_huge_enabled(struct vm_area_struct *vma) { struct inode *inode = file_inode(vma->vm_file); @@ -4017,7 +4017,7 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) return false; } } -#endif /* CONFIG_TRANSPARENT_HUGE_PAGECACHE */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #else /* !CONFIG_SHMEM */ @@ -4186,7 +4186,7 @@ int shmem_zero_setup(struct vm_area_struct *vma) vma->vm_file = file; vma->vm_ops = &shmem_vm_ops; - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && ((vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK) < (vma->vm_end & HPAGE_PMD_MASK)) { khugepaged_enter(vma, vma->vm_flags); From 7a9547fd4ebc5df52245e1df8a2e0bb481f20a92 Mon Sep 17 00:00:00 2001 From: Li Chen 
Date: Mon, 6 Apr 2020 20:04:38 -0700 Subject: [PATCH 021/158] mm/ksm.c: update get_user_pages() argument in comment This updates get_user_pages()'s argument in ksm_test_exit()'s comment. Signed-off-by: Li Chen Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/30ac2417-f1c7-f337-0beb-df561295298c@uniontech.com Signed-off-by: Linus Torvalds --- mm/ksm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/ksm.c b/mm/ksm.c index d17c7d57d0d8..363ec8189561 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -455,7 +455,7 @@ static inline bool ksm_test_exit(struct mm_struct *mm) /* * We use break_ksm to break COW on a ksm page: it's a stripped down * - * if (get_user_pages(addr, 1, 1, 1, &page, NULL) == 1) + * if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1) * put_page(page); * * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma, From 9de4f22a60f731943f050f4448bf2933ed3fa70b Mon Sep 17 00:00:00 2001 From: Huang Ying Date: Mon, 6 Apr 2020 20:04:41 -0700 Subject: [PATCH 022/158] mm: code cleanup for MADV_FREE Some comments for MADV_FREE are revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() inconsistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: David Hildenbrand Suggested-by: Johannes Weiner Suggested-by: David Rientjes Signed-off-by: "Huang, Ying" Signed-off-by: Andrew Morton Acked-by: Johannes Weiner Acked-by: David Rientjes Acked-by: Michal Hocko Acked-by: Pankaj Gupta Acked-by: Vlastimil Babka Cc: Dave Hansen Cc: Mel Gorman Cc: Minchan Kim Cc: Hugh Dickins Cc: Rik van Riel Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com Signed-off-by: Linus Torvalds --- include/linux/mm_inline.h | 15 ++++++++------- include/linux/page-flags.h | 5 +++++ include/trace/events/vmscan.h | 2 +- mm/compaction.c | 2 +- mm/gup.c | 2 +- mm/khugepaged.c | 4 ++-- mm/memory-failure.c | 2 +- mm/memory_hotplug.c | 2 +- mm/mempolicy.c | 2 +- mm/migrate.c | 16 ++++++++-------- mm/mprotect.c | 2 +- mm/swap.c | 16 ++++++++-------- mm/vmscan.c | 12 ++++++------ 13 files changed, 44 insertions(+), 38 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 6f2fef7b0784..219bef41d87c 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -6,19 +6,20 @@ #include /** - * page_is_file_cache - should the page be on a file LRU or anon LRU? + * page_is_file_lru - should the page be on a file LRU or anon LRU? * @page: the page to test * - * Returns 1 if @page is page cache page backed by a regular filesystem, - * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed. - * Used by functions that manipulate the LRU lists, to sort a page - * onto the right LRU list. + * Returns 1 if @page is a regular filesystem backed page cache page or a lazily + * freed anonymous page (e.g. via MADV_FREE). Returns 0 if @page is a normal + * anonymous page, a tmpfs page or otherwise ram or swap backed page. Used by + * functions that manipulate the LRU lists, to sort a page onto the right LRU + * list. * * We would like to get this info without a page flag, but the state * needs to survive until the page is last deleted from the LRU, which * could be as far down as __page_cache_release.
*/ -static inline int page_is_file_cache(struct page *page) +static inline int page_is_file_lru(struct page *page) { return !PageSwapBacked(page); } @@ -75,7 +76,7 @@ static __always_inline void del_page_from_lru_list(struct page *page, */ static inline enum lru_list page_lru_base_type(struct page *page) { - if (page_is_file_cache(page)) + if (page_is_file_lru(page)) return LRU_INACTIVE_FILE; return LRU_INACTIVE_ANON; } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 77de28bfefb0..acf7988fd640 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -63,6 +63,11 @@ * page_waitqueue(page) is a wait queue of all tasks waiting for the page * to become unlocked. * + * PG_swapbacked is set when a page uses swap as a backing storage. This are + * usually PageAnon or shmem pages but please note that even anonymous pages + * might lose their PG_swapbacked flag when they simply can be dropped (e.g. as + * a result of MADV_FREE). + * * PG_uptodate tells whether the page's contents is valid. When a read * completes, the page becomes uptodate, unless a disk I/O error happened. * diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index a5ab2973e8dc..74bb594ccb25 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -323,7 +323,7 @@ TRACE_EVENT(mm_vmscan_writepage, TP_fast_assign( __entry->pfn = page_to_pfn(page); __entry->reclaim_flags = trace_reclaim_flags( - page_is_file_cache(page)); + page_is_file_lru(page)); ), TP_printk("page=%p pfn=%lu flags=%s", diff --git a/mm/compaction.c b/mm/compaction.c index df3da2f76fdc..46af63eb8212 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -989,7 +989,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, /* Successfully isolated */ del_page_from_lru_list(page, lruvec, page_lru(page)); mod_node_page_state(page_pgdat(page), - NR_ISOLATED_ANON + page_is_file_cache(page), + NR_ISOLATED_ANON + page_is_file_lru(page), hpage_nr_pages(page)); isolate_success: diff --git a/mm/gup.c b/mm/gup.c index 96af7e08db4b..b185377c38b7 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1677,7 +1677,7 @@ check_again: list_add_tail(&head->lru, &cma_page_list); mod_node_page_state(page_pgdat(head), NR_ISOLATED_ANON + - page_is_file_cache(head), + page_is_file_lru(head), hpage_nr_pages(head)); } } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b1d9a8e189b8..3afc1e2d7a55 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -511,7 +511,7 @@ void __khugepaged_exit(struct mm_struct *mm) static void release_pte_page(struct page *page) { - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page)); + dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_lru(page)); unlock_page(page); putback_lru_page(page); } @@ -611,7 +611,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, goto out; } inc_node_page_state(page, - NR_ISOLATED_ANON + page_is_file_cache(page)); + NR_ISOLATED_ANON + page_is_file_lru(page)); VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 1c961cd26c0b..a96364be8ab4 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1810,7 +1810,7 @@ static int __soft_offline_page(struct page *page, int flags) */ if (!__PageMovable(page)) inc_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); + page_is_file_lru(page)); list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, NULL, 
MPOL_MF_MOVE_ALL, MIGRATE_SYNC, MR_MEMORY_FAILURE); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 19389cdc16a5..005eab3411e5 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1317,7 +1317,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) list_add_tail(&page->lru, &source); if (!__PageMovable(page)) inc_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); + page_is_file_lru(page)); } else { pr_warn("failed to isolate pfn %lx\n", pfn); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b36926ba02e2..037e5f548118 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1022,7 +1022,7 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist, if (!isolate_lru_page(head)) { list_add_tail(&head->lru, pagelist); mod_node_page_state(page_pgdat(head), - NR_ISOLATED_ANON + page_is_file_cache(head), + NR_ISOLATED_ANON + page_is_file_lru(head), hpage_nr_pages(head)); } else if (flags & MPOL_MF_STRICT) { /* diff --git a/mm/migrate.c b/mm/migrate.c index 1a205503be3f..c1412e04975e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -193,7 +193,7 @@ void putback_movable_pages(struct list_head *l) put_page(page); } else { mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + - page_is_file_cache(page), -hpage_nr_pages(page)); + page_is_file_lru(page), -hpage_nr_pages(page)); putback_lru_page(page); } } @@ -1219,7 +1219,7 @@ out: */ if (likely(!__PageMovable(page))) mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + - page_is_file_cache(page), -hpage_nr_pages(page)); + page_is_file_lru(page), -hpage_nr_pages(page)); } /* @@ -1592,7 +1592,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, err = 1; list_add_tail(&head->lru, pagelist); mod_node_page_state(page_pgdat(head), - NR_ISOLATED_ANON + page_is_file_cache(head), + NR_ISOLATED_ANON + page_is_file_lru(head), hpage_nr_pages(head)); } out_putpage: @@ -1955,7 +1955,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) return 0; } - page_lru = page_is_file_cache(page); + page_lru = page_is_file_lru(page); mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru, hpage_nr_pages(page)); @@ -1991,7 +1991,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, * Don't migrate file pages that are mapped in multiple processes * with execute permissions as they are probably shared libraries. */ - if (page_mapcount(page) != 1 && page_is_file_cache(page) && + if (page_mapcount(page) != 1 && page_is_file_lru(page) && (vma->vm_flags & VM_EXEC)) goto out; @@ -1999,7 +1999,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, * Also do not migrate dirty pages as not all filesystems can move * dirty pages in MIGRATE_ASYNC mode which is a waste of cycles. 
*/ - if (page_is_file_cache(page) && PageDirty(page)) + if (page_is_file_lru(page) && PageDirty(page)) goto out; isolated = numamigrate_isolate_page(pgdat, page); @@ -2014,7 +2014,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, if (!list_empty(&migratepages)) { list_del(&page->lru); dec_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_cache(page)); + page_is_file_lru(page)); putback_lru_page(page); } isolated = 0; @@ -2044,7 +2044,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm, pg_data_t *pgdat = NODE_DATA(node); int isolated = 0; struct page *new_page = NULL; - int page_lru = page_is_file_cache(page); + int page_lru = page_is_file_lru(page); unsigned long start = address & HPAGE_PMD_MASK; new_page = alloc_pages_node(node, diff --git a/mm/mprotect.c b/mm/mprotect.c index 311c0dadf71c..0fee14b39416 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -98,7 +98,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, * it cannot move them all from MIGRATE_ASYNC * context. */ - if (page_is_file_cache(page) && PageDirty(page)) + if (page_is_file_lru(page) && PageDirty(page)) continue; /* diff --git a/mm/swap.c b/mm/swap.c index a4af8c999963..18505990c3b1 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -276,7 +276,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec, void *arg) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { - int file = page_is_file_cache(page); + int file = page_is_file_lru(page); int lru = page_lru_base_type(page); del_page_from_lru_list(page, lruvec, lru); @@ -394,7 +394,7 @@ void mark_page_accessed(struct page *page) else __lru_cache_activate_page(page); ClearPageReferenced(page); - if (page_is_file_cache(page)) + if (page_is_file_lru(page)) workingset_activation(page); } if (page_is_idle(page)) @@ -515,7 +515,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec, return; active = PageActive(page); - file = page_is_file_cache(page); + file = page_is_file_lru(page); lru = page_lru_base_type(page); del_page_from_lru_list(page, lruvec, lru + active); @@ -548,7 +548,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec, void *arg) { if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { - int file = page_is_file_cache(page); + int file = page_is_file_lru(page); int lru = page_lru_base_type(page); del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE); @@ -573,9 +573,9 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec, ClearPageActive(page); ClearPageReferenced(page); /* - * lazyfree pages are clean anonymous pages. They have - * SwapBacked flag cleared to distinguish normal anonymous - * pages + * Lazyfree pages are clean anonymous pages. 
They have + * PG_swapbacked flag cleared, to distinguish them from normal + * anonymous pages */ ClearPageSwapBacked(page); add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE); @@ -962,7 +962,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec, if (page_evictable(page)) { lru = page_lru(page); - update_page_reclaim_stat(lruvec, page_is_file_cache(page), + update_page_reclaim_stat(lruvec, page_is_file_lru(page), PageActive(page)); if (was_unevictable) count_vm_event(UNEVICTABLE_PGRESCUED); diff --git a/mm/vmscan.c b/mm/vmscan.c index 2e8e690d2813..b06868fc4926 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -919,7 +919,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, * exceptional entries and shadow exceptional entries in the * same address_space. */ - if (reclaimed && page_is_file_cache(page) && + if (reclaimed && page_is_file_lru(page) && !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(page, target_memcg); __delete_from_page_cache(page, shadow); @@ -1043,7 +1043,7 @@ static void page_check_dirty_writeback(struct page *page, * Anonymous pages are not handled by flushers and must be written * from reclaim context. Do not stall reclaim based on them */ - if (!page_is_file_cache(page) || + if (!page_is_file_lru(page) || (PageAnon(page) && !PageSwapBacked(page))) { *dirty = false; *writeback = false; @@ -1315,7 +1315,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * the rest of the LRU for clean pages and see * the same dirty pages again (PageReclaim). */ - if (page_is_file_cache(page) && + if (page_is_file_lru(page) && (!current_is_kswapd() || !PageReclaim(page) || !test_bit(PGDAT_DIRTY, &pgdat->flags))) { /* @@ -1459,7 +1459,7 @@ activate_locked: try_to_free_swap(page); VM_BUG_ON_PAGE(PageActive(page), page); if (!PageMlocked(page)) { - int type = page_is_file_cache(page); + int type = page_is_file_lru(page); SetPageActive(page); stat->nr_activate[type] += nr_pages; count_memcg_page_event(page, PGACTIVATE); @@ -1497,7 +1497,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, LIST_HEAD(clean_pages); list_for_each_entry_safe(page, next, page_list, lru) { - if (page_is_file_cache(page) && !PageDirty(page) && + if (page_is_file_lru(page) && !PageDirty(page) && !__PageMovable(page) && !PageUnevictable(page)) { ClearPageActive(page); list_move(&page->lru, &clean_pages); @@ -2053,7 +2053,7 @@ static void shrink_active_list(unsigned long nr_to_scan, * IO, plus JVM can create lots of anon VM_EXEC pages, * so we ignore them here. */ - if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) { + if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) { list_add(&page->lru, &l_active); continue; } From a2129f24798a993abde9b4bf8b3713b52d56c121 Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:04:45 -0700 Subject: [PATCH 023/158] mm: adjust shuffle code to allow for future coalescing Patch series "mm / virtio: Provide support for free page reporting", v17. This series provides an asynchronous means of reporting free guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests on the host. Using this it is possible to avoid unnecessary I/O to disk and greatly improve performance in the case of memory overcommit on the host. When enabled we will be performing a scan of free memory every 2 seconds while pages of sufficiently high order are being freed. 
In each pass at least one sixteenth of each free list will be reported. By doing this we avoid racing against other threads that may be causing a high amount of memory churn. The lowest page order currently scanned when reporting pages is pageblock_order so that this feature will not interfere with the use of Transparent Huge Pages in the case of virtualization. Currently this is only in use by virtio-balloon, however there is the hope that at some point in the future other hypervisors might be able to make use of it. In the virtio-balloon/QEMU implementation the hypervisor is currently using MADV_DONTNEED to indicate to the host kernel that the page is currently free. It will be zeroed and faulted back into the guest the next time the page is accessed. To track if a page is reported or not the Uptodate flag was repurposed and used as a Reported flag for Buddy pages. We walk through the free list isolating pages and adding them to the scatterlist until we either encounter the end of the list or have processed at least one sixteenth of the pages that were listed in nr_free prior to us starting. If we fill the scatterlist before we reach the end of the list we rotate the list so that the first unreported page we encounter is moved to the head of the list as that is where we will resume after we have freed the reported pages back into the tail of the list. Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running with 32G of RAM on one node of an E5-2630 v3. The host has had some features such as CPU turbo disabled in the BIOS.

Test                   page_fault1 (THP)    page_fault2
Name             tasks  Process Iter  STDEV  Process Iter  STDEV
Baseline             1    1012402.50  0.14%    361855.25  0.81%
                    16    8827457.25  0.09%   3282347.00  0.34%
Patches Applied      1    1007897.00  0.23%    361887.00  0.26%
                    16    8784741.75  0.39%   3240669.25  0.48%
Patches Enabled      1    1010227.50  0.39%    359749.25  0.56%
                    16    8756219.00  0.24%   3226608.75  0.97%
Patches Enabled      1    1050982.00  4.26%    357966.25  0.14%
 page shuffle       16    8672601.25  0.49%   3223177.75  0.40%
Patches enabled      1    1003238.00  0.22%    360211.00  0.22%
 shuffle w/ RFC     16    8767010.50  0.32%   3199874.00  0.71%

The results above are for a baseline with a linux-next-20191219 kernel, that kernel with this patch set applied but page reporting disabled in virtio-balloon, the patches applied and page reporting fully enabled, the patches enabled with page shuffling enabled, and the patches applied with page shuffling enabled and an RFC patch that makes use of MADV_FREE in QEMU. These results include the deviation seen between the average value reported here versus the high and/or low value. I observed that during the test memory usage for the first three tests never dropped whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to page fault tests. Any of the overhead visible with this patch set enabled seems due to page faults caused by accessing the reported pages and the host zeroing the page before giving it back to the guest. This overhead is much more visible when using THP than with standard 4K pages. In addition page shuffling seemed to increase the amount of faults generated due to an increase in memory churn.
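To make the per-pass budget above concrete, here is a toy user-space model of the arithmetic (one sixteenth of nr_free per pass, drained in scatterlist batches of 32 pages, the capacity the later patches use). The function names are illustrative only; this is a sketch of the bookkeeping, not the kernel's implementation in mm/page_reporting.c.

#include <stdio.h>

#define REPORTING_CAPACITY 32	/* pages per scatterlist batch */

/* Pages a single pass may process from one free list. */
static unsigned long pass_budget(unsigned long nr_free)
{
	return (nr_free + 15) / 16;	/* one sixteenth, rounded up */
}

/* Scatterlist batches needed to drain that budget. */
static unsigned long pass_batches(unsigned long nr_free)
{
	unsigned long budget = pass_budget(nr_free);

	return (budget + REPORTING_CAPACITY - 1) / REPORTING_CAPACITY;
}

int main(void)
{
	unsigned long nr_free = 4096;

	printf("nr_free=%lu budget=%lu batches=%lu\n",
	       nr_free, pass_budget(nr_free), pass_batches(nr_free));
	return 0;
}

For 4096 free pages this yields a 256-page budget drained in eight batches, matching the "at least one sixteenth" wording above.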
The overhead is reduced when using MADV_FREE as we can avoid the extra zeroing of the pages when they are reintroduced to the host, as can be seen when the RFC is applied with shuffling enabled. The overall guest size is kept fairly small to only a few GB while the test is running. If the host memory were oversubscribed this patch set should result in a performance improvement as swapping memory in the host can be avoided. A brief history on the background of free page reporting can be found at: https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/ This patch (of 9): Move the head/tail adding logic out of the shuffle code and into the __free_one_page function since ultimately that is where it is really needed anyway. By doing this we should be able to reduce the overhead and can consolidate all of the list addition bits in one spot. Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Reviewed-by: Dan Williams Acked-by: Mel Gorman Acked-by: David Hildenbrand Cc: Yang Zhang Cc: Pankaj Gupta Cc: Konrad Rzeszutek Wilk Cc: Nitesh Narayan Lal Cc: Rik van Riel Cc: Matthew Wilcox Cc: Luiz Capitulino Cc: Dave Hansen Cc: Wei Wang Cc: Andrea Arcangeli Cc: Paolo Bonzini Cc: Michal Hocko Cc: Vlastimil Babka Cc: Oscar Salvador Cc: Michael S. Tsirkin Cc: wei qi Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 12 ------- mm/page_alloc.c | 73 +++++++++++++++++++++++++----------------- mm/shuffle.c | 12 +++---- mm/shuffle.h | 6 ++++ 4 files changed, 55 insertions(+), 48 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e84d448988b6..c023d7968b14 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -116,18 +116,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar area->nr_free++; } -#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR -/* Used to preserve page allocation order entropy */ -void add_to_free_area_random(struct page *page, struct free_area *area, - int migratetype); -#else -static inline void add_to_free_area_random(struct page *page, - struct free_area *area, int migratetype) -{ - add_to_free_area(page, area, migratetype); -} -#endif - /* Used for pages which are on another list */ static inline void move_to_free_area(struct page *page, struct free_area *area, int migratetype) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5f76da8cd4e..f2b8cb8f995f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -864,6 +864,36 @@ compaction_capture(struct capture_control *capc, struct page *page, } #endif /* CONFIG_COMPACTION */ +/* + * If this is not the largest possible page, check if the buddy + * of the next-highest order is free. If it is, it's possible + * that pages are being freed that will coalesce soon.
In case, + * that is happening, add the free page to the tail of the list + * so it's less likely to be used soon and more likely to be merged + * as a higher order page + */ +static inline bool +buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, + struct page *page, unsigned int order) +{ + struct page *higher_page, *higher_buddy; + unsigned long combined_pfn; + + if (order >= MAX_ORDER - 2) + return false; + + if (!pfn_valid_within(buddy_pfn)) + return false; + + combined_pfn = buddy_pfn & pfn; + higher_page = page + (combined_pfn - pfn); + buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1); + higher_buddy = higher_page + (buddy_pfn - combined_pfn); + + return pfn_valid_within(buddy_pfn) && + page_is_buddy(higher_page, higher_buddy, order + 1); +} + /* * Freeing function for a buddy system allocator. * @@ -893,11 +923,13 @@ static inline void __free_one_page(struct page *page, struct zone *zone, unsigned int order, int migratetype) { - unsigned long combined_pfn; - unsigned long uninitialized_var(buddy_pfn); - struct page *buddy; - unsigned int max_order; struct capture_control *capc = task_capc(zone); + unsigned long uninitialized_var(buddy_pfn); + unsigned long combined_pfn; + struct free_area *area; + unsigned int max_order; + struct page *buddy; + bool to_tail; max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1); @@ -966,35 +998,16 @@ continue_merging: done_merging: set_page_order(page, order); - /* - * If this is not the largest possible page, check if the buddy - * of the next-highest order is free. If it is, it's possible - * that pages are being freed that will coalesce soon. In case, - * that is happening, add the free page to the tail of the list - * so it's less likely to be used soon and more likely to be merged - * as a higher order page - */ - if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn) - && !is_shuffle_order(order)) { - struct page *higher_page, *higher_buddy; - combined_pfn = buddy_pfn & pfn; - higher_page = page + (combined_pfn - pfn); - buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1); - higher_buddy = higher_page + (buddy_pfn - combined_pfn); - if (pfn_valid_within(buddy_pfn) && - page_is_buddy(higher_page, higher_buddy, order + 1)) { - add_to_free_area_tail(page, &zone->free_area[order], - migratetype); - return; - } - } - + area = &zone->free_area[order]; if (is_shuffle_order(order)) - add_to_free_area_random(page, &zone->free_area[order], - migratetype); + to_tail = shuffle_pick_tail(); else - add_to_free_area(page, &zone->free_area[order], migratetype); + to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); + if (to_tail) + add_to_free_area_tail(page, area, migratetype); + else + add_to_free_area(page, area, migratetype); } /* diff --git a/mm/shuffle.c b/mm/shuffle.c index c716059cbd3c..44406d9977c7 100644 --- a/mm/shuffle.c +++ b/mm/shuffle.c @@ -183,11 +183,11 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat) shuffle_zone(z); } -void add_to_free_area_random(struct page *page, struct free_area *area, - int migratetype) +bool shuffle_pick_tail(void) { static u64 rand; static u8 rand_bits; + bool ret; /* * The lack of locking is deliberate. 
If 2 threads race to @@ -198,10 +198,10 @@ void add_to_free_area_random(struct page *page, struct free_area *area, rand = get_random_u64(); } - if (rand & 1) - add_to_free_area(page, area, migratetype); - else - add_to_free_area_tail(page, area, migratetype); + ret = rand & 1; + rand_bits--; rand >>= 1; + + return ret; } diff --git a/mm/shuffle.h b/mm/shuffle.h index 777a257a0d2f..4d79f03b6658 100644 --- a/mm/shuffle.h +++ b/mm/shuffle.h @@ -22,6 +22,7 @@ enum mm_shuffle_ctl { DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key); extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl); extern void __shuffle_free_memory(pg_data_t *pgdat); +extern bool shuffle_pick_tail(void); static inline void shuffle_free_memory(pg_data_t *pgdat) { if (!static_branch_unlikely(&page_alloc_shuffle_key)) @@ -44,6 +45,11 @@ static inline bool is_shuffle_order(int order) return order >= SHUFFLE_ORDER; } #else +static inline bool shuffle_pick_tail(void) +{ + return false; +} + static inline void shuffle_free_memory(pg_data_t *pgdat) { } From 6ab0136310961ebf4b5ecb565f0bf52c233dc093 Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:04:49 -0700 Subject: [PATCH 024/158] mm: use zone and order instead of free area in free_list manipulators In order to enable the use of the zone from the list manipulator functions I will need access to the zone pointer. As it turns out most of the accessors were always just being directly passed &zone->free_area[order] anyway so it would make sense to just fold that into the function itself and pass the zone and order as arguments instead of the free area. In order to be able to reference the zone we need to move the declaration of the functions down so that we have the zone defined before we define the list manipulation functions. Since the functions are only used in the file mm/page_alloc.c we can just move them there to reduce noise in the header. Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Reviewed-by: Dan Williams Reviewed-by: David Hildenbrand Reviewed-by: Pankaj Gupta Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Dave Hansen Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Michael S. 
Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 32 -------------------- mm/page_alloc.c | 67 ++++++++++++++++++++++++++++++------------ 2 files changed, 49 insertions(+), 50 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c023d7968b14..42b77d3b68e8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -100,29 +100,6 @@ struct free_area { unsigned long nr_free; }; -/* Used for pages not on another list */ -static inline void add_to_free_area(struct page *page, struct free_area *area, - int migratetype) -{ - list_add(&page->lru, &area->free_list[migratetype]); - area->nr_free++; -} - -/* Used for pages not on another list */ -static inline void add_to_free_area_tail(struct page *page, struct free_area *area, - int migratetype) -{ - list_add_tail(&page->lru, &area->free_list[migratetype]); - area->nr_free++; -} - -/* Used for pages which are on another list */ -static inline void move_to_free_area(struct page *page, struct free_area *area, - int migratetype) -{ - list_move(&page->lru, &area->free_list[migratetype]); -} - static inline struct page *get_page_from_free_area(struct free_area *area, int migratetype) { @@ -130,15 +107,6 @@ static inline struct page *get_page_from_free_area(struct free_area *area, struct page, lru); } -static inline void del_page_from_free_area(struct page *page, - struct free_area *area) -{ - list_del(&page->lru); - __ClearPageBuddy(page); - set_page_private(page, 0); - area->nr_free--; -} - static inline bool free_area_empty(struct free_area *area, int migratetype) { return list_empty(&area->free_list[migratetype]); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f2b8cb8f995f..14bdf3608a6b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -864,6 +864,44 @@ compaction_capture(struct capture_control *capc, struct page *page, } #endif /* CONFIG_COMPACTION */ +/* Used for pages not on another list */ +static inline void add_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) +{ + struct free_area *area = &zone->free_area[order]; + + list_add(&page->lru, &area->free_list[migratetype]); + area->nr_free++; +} + +/* Used for pages not on another list */ +static inline void add_to_free_list_tail(struct page *page, struct zone *zone, + unsigned int order, int migratetype) +{ + struct free_area *area = &zone->free_area[order]; + + list_add_tail(&page->lru, &area->free_list[migratetype]); + area->nr_free++; +} + +/* Used for pages which are on another list */ +static inline void move_to_free_list(struct page *page, struct zone *zone, + unsigned int order, int migratetype) +{ + struct free_area *area = &zone->free_area[order]; + + list_move(&page->lru, &area->free_list[migratetype]); +} + +static inline void del_page_from_free_list(struct page *page, struct zone *zone, + unsigned int order) +{ + list_del(&page->lru); + __ClearPageBuddy(page); + set_page_private(page, 0); + zone->free_area[order].nr_free--; +} + /* * If this is not the largest possible page, check if the buddy * of the next-highest order is free. 
If it is, it's possible @@ -926,7 +964,6 @@ static inline void __free_one_page(struct page *page, struct capture_control *capc = task_capc(zone); unsigned long uninitialized_var(buddy_pfn); unsigned long combined_pfn; - struct free_area *area; unsigned int max_order; struct page *buddy; bool to_tail; @@ -964,7 +1001,7 @@ continue_merging: if (page_is_guard(buddy)) clear_page_guard(zone, buddy, order, migratetype); else - del_page_from_free_area(buddy, &zone->free_area[order]); + del_page_from_free_list(buddy, zone, order); combined_pfn = buddy_pfn & pfn; page = page + (combined_pfn - pfn); pfn = combined_pfn; @@ -998,16 +1035,15 @@ continue_merging: done_merging: set_page_order(page, order); - area = &zone->free_area[order]; if (is_shuffle_order(order)) to_tail = shuffle_pick_tail(); else to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order); if (to_tail) - add_to_free_area_tail(page, area, migratetype); + add_to_free_list_tail(page, zone, order, migratetype); else - add_to_free_area(page, area, migratetype); + add_to_free_list(page, zone, order, migratetype); } /* @@ -2021,13 +2057,11 @@ void __init init_cma_reserved_pageblock(struct page *page) * -- nyc */ static inline void expand(struct zone *zone, struct page *page, - int low, int high, struct free_area *area, - int migratetype) + int low, int high, int migratetype) { unsigned long size = 1 << high; while (high > low) { - area--; high--; size >>= 1; VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]); @@ -2041,7 +2075,7 @@ static inline void expand(struct zone *zone, struct page *page, if (set_page_guard(zone, &page[size], high, migratetype)) continue; - add_to_free_area(&page[size], area, migratetype); + add_to_free_list(&page[size], zone, high, migratetype); set_page_order(&page[size], high); } } @@ -2199,8 +2233,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_area(page, area); - expand(zone, page, order, current_order, area, migratetype); + del_page_from_free_list(page, zone, current_order); + expand(zone, page, order, current_order, migratetype); set_pcppage_migratetype(page, migratetype); return page; } @@ -2274,7 +2308,7 @@ static int move_freepages(struct zone *zone, VM_BUG_ON_PAGE(page_zone(page) != zone, page); order = page_order(page); - move_to_free_area(page, &zone->free_area[order], migratetype); + move_to_free_list(page, zone, order, migratetype); page += 1 << order; pages_moved += 1 << order; } @@ -2390,7 +2424,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, unsigned int alloc_flags, int start_type, bool whole_block) { unsigned int current_order = page_order(page); - struct free_area *area; int free_pages, movable_pages, alike_pages; int old_block_type; @@ -2461,8 +2494,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page, return; single_page: - area = &zone->free_area[current_order]; - move_to_free_area(page, area, start_type); + move_to_free_list(page, zone, current_order, start_type); } /* @@ -3133,7 +3165,6 @@ EXPORT_SYMBOL_GPL(split_page); int __isolate_free_page(struct page *page, unsigned int order) { - struct free_area *area = &page_zone(page)->free_area[order]; unsigned long watermark; struct zone *zone; int mt; @@ -3159,7 +3190,7 @@ int __isolate_free_page(struct page *page, unsigned int order) /* Remove page from free list */ - del_page_from_free_area(page, area); + del_page_from_free_list(page, zone, order); /* * Set the 
pageblock if the isolated page is at least half of a @@ -8726,7 +8757,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) BUG_ON(!PageBuddy(page)); order = page_order(page); offlined_pages += 1 << order; - del_page_from_free_area(page, &zone->free_area[order]); + del_page_from_free_list(page, zone, order); pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); From 624f58d8f4639676d2fa1238425ab0148d501c4a Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:04:53 -0700 Subject: [PATCH 025/158] mm: add function __putback_isolated_page There are cases where we would benefit from avoiding having to go through the allocation and free cycle to return an isolated page. Examples for this might include page poisoning in which we isolate a page and then put it back in the free list without ever having actually allocated it. This will enable us to also avoid notifiers for the future free page reporting which will need to avoid retriggering page reporting when returning pages that have been reported on. Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Acked-by: David Hildenbrand Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Michael S. Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- mm/internal.h | 2 ++ mm/page_alloc.c | 19 +++++++++++++++++++ mm/page_isolation.c | 6 ++---- 3 files changed, 23 insertions(+), 4 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 2d58ae15a958..b5634e78f01d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -180,6 +180,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn, } extern int __isolate_free_page(struct page *page, unsigned int order); +extern void __putback_isolated_page(struct page *page, unsigned int order, + int mt); extern void memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order); extern void __free_pages_core(struct page *page, unsigned int order); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 14bdf3608a6b..448e439b75f2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3211,6 +3211,25 @@ int __isolate_free_page(struct page *page, unsigned int order) return 1UL << order; } +/** + * __putback_isolated_page - Return a now-isolated page back where we got it + * @page: Page that was isolated + * @order: Order of the isolated page + * + * This function is meant to return a page pulled from the free lists via + * __isolate_free_page back to the free lists they were pulled from. + */ +void __putback_isolated_page(struct page *page, unsigned int order, int mt) +{ + struct zone *zone = page_zone(page); + + /* zone lock should be held when this function is called */ + lockdep_assert_held(&zone->lock); + + /* Return isolated page to tail of freelist. 
*/ __free_one_page(page, page_to_pfn(page), zone, order, mt); } /* * Update NUMA hit/miss statistics * diff --git a/mm/page_isolation.c b/mm/page_isolation.c index a9fd7c740c23..2c11a38d6e87 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -117,13 +117,11 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype) __mod_zone_freepage_state(zone, nr_pages, migratetype); } set_pageblock_migratetype(page, migratetype); + if (isolated_page) + __putback_isolated_page(page, order, migratetype); zone->nr_isolate_pageblock--; out: spin_unlock_irqrestore(&zone->lock, flags); - if (isolated_page) { - post_alloc_hook(page, order, __GFP_MOVABLE); - __free_pages(page, order); - } } static inline struct page * From 36e66c554b5c6a9d17a229faca7a61693527b0bd Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:04:56 -0700 Subject: [PATCH 026/158] mm: introduce Reported pages In order to pave the way for free page reporting in virtualized environments we will need a way to get pages out of the free lists and identify those pages after they have been returned. To accomplish this, this patch adds the concept of a Reported Buddy, which is essentially meant to just be the Uptodate flag used in conjunction with the Buddy page type. To prevent the reported pages from leaking outside of the buddy lists I added a check to clear the PageReported bit in the del_page_from_free_list function. As a result any reported page that is split, merged, or allocated will have the flag cleared prior to the PageBuddy value being cleared. The process for reporting pages is fairly simple. Once we free a page that meets the minimum order for page reporting we will schedule a worker thread to start 2s or more in the future. That worker thread will begin working from the lowest supported page reporting order up to MAX_ORDER - 1 pulling unreported pages from the free list and storing them in the scatterlist. When processing each individual free list it is necessary for the worker thread to release the zone lock when it needs to stop and report the full scatterlist of pages. To reduce the work of the next iteration the worker thread will rotate the free list so that the first unreported page in the free list becomes the first entry in the list. It will then call a reporting function providing information on how many entries are in the scatterlist. Once the function completes it will return the pages to the free area from which they were allocated and start over pulling more pages from the free areas until there are no longer enough pages to report on to keep the worker busy, or we have processed as many pages as were contained in the free area when we started processing the list. The worker thread will work in a round-robin fashion making its way through each zone requesting reporting, and through each reportable free list within that zone. Once all free areas within the zone have been processed it will check to see if there have been any requests for reporting while it was processing. If so it will reschedule the worker thread to start up again in roughly 2s and exit. Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Michael S.
Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- include/linux/page-flags.h | 11 ++ include/linux/page_reporting.h | 25 +++ mm/Kconfig | 11 ++ mm/Makefile | 1 + mm/page_alloc.c | 17 +- mm/page_reporting.c | 319 +++++++++++++++++++++++++++++++++ mm/page_reporting.h | 54 ++++++ 7 files changed, 434 insertions(+), 4 deletions(-) create mode 100644 include/linux/page_reporting.h create mode 100644 mm/page_reporting.c create mode 100644 mm/page_reporting.h diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index acf7988fd640..222f6f7b2bb3 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -168,6 +168,9 @@ enum pageflags { /* non-lru isolated movable page */ PG_isolated = PG_reclaim, + + /* Only valid for buddy pages. Used to track pages that are reported */ + PG_reported = PG_uptodate, }; #ifndef __GENERATING_BOUNDS_H @@ -436,6 +439,14 @@ TESTCLEARFLAG(Young, young, PF_ANY) PAGEFLAG(Idle, idle, PF_ANY) #endif +/* + * PageReported() is used to track reported free pages within the Buddy + * allocator. We can use the non-atomic version of the test and set + * operations as both should be shielded with the zone lock to prevent + * any possible races on the setting or clearing of the bit. + */ +__PAGEFLAG(Reported, reported, PF_NO_COMPOUND) + /* * On an anonymous page mapped into a user virtual memory area, * page->mapping points to its anon_vma, not to a struct address_space; diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h new file mode 100644 index 000000000000..32355486f572 --- /dev/null +++ b/include/linux/page_reporting.h @@ -0,0 +1,25 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PAGE_REPORTING_H +#define _LINUX_PAGE_REPORTING_H + +#include +#include + +#define PAGE_REPORTING_CAPACITY 32 + +struct page_reporting_dev_info { + /* function that alters pages to make them "reported" */ + int (*report)(struct page_reporting_dev_info *prdev, + struct scatterlist *sg, unsigned int nents); + + /* work struct for processing reports */ + struct delayed_work work; + + /* Current state of page reporting */ + atomic_t state; +}; + +/* Tear-down and bring-up for page reporting devices */ +void page_reporting_unregister(struct page_reporting_dev_info *prdev); +int page_reporting_register(struct page_reporting_dev_info *prdev); +#endif /*_LINUX_PAGE_REPORTING_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 211a70e8d5cf..d286dc54458c 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -236,6 +236,17 @@ config COMPACTION it and then we would be really interested to hear about that at linux-mm@kvack.org. +# +# support for free page reporting +config PAGE_REPORTING + bool "Free page reporting" + def_bool n + help + Free page reporting allows for the incremental acquisition of + free pages from the buddy allocator for the purpose of reporting + those pages to another entity, such as a hypervisor, so that the + memory can be freed within the host for other uses. 
+ # # support for page migration # diff --git a/mm/Makefile b/mm/Makefile index dbc8346d16ca..fccd3756b25f 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -111,3 +111,4 @@ obj-$(CONFIG_HMM_MIRROR) += hmm.o obj-$(CONFIG_MEMFD_CREATE) += memfd.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o +obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 448e439b75f2..114c56c3685d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -74,6 +74,7 @@ #include #include "internal.h" #include "shuffle.h" +#include "page_reporting.h" /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */ static DEFINE_MUTEX(pcp_batch_high_lock); @@ -896,6 +897,10 @@ static inline void move_to_free_list(struct page *page, struct zone *zone, static inline void del_page_from_free_list(struct page *page, struct zone *zone, unsigned int order) { + /* clear reported state and update reported page count */ + if (page_reported(page)) + __ClearPageReported(page); + list_del(&page->lru); __ClearPageBuddy(page); set_page_private(page, 0); @@ -959,7 +964,7 @@ buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn, static inline void __free_one_page(struct page *page, unsigned long pfn, struct zone *zone, unsigned int order, - int migratetype) + int migratetype, bool report) { struct capture_control *capc = task_capc(zone); unsigned long uninitialized_var(buddy_pfn); @@ -1044,6 +1049,10 @@ done_merging: add_to_free_list_tail(page, zone, order, migratetype); else add_to_free_list(page, zone, order, migratetype); + + /* Notify page reporting subsystem of freed page */ + if (report) + page_reporting_notify_free(order); } /* @@ -1360,7 +1369,7 @@ static void free_pcppages_bulk(struct zone *zone, int count, if (unlikely(isolated_pageblocks)) mt = get_pageblock_migratetype(page); - __free_one_page(page, page_to_pfn(page), zone, 0, mt); + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true); trace_mm_page_pcpu_drain(page, 0, mt); } spin_unlock(&zone->lock); @@ -1376,7 +1385,7 @@ static void free_one_page(struct zone *zone, is_migrate_isolate(migratetype))) { migratetype = get_pfnblock_migratetype(page, pfn); } - __free_one_page(page, pfn, zone, order, migratetype); + __free_one_page(page, pfn, zone, order, migratetype, true); spin_unlock(&zone->lock); } @@ -3227,7 +3236,7 @@ void __putback_isolated_page(struct page *page, unsigned int order, int mt) lockdep_assert_held(&zone->lock); /* Return isolated page to tail of freelist. 
*/ - __free_one_page(page, page_to_pfn(page), zone, order, mt); + __free_one_page(page, page_to_pfn(page), zone, order, mt, false); } /* diff --git a/mm/page_reporting.c b/mm/page_reporting.c new file mode 100644 index 000000000000..1047c6872d4f --- /dev/null +++ b/mm/page_reporting.c @@ -0,0 +1,319 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include + +#include "page_reporting.h" +#include "internal.h" + +#define PAGE_REPORTING_DELAY (2 * HZ) +static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly; + +enum { + PAGE_REPORTING_IDLE = 0, + PAGE_REPORTING_REQUESTED, + PAGE_REPORTING_ACTIVE +}; + +/* request page reporting */ +static void +__page_reporting_request(struct page_reporting_dev_info *prdev) +{ + unsigned int state; + + /* Check to see if we are in desired state */ + state = atomic_read(&prdev->state); + if (state == PAGE_REPORTING_REQUESTED) + return; + + /* + * If reporting is already active there is nothing we need to do. + * Test against 0 as that represents PAGE_REPORTING_IDLE. + */ + state = atomic_xchg(&prdev->state, PAGE_REPORTING_REQUESTED); + if (state != PAGE_REPORTING_IDLE) + return; + + /* + * Delay the start of work to allow a sizable queue to build. For + * now we are limiting this to running no more than once every + * couple of seconds. + */ + schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY); +} + +/* notify prdev of free page reporting request */ +void __page_reporting_notify(void) +{ + struct page_reporting_dev_info *prdev; + + /* + * We use RCU to protect the pr_dev_info pointer. In almost all + * cases this should be present, however in the unlikely case of + * a shutdown this will be NULL and we should exit. + */ + rcu_read_lock(); + prdev = rcu_dereference(pr_dev_info); + if (likely(prdev)) + __page_reporting_request(prdev); + + rcu_read_unlock(); +} + +static void +page_reporting_drain(struct page_reporting_dev_info *prdev, + struct scatterlist *sgl, unsigned int nents, bool reported) +{ + struct scatterlist *sg = sgl; + + /* + * Drain the now reported pages back into their respective + * free lists/areas. We assume at least one page is populated. + */ + do { + struct page *page = sg_page(sg); + int mt = get_pageblock_migratetype(page); + unsigned int order = get_order(sg->length); + + __putback_isolated_page(page, order, mt); + + /* If the pages were not reported due to error skip flagging */ + if (!reported) + continue; + + /* + * If page was not comingled with another page we can + * consider the result to be "reported" since the page + * hasn't been modified, otherwise we will need to + * report on the new larger page when we make our way + * up to that higher order. + */ + if (PageBuddy(page) && page_order(page) == order) + __SetPageReported(page); + } while ((sg = sg_next(sg))); + + /* reinitialize scatterlist now that it is empty */ + sg_init_table(sgl, nents); +} + +/* + * The page reporting cycle consists of 4 stages, fill, report, drain, and + * idle. We will cycle through the first 3 stages until we cannot obtain a + * full scatterlist of pages, in that case we will switch to idle. 
+ */ +static int +page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, + unsigned int order, unsigned int mt, + struct scatterlist *sgl, unsigned int *offset) +{ + struct free_area *area = &zone->free_area[order]; + struct list_head *list = &area->free_list[mt]; + unsigned int page_len = PAGE_SIZE << order; + struct page *page, *next; + int err = 0; + + /* + * Perform early check, if free area is empty there is + * nothing to process so we can skip this free_list. + */ + if (list_empty(list)) + return err; + + spin_lock_irq(&zone->lock); + + /* loop through free list adding unreported pages to sg list */ + list_for_each_entry_safe(page, next, list, lru) { + /* We are going to skip over the reported pages. */ + if (PageReported(page)) + continue; + + /* Attempt to pull page from list */ + if (!__isolate_free_page(page, order)) + break; + + /* Add page to scatter list */ + --(*offset); + sg_set_page(&sgl[*offset], page, page_len, 0); + + /* If scatterlist isn't full grab more pages */ + if (*offset) + continue; + + /* release lock before waiting on report processing */ + spin_unlock_irq(&zone->lock); + + /* begin processing pages in local list */ + err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY); + + /* reset offset since the full list was reported */ + *offset = PAGE_REPORTING_CAPACITY; + + /* reacquire zone lock and resume processing */ + spin_lock_irq(&zone->lock); + + /* flush reported pages from the sg list */ + page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err); + + /* + * Reset next to first entry, the old next isn't valid + * since we dropped the lock to report the pages + */ + next = list_first_entry(list, struct page, lru); + + /* exit on error */ + if (err) + break; + } + + spin_unlock_irq(&zone->lock); + + return err; +} + +static int +page_reporting_process_zone(struct page_reporting_dev_info *prdev, + struct scatterlist *sgl, struct zone *zone) +{ + unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY; + unsigned long watermark; + int err = 0; + + /* Generate minimum watermark to be able to guarantee progress */ + watermark = low_wmark_pages(zone) + + (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER); + + /* + * Cancel request if insufficient free memory or if we failed + * to allocate page reporting statistics for the zone. 
+ */ + if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA)) + return err; + + /* Process each free list starting from lowest order/mt */ + for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) { + for (mt = 0; mt < MIGRATE_TYPES; mt++) { + /* We do not pull pages from the isolate free list */ + if (is_migrate_isolate(mt)) + continue; + + err = page_reporting_cycle(prdev, zone, order, mt, + sgl, &offset); + if (err) + return err; + } + } + + /* report the leftover pages before going idle */ + leftover = PAGE_REPORTING_CAPACITY - offset; + if (leftover) { + sgl = &sgl[offset]; + err = prdev->report(prdev, sgl, leftover); + + /* flush any remaining pages out from the last report */ + spin_lock_irq(&zone->lock); + page_reporting_drain(prdev, sgl, leftover, !err); + spin_unlock_irq(&zone->lock); + } + + return err; +} + +static void page_reporting_process(struct work_struct *work) +{ + struct delayed_work *d_work = to_delayed_work(work); + struct page_reporting_dev_info *prdev = + container_of(d_work, struct page_reporting_dev_info, work); + int err = 0, state = PAGE_REPORTING_ACTIVE; + struct scatterlist *sgl; + struct zone *zone; + + /* + * Change the state to "Active" so that we can track if there is + * anyone requests page reporting after we complete our pass. If + * the state is not altered by the end of the pass we will switch + * to idle and quit scheduling reporting runs. + */ + atomic_set(&prdev->state, state); + + /* allocate scatterlist to store pages being reported on */ + sgl = kmalloc_array(PAGE_REPORTING_CAPACITY, sizeof(*sgl), GFP_KERNEL); + if (!sgl) + goto err_out; + + sg_init_table(sgl, PAGE_REPORTING_CAPACITY); + + for_each_zone(zone) { + err = page_reporting_process_zone(prdev, sgl, zone); + if (err) + break; + } + + kfree(sgl); +err_out: + /* + * If the state has reverted back to requested then there may be + * additional pages to be processed. We will defer for 2s to allow + * more pages to accumulate. 
+ */ + state = atomic_cmpxchg(&prdev->state, state, PAGE_REPORTING_IDLE); + if (state == PAGE_REPORTING_REQUESTED) + schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY); +} + +static DEFINE_MUTEX(page_reporting_mutex); +DEFINE_STATIC_KEY_FALSE(page_reporting_enabled); + +int page_reporting_register(struct page_reporting_dev_info *prdev) +{ + int err = 0; + + mutex_lock(&page_reporting_mutex); + + /* nothing to do if already in use */ + if (rcu_access_pointer(pr_dev_info)) { + err = -EBUSY; + goto err_out; + } + + /* initialize state and work structures */ + atomic_set(&prdev->state, PAGE_REPORTING_IDLE); + INIT_DELAYED_WORK(&prdev->work, &page_reporting_process); + + /* Begin initial flush of zones */ + __page_reporting_request(prdev); + + /* Assign device to allow notifications */ + rcu_assign_pointer(pr_dev_info, prdev); + + /* enable page reporting notification */ + if (!static_key_enabled(&page_reporting_enabled)) { + static_branch_enable(&page_reporting_enabled); + pr_info("Free page reporting enabled\n"); + } +err_out: + mutex_unlock(&page_reporting_mutex); + + return err; +} +EXPORT_SYMBOL_GPL(page_reporting_register); + +void page_reporting_unregister(struct page_reporting_dev_info *prdev) +{ + mutex_lock(&page_reporting_mutex); + + if (rcu_access_pointer(pr_dev_info) == prdev) { + /* Disable page reporting notification */ + RCU_INIT_POINTER(pr_dev_info, NULL); + synchronize_rcu(); + + /* Flush any existing work, and lock it out */ + cancel_delayed_work_sync(&prdev->work); + } + + mutex_unlock(&page_reporting_mutex); +} +EXPORT_SYMBOL_GPL(page_reporting_unregister); diff --git a/mm/page_reporting.h b/mm/page_reporting.h new file mode 100644 index 000000000000..aa6d37f4dc22 --- /dev/null +++ b/mm/page_reporting.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _MM_PAGE_REPORTING_H +#define _MM_PAGE_REPORTING_H + +#include +#include +#include +#include +#include +#include +#include + +#define PAGE_REPORTING_MIN_ORDER pageblock_order + +#ifdef CONFIG_PAGE_REPORTING +DECLARE_STATIC_KEY_FALSE(page_reporting_enabled); +void __page_reporting_notify(void); + +static inline bool page_reported(struct page *page) +{ + return static_branch_unlikely(&page_reporting_enabled) && + PageReported(page); +} + +/** + * page_reporting_notify_free - Free page notification to start page processing + * + * This function is meant to act as a screener for __page_reporting_notify + * which will determine if a give zone has crossed over the high-water mark + * that will justify us beginning page treatment. If we have crossed that + * threshold then it will start the process of pulling some pages and + * placing them in the batch list for treatment. 
+ */ +static inline void page_reporting_notify_free(unsigned int order) +{ + /* Called from hot path in __free_one_page() */ + if (!static_branch_unlikely(&page_reporting_enabled)) + return; + + /* Determine if we have crossed reporting threshold */ + if (order < PAGE_REPORTING_MIN_ORDER) + return; + + /* This will add a few cycles, but should be called infrequently */ + __page_reporting_notify(); +} +#else /* CONFIG_PAGE_REPORTING */ +#define page_reported(_page) false + +static inline void page_reporting_notify_free(unsigned int order) +{ +} +#endif /* CONFIG_PAGE_REPORTING */ +#endif /*_MM_PAGE_REPORTING_H */ From d74b78fabe043158e26196291d2e439b487395bd Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:05:01 -0700 Subject: [PATCH 027/158] virtio-balloon: pull page poisoning config out of free page hinting Currently the page poisoning setting wasn't being enabled unless free page hinting was enabled. However we will need the page poisoning tracking logic as well for free page reporting. As such pull it out and make it a separate bit of config in the probe function. In addition we need to add support for the more recent init_on_free feature which expects a behavior similar to page poisoning in that we expect the page to be pre-zeroed. Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michael S. Tsirkin Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224646.29318.695.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- drivers/virtio/virtio_balloon.c | 23 +++++++++++++++++------ 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 341458fd95ca..e41a62d84260 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -864,7 +864,6 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb) static int virtballoon_probe(struct virtio_device *vdev) { struct virtio_balloon *vb; - __u32 poison_val; int err; if (!vdev->config->get) { @@ -930,11 +929,20 @@ static int virtballoon_probe(struct virtio_device *vdev) VIRTIO_BALLOON_CMD_ID_STOP); spin_lock_init(&vb->free_page_list_lock); INIT_LIST_HEAD(&vb->free_page_list); - if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) { + } + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) { + /* Start with poison val of 0 representing general init */ + __u32 poison_val = 0; + + /* + * Let the hypervisor know that we are expecting a + * specific value to be written back in balloon pages. + */ + if (!want_init_on_free()) memset(&poison_val, PAGE_POISON, sizeof(poison_val)); - virtio_cwrite(vb->vdev, struct virtio_balloon_config, - poison_val, &poison_val); - } + + virtio_cwrite(vb->vdev, struct virtio_balloon_config, + poison_val, &poison_val); } /* * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a @@ -1045,7 +1053,10 @@ static int virtballoon_restore(struct virtio_device *vdev) static int virtballoon_validate(struct virtio_device *vdev) { - if (!page_poisoning_enabled()) + /* Tell the host whether we care about poisoned pages. 
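+	 * Specifically: the feature bit is kept when init_on_free is
+	 * active (freed pages are expected to be zeroed) or when page
+	 * poisoning is enabled with sanity checking; otherwise the guest
+	 * never inspects freed page contents and the bit is cleared
+	 * below.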
*/ + if (!want_init_on_free() && + (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) || + !page_poisoning_enabled())) __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON); __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);

From b0c504f154718904ae49349147e3b7e6ae91ffdc Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:05:05 -0700 Subject: [PATCH 028/158] virtio-balloon: add support for providing free page reports to host

Add support for the page reporting feature provided by virtio-balloon. Reporting differs from the regular balloon functionality in that it is much less durable than a standard memory balloon. Instead of creating a list of pages that cannot be accessed, the pages are only inaccessible while they are being indicated to the virtio interface. Once the interface has acknowledged them they are placed back into their respective free lists and are once again accessible by the guest system.

Unlike a standard balloon, we don't inflate and deflate the pages. Instead we perform the reporting, and once the reporting is completed it is assumed that the page has been dropped from the guest and will be faulted back in the next time the page is accessed.

Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michael S. Tsirkin Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- drivers/virtio/Kconfig | 1 + drivers/virtio/virtio_balloon.c | 64 +++++++++++++++++++++++++++++ include/uapi/linux/virtio_balloon.h | 1 + 3 files changed, 66 insertions(+) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index 078615cf2afc..4b2dd8259ff5 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -58,6 +58,7 @@ config VIRTIO_BALLOON tristate "Virtio balloon driver" depends on VIRTIO select MEMORY_BALLOON + select PAGE_REPORTING ---help--- This driver supports increasing and decreasing the amount of memory within a KVM guest. diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index e41a62d84260..9e0177529bad 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -19,6 +19,7 @@ #include #include #include +#include <linux/page_reporting.h> /* * Balloon device works in 4K page units.
So each page is pointed to by @@ -47,6 +48,7 @@ enum virtio_balloon_vq { VIRTIO_BALLOON_VQ_DEFLATE, VIRTIO_BALLOON_VQ_STATS, VIRTIO_BALLOON_VQ_FREE_PAGE, + VIRTIO_BALLOON_VQ_REPORTING, VIRTIO_BALLOON_VQ_MAX }; @@ -114,6 +116,10 @@ struct virtio_balloon { /* To register a shrinker to shrink memory upon memory pressure */ struct shrinker shrinker; + + /* Free page reporting device */ + struct virtqueue *reporting_vq; + struct page_reporting_dev_info pr_dev_info; }; static struct virtio_device_id id_table[] = { @@ -153,6 +159,33 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq) } +int virtballoon_free_page_report(struct page_reporting_dev_info *pr_dev_info, + struct scatterlist *sg, unsigned int nents) +{ + struct virtio_balloon *vb = + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info); + struct virtqueue *vq = vb->reporting_vq; + unsigned int unused, err; + + /* We should always be able to add these buffers to an empty queue. */ + err = virtqueue_add_inbuf(vq, sg, nents, vb, GFP_NOWAIT | __GFP_NOWARN); + + /* + * In the extremely unlikely case that something has occurred and we + * are able to trigger an error we will simply display a warning + * and exit without actually processing the pages. + */ + if (WARN_ON_ONCE(err)) + return err; + + virtqueue_kick(vq); + + /* When host has read buffer, this completes via balloon_ack */ + wait_event(vb->acked, virtqueue_get_buf(vq, &unused)); + + return 0; +} + static void set_page_pfns(struct virtio_balloon *vb, __virtio32 pfns[], struct page *page) { @@ -481,6 +514,7 @@ static int init_vqs(struct virtio_balloon *vb) names[VIRTIO_BALLOON_VQ_STATS] = NULL; callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL; names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL; + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL; if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) { names[VIRTIO_BALLOON_VQ_STATS] = "stats"; @@ -492,6 +526,11 @@ static int init_vqs(struct virtio_balloon *vb) callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL; } + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) { + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq"; + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack; + } + err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX, vqs, callbacks, names, NULL, NULL); if (err) @@ -524,6 +563,9 @@ static int init_vqs(struct virtio_balloon *vb) if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE]; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING]; + return 0; } @@ -953,12 +995,31 @@ static int virtballoon_probe(struct virtio_device *vdev) if (err) goto out_del_balloon_wq; } + + vb->pr_dev_info.report = virtballoon_free_page_report; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) { + unsigned int capacity; + + capacity = virtqueue_get_vring_size(vb->reporting_vq); + if (capacity < PAGE_REPORTING_CAPACITY) { + err = -ENOSPC; + goto out_unregister_shrinker; + } + + err = page_reporting_register(&vb->pr_dev_info); + if (err) + goto out_unregister_shrinker; + } + virtio_device_ready(vdev); if (towards_target(vb)) virtballoon_changed(vdev); return 0; +out_unregister_shrinker: + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) + virtio_balloon_unregister_shrinker(vb); out_del_balloon_wq: if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) destroy_workqueue(vb->balloon_wq); @@ -997,6 +1058,8 @@ static void virtballoon_remove(struct 
virtio_device *vdev) { struct virtio_balloon *vb = vdev->priv; + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) + page_reporting_unregister(&vb->pr_dev_info); if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) virtio_balloon_unregister_shrinker(vb); spin_lock_irq(&vb->stop_update_lock); @@ -1069,6 +1132,7 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_DEFLATE_ON_OOM, VIRTIO_BALLOON_F_FREE_PAGE_HINT, VIRTIO_BALLOON_F_PAGE_POISON, + VIRTIO_BALLOON_F_REPORTING, }; static struct virtio_driver virtio_balloon_driver = { diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index a1966cd7b677..19974392d324 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -36,6 +36,7 @@ #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */ #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */ #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */ +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */ /* Size of a PFN in the balloon interface. */ #define VIRTIO_BALLOON_PFN_SHIFT 12 From 02cf8719b8cb2b474b37bcbeee4706950b3d1d3d Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:05:10 -0700 Subject: [PATCH 029/158] mm/page_reporting: rotate reported pages to the tail of the list Rather than walking over the same pages again and again to get to the pages that have yet to be reported we can save ourselves a significant amount of time by simply rotating the list so that when we have a full list of reported pages the head of the list is pointing to the next non-reported page. Doing this should save us some significant time when processing each free list. This doesn't gain us much in the standard case as all of the non-reported pages should be near the top of the list already. However in the case of page shuffling this results in a noticeable improvement. Below are the will-it-scale page_fault1 w/ THP numbers for 16 tasks with and without this patch. Without: tasks processes processes_idle threads threads_idle 16 8093776.25 0.17 5393242.00 38.20 With: tasks processes processes_idle threads threads_idle 16 8283274.75 0.17 5594261.00 38.15 Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Michael S. 
Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224708.29318.16862.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- mm/page_reporting.c | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/mm/page_reporting.c b/mm/page_reporting.c index 1047c6872d4f..6885e74c2367 100644 --- a/mm/page_reporting.c +++ b/mm/page_reporting.c @@ -131,17 +131,27 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, if (PageReported(page)) continue; - /* Attempt to pull page from list */ - if (!__isolate_free_page(page, order)) - break; + /* Attempt to pull page from list and place in scatterlist */ + if (*offset) { + if (!__isolate_free_page(page, order)) { + next = page; + break; + } - /* Add page to scatter list */ - --(*offset); - sg_set_page(&sgl[*offset], page, page_len, 0); + /* Add page to scatter list */ + --(*offset); + sg_set_page(&sgl[*offset], page, page_len, 0); - /* If scatterlist isn't full grab more pages */ - if (*offset) continue; + } + + /* + * Make the first non-processed page in the free list + * the new head of the free list before we release the + * zone lock. + */ + if (&page->lru != list && !list_is_first(&page->lru, list)) + list_rotate_to_front(&page->lru, list); /* release lock before waiting on report processing */ spin_unlock_irq(&zone->lock); @@ -169,6 +179,10 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, break; } + /* Rotate any leftover pages to the head of the freelist */ + if (&next->lru != list && !list_is_first(&next->lru, list)) + list_rotate_to_front(&next->lru, list); + spin_unlock_irq(&zone->lock); return err;

From 43b76f298f023d32273c2b9c25dd83ae02711019 Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:05:14 -0700 Subject: [PATCH 030/158] mm/page_reporting: add budget limit on how many pages can be reported per pass

In order to keep ourselves from reporting pages that are just going to be reused again in the case of heavy churn, we can put a limit on how many total pages we will process per pass. Doing this will allow the worker thread to go into idle much more quickly so that we avoid competing with other threads that might be allocating or freeing pages.

The logic added here will limit the worker thread to no more than one sixteenth of the total free pages in a given area per list. Once that limit is reached it will update the state so that at the end of the pass we will reschedule the worker to try again in 2 seconds when the memory churn has hopefully settled down.

Again, this optimization doesn't show much of a benefit in the standard case as the memory churn is minimal. However with page allocator shuffling enabled the gain is quite noticeable. Below are the results with a THP enabled version of the will-it-scale page_fault1 test showing the improvement in iterations for 16 processes or threads.

Without: tasks processes processes_idle threads threads_idle 16 8283274.75 0.17 5594261.00 38.15

With: tasks processes processes_idle threads threads_idle 16 8767010.50 0.21 5791312.75 36.98

Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Acked-by: Mel Gorman Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Michael S.
Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- include/linux/page_reporting.h | 1 + mm/page_reporting.c | 33 ++++++++++++++++++++++++++++++++- 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h index 32355486f572..3b99e0ec24f2 100644 --- a/include/linux/page_reporting.h +++ b/include/linux/page_reporting.h @@ -5,6 +5,7 @@ #include #include +/* This value should always be a power of 2, see page_reporting_cycle() */ #define PAGE_REPORTING_CAPACITY 32 struct page_reporting_dev_info { diff --git a/mm/page_reporting.c b/mm/page_reporting.c index 6885e74c2367..3bbd471cfc81 100644 --- a/mm/page_reporting.c +++ b/mm/page_reporting.c @@ -114,6 +114,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, struct list_head *list = &area->free_list[mt]; unsigned int page_len = PAGE_SIZE << order; struct page *page, *next; + long budget; int err = 0; /* @@ -125,12 +126,39 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, spin_lock_irq(&zone->lock); + /* + * Limit how many calls we will be making to the page reporting + * device for this list. By doing this we avoid processing any + * given list for too long. + * + * The current value used allows us enough calls to process over a + * sixteenth of the current list plus one additional call to handle + * any pages that may have already been present from the previous + * list processed. This should result in us reporting all pages on + * an idle system in about 30 seconds. + * + * The division here should be cheap since PAGE_REPORTING_CAPACITY + * should always be a power of 2. + */ + budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16); + /* loop through free list adding unreported pages to sg list */ list_for_each_entry_safe(page, next, list, lru) { /* We are going to skip over the reported pages. */ if (PageReported(page)) continue; + /* + * If we fully consumed our budget then update our + * state to indicate that we are requesting additional + * processing and exit this list. + */ + if (budget < 0) { + atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED); + next = page; + break; + } + /* Attempt to pull page from list and place in scatterlist */ if (*offset) { if (!__isolate_free_page(page, order)) { @@ -146,7 +174,7 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, } /* - * Make the first non-processed page in the free list + * Make the first non-reported page in the free list * the new head of the free list before we release the * zone lock. */ @@ -162,6 +190,9 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone, /* reset offset since the full list was reported */ *offset = PAGE_REPORTING_CAPACITY; + /* update budget to reflect call to report function */ + budget--; + /* reacquire zone lock and resume processing */ spin_lock_irq(&zone->lock); From 1edca85e768a0819c1bed6dc0adcfa766e28e2cd Mon Sep 17 00:00:00 2001 From: Alexander Duyck Date: Mon, 6 Apr 2020 20:05:18 -0700 Subject: [PATCH 031/158] mm/page_reporting: add free page reporting documentation Add documentation for free page reporting. 
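As a rough sketch of the interface being documented here (a hypothetical consumer, not in-tree code; the "demo" names are invented for illustration):

	static int demo_report(struct page_reporting_dev_info *prdev,
			       struct scatterlist *sgl, unsigned int nents)
	{
		struct scatterlist *sg;
		int i;

		/* Hand each chunk of free memory to the backing device. */
		for_each_sg(sgl, sg, nents, i) {
			phys_addr_t phys = sg_phys(sg);

			pr_debug("reporting %u bytes at %pa\n",
				 sg->length, &phys);
		}

		/* Returning 0 lets the pages go back to the free lists. */
		return 0;
	}

	static struct page_reporting_dev_info demo_prdev = {
		.report = demo_report,
	};

	/* in the driver's probe path */
	err = page_reporting_register(&demo_prdev);

	/* ... and before the driver goes away */
	page_reporting_unregister(&demo_prdev);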
Currently the only consumer is virtio-balloon; however, it is possible that other drivers might make use of this, so it is best to add a bit of documentation explaining at a high level how to use the API.

Signed-off-by: Alexander Duyck Signed-off-by: Andrew Morton Cc: Andrea Arcangeli Cc: Dan Williams Cc: Dave Hansen Cc: David Hildenbrand Cc: Konrad Rzeszutek Wilk Cc: Luiz Capitulino Cc: Matthew Wilcox Cc: Mel Gorman Cc: Michael S. Tsirkin Cc: Michal Hocko Cc: Nitesh Narayan Lal Cc: Oscar Salvador Cc: Pankaj Gupta Cc: Paolo Bonzini Cc: Rik van Riel Cc: Vlastimil Babka Cc: Wei Wang Cc: Yang Zhang Cc: wei qi Link: http://lkml.kernel.org/r/20200211224730.29318.43815.stgit@localhost.localdomain Signed-off-by: Linus Torvalds --- Documentation/vm/free_page_reporting.rst | 40 ++++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 Documentation/vm/free_page_reporting.rst diff --git a/Documentation/vm/free_page_reporting.rst b/Documentation/vm/free_page_reporting.rst new file mode 100644 index 000000000000..8c05e62d8b2b --- /dev/null +++ b/Documentation/vm/free_page_reporting.rst @@ -0,0 +1,40 @@ +.. _free_page_reporting: + +===================== +Free Page Reporting +===================== + +Free page reporting is an API by which a device can register to receive +lists of pages that are currently unused by the system. This is useful in +the case of virtualization where a guest is then able to use this data to +notify the hypervisor that it is no longer using certain pages in memory.

+For the driver, typically a balloon driver, to make use of this +functionality it will allocate and initialize a page_reporting_dev_info +structure. The field within the structure that it will populate is the +"report" function pointer used to process the scatterlist. It must also +guarantee that it can handle at least PAGE_REPORTING_CAPACITY worth of +scatterlist entries per call to the function. A call to +page_reporting_register will register the page reporting interface with +the reporting framework, assuming no other page reporting devices are +already registered.

+Once registered, the page reporting API will begin reporting batches of +pages to the driver. The API will start reporting pages 2 seconds after +the interface is registered and will continue to do so 2 seconds after any +page of a sufficiently high order is freed.

+Pages reported will be stored in the scatterlist passed to the reporting +function, with the final entry having the end bit set in entry nent - 1. +While pages are being processed by the report function they will not be +accessible to the allocator. Once the report function has been completed, +the pages will be returned to the free area from which they were obtained.

+Prior to removing a driver that is making use of free page reporting, it +is necessary to call page_reporting_unregister to have the +page_reporting_dev_info structure that is currently in use by free page +reporting removed. Doing this will prevent further reports from being +issued via the interface. If another driver or the same driver is +registered, it is possible for it to resume where the previous driver had +left off in terms of reporting free pages.
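+
+Note that only pages of at least PAGE_REPORTING_MIN_ORDER, which this
+series ties to pageblock_order, are candidates for reporting; lower
+order frees never trigger a report.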
+ +Alexander Duyck, Dec 04, 2019 From da10329cb057e49be7f099eb7864d16f7705f798 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:05:22 -0700 Subject: [PATCH 032/158] virtio-balloon: switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM Commit 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker") changed the behavior when deflation happens automatically. Instead of deflating when called by the OOM handler, the shrinker is used. However, the balloon is not simply some other slab cache that should be shrunk when under memory pressure. The shrinker does not have a concept of priorities yet, so this behavior cannot be configured. Eventually once that is in place, we might want to switch back after doing proper testing. There was a report that this results in undesired side effects when inflating the balloon to shrink the page cache. [1] "When inflating the balloon against page cache (i.e. no free memory remains) vmscan.c will both shrink page cache, but also invoke the shrinkers -- including the balloon's shrinker. So the balloon driver allocates memory which requires reclaim, vmscan gets this memory by shrinking the balloon, and then the driver adds the memory back to the balloon. Basically a busy no-op." The name "deflate on OOM" makes it pretty clear when deflation should happen - after other approaches to reclaim memory failed, not while reclaiming. This allows to minimize the footprint of a guest - memory will only be taken out of the balloon when really needed. Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because this has no such side effects. Always register the shrinker with VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free pages that are still to be processed by the guest. The hypervisor takes care of identifying and resolving possible races between processing a hinting request and the guest reusing a page. In contrast to pre commit 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker"), don't add a module parameter to configure the number of pages to deflate on OOM. Can be re-added if really needed. Also, pay attention that leak_balloon() returns the number of 4k pages - convert it properly in virtio_balloon_oom_notify(). Testing done by Tyler for future reference: Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42 GB file full of random bytes that we continually cat to /dev/null. This fills the page cache as the file is read. Meanwhile, we trigger the balloon to inflate, with a target size of 53 GB. This setup causes the balloon inflation to pressure the page cache as the page cache is also trying to grow. Afterwards we shrink the balloon back to zero (so total deflate == total inflate). Without this patch (kernel 4.19.0-5): Inflation never reaches the target until we stop the "cat file > /dev/null" process. Total inflation time was 542 seconds. The longest period that made no net forward progress was 315 seconds. Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 154828377 balloon_deflate 154828377 With this patch (kernel 5.6.0-rc4+): Total inflation duration was 63 seconds. No deflate-queue activity occurs when pressuring the page-cache. Result of "grep balloon /proc/vmstat" after the test: balloon_inflate 12968539 balloon_deflate 12968539 Conclusion: This patch fixes the issue. In the test it reduced inflate/deflate activity by 12x, and reduced inflation time by 8.6x. 
But more importantly, if we hadn't killed the "cat file > /dev/null" process then, without the patch, the inflation process would never reach the target. [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html Link: http://lkml.kernel.org/r/20200311135523.18512-2-david@redhat.com Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker") Signed-off-by: David Hildenbrand Reported-by: Tyler Sanderson Tested-by: Tyler Sanderson Acked-by: David Rientjes Acked-by: Michael S. Tsirkin Cc: Wei Wang Cc: Alexander Duyck Cc: David Rientjes Cc: Nadav Amit Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/virtio/virtio_balloon.c | 103 +++++++++++++++----------------- 1 file changed, 47 insertions(+), 56 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 9e0177529bad..0ef16566c3f3 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -28,7 +29,9 @@ */ #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT) #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256 -#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80 +/* Maximum number of (4k) pages to deflate on OOM notifications. */ +#define VIRTIO_BALLOON_OOM_NR_PAGES 256 +#define VIRTIO_BALLOON_OOM_NOTIFY_PRIORITY 80 #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \ __GFP_NOMEMALLOC) @@ -114,9 +117,12 @@ struct virtio_balloon { /* Memory statistics */ struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR]; - /* To register a shrinker to shrink memory upon memory pressure */ + /* Shrinker to return free pages - VIRTIO_BALLOON_F_FREE_PAGE_HINT */ struct shrinker shrinker; + /* OOM notifier to deflate on OOM - VIRTIO_BALLOON_F_DEFLATE_ON_OOM */ + struct notifier_block oom_nb; + /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; @@ -830,50 +836,13 @@ static unsigned long shrink_free_pages(struct virtio_balloon *vb, return blocks_freed * VIRTIO_BALLOON_HINT_BLOCK_PAGES; } -static unsigned long leak_balloon_pages(struct virtio_balloon *vb, - unsigned long pages_to_free) -{ - return leak_balloon(vb, pages_to_free * VIRTIO_BALLOON_PAGES_PER_PAGE) / - VIRTIO_BALLOON_PAGES_PER_PAGE; -} - -static unsigned long shrink_balloon_pages(struct virtio_balloon *vb, - unsigned long pages_to_free) -{ - unsigned long pages_freed = 0; - - /* - * One invocation of leak_balloon can deflate at most - * VIRTIO_BALLOON_ARRAY_PFNS_MAX balloon pages, so we call it - * multiple times to deflate pages till reaching pages_to_free. 
- */ - while (vb->num_pages && pages_freed < pages_to_free) - pages_freed += leak_balloon_pages(vb, - pages_to_free - pages_freed); - - update_balloon_size(vb); - - return pages_freed; -} - static unsigned long virtio_balloon_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc) { - unsigned long pages_to_free, pages_freed = 0; struct virtio_balloon *vb = container_of(shrinker, struct virtio_balloon, shrinker); - pages_to_free = sc->nr_to_scan; - - if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) - pages_freed = shrink_free_pages(vb, pages_to_free); - - if (pages_freed >= pages_to_free) - return pages_freed; - - pages_freed += shrink_balloon_pages(vb, pages_to_free - pages_freed); - - return pages_freed; + return shrink_free_pages(vb, sc->nr_to_scan); } static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker, @@ -881,12 +850,22 @@ static unsigned long virtio_balloon_shrinker_count(struct shrinker *shrinker, { struct virtio_balloon *vb = container_of(shrinker, struct virtio_balloon, shrinker); - unsigned long count; - count = vb->num_pages / VIRTIO_BALLOON_PAGES_PER_PAGE; - count += vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES; + return vb->num_free_page_blocks * VIRTIO_BALLOON_HINT_BLOCK_PAGES; +} - return count; +static int virtio_balloon_oom_notify(struct notifier_block *nb, + unsigned long dummy, void *parm) +{ + struct virtio_balloon *vb = container_of(nb, + struct virtio_balloon, oom_nb); + unsigned long *freed = parm; + + *freed += leak_balloon(vb, VIRTIO_BALLOON_OOM_NR_PAGES) / + VIRTIO_BALLOON_PAGES_PER_PAGE; + update_balloon_size(vb); + + return NOTIFY_OK; } static void virtio_balloon_unregister_shrinker(struct virtio_balloon *vb) @@ -971,7 +950,23 @@ static int virtballoon_probe(struct virtio_device *vdev) VIRTIO_BALLOON_CMD_ID_STOP); spin_lock_init(&vb->free_page_list_lock); INIT_LIST_HEAD(&vb->free_page_list); + /* + * We're allowed to reuse any free pages, even if they are + * still to be processed by the host. + */ + err = virtio_balloon_register_shrinker(vb); + if (err) + goto out_del_balloon_wq; } + + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) { + vb->oom_nb.notifier_call = virtio_balloon_oom_notify; + vb->oom_nb.priority = VIRTIO_BALLOON_OOM_NOTIFY_PRIORITY; + err = register_oom_notifier(&vb->oom_nb); + if (err < 0) + goto out_unregister_shrinker; + } + if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) { /* Start with poison val of 0 representing general init */ __u32 poison_val = 0; @@ -986,15 +981,6 @@ static int virtballoon_probe(struct virtio_device *vdev) virtio_cwrite(vb->vdev, struct virtio_balloon_config, poison_val, &poison_val); } - /* - * We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a - * shrinker needs to be registered to relieve memory pressure. 
- */ - if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) { - err = virtio_balloon_register_shrinker(vb); - if (err) - goto out_del_balloon_wq; - } vb->pr_dev_info.report = virtballoon_free_page_report; if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) { @@ -1003,12 +989,12 @@ static int virtballoon_probe(struct virtio_device *vdev) capacity = virtqueue_get_vring_size(vb->reporting_vq); if (capacity < PAGE_REPORTING_CAPACITY) { err = -ENOSPC; - goto out_unregister_shrinker; + goto out_unregister_oom; } err = page_reporting_register(&vb->pr_dev_info); if (err) - goto out_unregister_shrinker; + goto out_unregister_oom; } virtio_device_ready(vdev); @@ -1017,8 +1003,11 @@ static int virtballoon_probe(struct virtio_device *vdev) virtballoon_changed(vdev); return 0; -out_unregister_shrinker: +out_unregister_oom: if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) + unregister_oom_notifier(&vb->oom_nb); +out_unregister_shrinker: + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) virtio_balloon_unregister_shrinker(vb); out_del_balloon_wq: if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) destroy_workqueue(vb->balloon_wq); @@ -1061,6 +1050,8 @@ static void virtballoon_remove(struct virtio_device *vdev) if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) page_reporting_unregister(&vb->pr_dev_info); if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) + unregister_oom_notifier(&vb->oom_nb); + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) virtio_balloon_unregister_shrinker(vb); spin_lock_irq(&vb->stop_update_lock); vb->stop_update = true;

From 1df319e0b4dee11436fe2ab1a0d536d3fad7cfef Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Mon, 6 Apr 2020 20:05:25 -0700 Subject: [PATCH 033/158] userfaultfd: wp: add helper for writeprotect check

Patch series "userfaultfd: write protection support", v6.

Overview ========

The uffd-wp work was initiated by Shaohua Li [1], and later continued by Andrea [2]. This series is based upon Andrea's latest userfaultfd tree, and it is a continuation of the work from both Shaohua and Andrea. Many of the follow up ideas come from Andrea too.

Besides the old MISSING register mode of userfaultfd, the new uffd-wp support provides another alternative register mode called UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing page faults but also write protection page faults, or the two can even be registered together. At the same time, the new feature also provides a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows userspace to write protect a range of memory or fix up the write permission of faulted pages. Please refer to the document patch "userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update" for more information on the new interface and what it can do.

The major workflow of an uffd-wp program should be: 1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP 2. Write protect part of the whole registered region using UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to show that we want to write protect the range. 3. Start a working thread that modifies the protected pages, meanwhile listening to UFFD messages. 4. When a write is detected upon the protected range, page fault happens, a UFFD message will be generated and reported to the page fault handling thread. 5.
The page fault handler thread resolves the page fault using the new UFFDIO_WRITEPROTECT ioctl, but this time passing in !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to recover the write permission. Before this operation, the fault handler thread can do anything it wants, e.g., dumps the page to a persistent storage. 6. The worker thread will continue running with the correctly applied write permission from step 5. Currently there are already two projects that are based on this new userfaultfd feature. QEMU Live Snapshot: The project provides a way to allow the QEMU hypervisor to take snapshot of VMs without stopping the VM [3]. LLNL umap library: The project provides a mmap-like interface and "allow to have an application specific buffer of pages cached from a large file, i.e. out-of-core execution using memory map" [4][5]. Before posting the patchset, this series was smoke tested against QEMU live snapshot and the LLNL umap library (by doing parallel quicksort using 128 sorting threads + 80 uffd servicing threads). My sincere thanks to Marty Mcfadden and Denis Plotnikov for the help along the way. TODO ==== - hugetlbfs/shmem support - performance - more architectures - cooperate with mprotect()-allowed processes (???) - ... References ========== [1] https://lwn.net/Articles/666187/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm [4] https://github.com/LLNL/umap [5] https://llnl-umap.readthedocs.io/en/develop/ [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5 [7] https://lkml.org/lkml/2018/11/21/370 [8] https://lkml.org/lkml/2018/12/30/64 This patch (of 19): Add helper for writeprotect check. Will use it later. Signed-off-by: Shaohua Li Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Mel Gorman Cc: Hugh Dickins Cc: Johannes Weiner Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . 
David Alan Gilbert" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mike Kravetz Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/userfaultfd_k.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index ac9d71e24b81..5dc247af0f2e 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -52,6 +52,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma) return vma->vm_flags & VM_UFFD_MISSING; } +static inline bool userfaultfd_wp(struct vm_area_struct *vma) +{ + return vma->vm_flags & VM_UFFD_WP; +} + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP); @@ -96,6 +101,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma) return false; } +static inline bool userfaultfd_wp(struct vm_area_struct *vma) +{ + return false; +} + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return false; From 529b930b87d997c3d231c9d8638a0bf8db569d70 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:05:29 -0700 Subject: [PATCH 034/158] userfaultfd: wp: hook userfault handler to write protection fault There are several cases write protection fault happens. It could be a write to zero page, swaped page or userfault write protected page. When the fault happens, there is no way to know if userfault write protect the page before. Here we just blindly issue a userfault notification for vma with VM_UFFD_WP regardless if app write protects it yet. Application should be ready to handle such wp fault. In the swapin case, always swapin as readonly. This will cause false positive userfaults. We need to decide later if to eliminate them with a flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY). hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be handled by a swap entry bit like anonymous memory. The main problem with no easy solution to eliminate the false positives, will be if/when userfaultfd is extended to real filesystem pagecache. When the pagecache is freed by reclaim we can't leave the radix tree pinned if the inode and in turn the radix tree is reclaimed as well. The estimation is that full accuracy and lack of false positives could be easily provided only to anonymous memory (as long as there's no fork or as long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs and hugetlbfs, it's most certainly worth to achieve it but in a later incremental patch. [peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page] Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Mike Rapoport Reviewed-by: Jerome Glisse Cc: Shaohua Li Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . 
Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com Signed-off-by: Linus Torvalds --- mm/memory.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/memory.c b/mm/memory.c index d527e0ec29c7..46aa79600ed8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2752,6 +2752,11 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; + if (userfaultfd_wp(vma)) { + pte_unmap_unlock(vmf->pte, vmf->ptl); + return handle_userfault(vmf, VM_UFFD_WP); + } + vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); if (!vmf->page) { /* @@ -3948,8 +3953,11 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) /* `inline' is required to avoid gcc 4.1.2 build error */ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) { - if (vma_is_anonymous(vmf->vma)) + if (vma_is_anonymous(vmf->vma)) { + if (userfaultfd_wp(vmf->vma)) + return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf, orig_pmd); + } if (vmf->vma->vm_ops->huge_fault) { vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); From 5a281062af1d43d3f3956a6b429c2d727bc92603 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:05:33 -0700 Subject: [PATCH 035/158] userfaultfd: wp: add WP pagetable tracking to x86 Accurate userfaultfd WP tracking is possible by tracking exactly which virtual memory ranges were writeprotected by userland. We can't relay only on the RW bit of the mapped pagetable because that information is destroyed by fork() or KSM or swap. If we were to relay on that, we'd need to stay on the safe side and generate false positive wp faults for every swapped out page. [peterx@redhat.com: append _PAGE_UFD_WP to _PAGE_CHG_MASK] Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . 
Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 52 ++++++++++++++++++++++++++++ arch/x86/include/asm/pgtable_64.h | 8 ++++- arch/x86/include/asm/pgtable_types.h | 12 ++++++- include/asm-generic/pgtable.h | 1 + include/asm-generic/pgtable_uffd.h | 51 +++++++++++++++++++++++++++ init/Kconfig | 5 +++ 7 files changed, 128 insertions(+), 2 deletions(-) create mode 100644 include/asm-generic/pgtable_uffd.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1edf788d301c..8d078642b4be 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -149,6 +149,7 @@ config X86 select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64 + select HAVE_ARCH_USERFAULTFD_WP if USERFAULTFD select HAVE_ARCH_VMAP_STACK if X86_64 select HAVE_ARCH_WITHIN_STACK_FRAMES select HAVE_ASM_MODVERSIONS diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index afda66a6d325..c37e1649fb7e 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -25,6 +25,7 @@ #include #include #include +#include extern pgd_t early_top_pgt[PTRS_PER_PGD]; int __init __early_make_pgtable(unsigned long address, pmdval_t pmd); @@ -313,6 +314,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear) return native_make_pte(v & ~clear); } +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline int pte_uffd_wp(pte_t pte) +{ + return pte_flags(pte) & _PAGE_UFFD_WP; +} + +static inline pte_t pte_mkuffd_wp(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_UFFD_WP); +} + +static inline pte_t pte_clear_uffd_wp(pte_t pte) +{ + return pte_clear_flags(pte, _PAGE_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + static inline pte_t pte_mkclean(pte_t pte) { return pte_clear_flags(pte, _PAGE_DIRTY); @@ -392,6 +410,23 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear) return native_make_pmd(v & ~clear); } +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline int pmd_uffd_wp(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_UFFD_WP; +} + +static inline pmd_t pmd_mkuffd_wp(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_UFFD_WP); +} + +static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + static inline pmd_t pmd_mkold(pmd_t pmd) { return pmd_clear_flags(pmd, _PAGE_ACCESSED); @@ -1374,6 +1409,23 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd) #endif #endif +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static inline pte_t pte_swp_mkuffd_wp(pte_t pte) +{ + return pte_set_flags(pte, _PAGE_SWP_UFFD_WP); +} + +static inline int pte_swp_uffd_wp(pte_t pte) +{ + return pte_flags(pte) & _PAGE_SWP_UFFD_WP; +} + +static inline pte_t pte_swp_clear_uffd_wp(pte_t pte) +{ + return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP); +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + #define PKRU_AD_BIT 0x1 #define PKRU_WD_BIT 0x2 #define PKRU_BITS_PER_PKEY 2 diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 0b6c4042942a..df1373415f11 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -189,7 +189,7 @@ extern void sync_global_pgds(unsigned long start, unsigned long end); * * | ... 
| 11| 10| 9|8|7|6|5| 4| 3|2| 1|0| <- bit number * | ... |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names - * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|X|SD|0| <- swp entry + * | TYPE (59-63) | ~OFFSET (9-58) |0|0|X|X| X| X|F|SD|0| <- swp entry * * G (8) is aliased and used as a PROT_NONE indicator for * !present ptes. We need to start storing swap entries above @@ -197,9 +197,15 @@ extern void sync_global_pgds(unsigned long start, unsigned long end); * erratum where they can be incorrectly set by hardware on * non-present PTEs. * + * SD Bits 1-4 are not used in non-present format and available for + * special use described below: + * * SD (1) in swp entry is used to store soft dirty bit, which helps us * remember soft dirty over page migration * + * F (2) in swp entry is used to record when a pagetable is + * writeprotected by userfaultfd WP support. + * * Bit 7 in swp entry should be 0 because pmd_present checks not only P, * but also L and G. * diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 65c2ecd730c5..b6606fe6cfdf 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -32,6 +32,7 @@ #define _PAGE_BIT_SPECIAL _PAGE_BIT_SOFTW1 #define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1 +#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 @@ -100,6 +101,14 @@ #define _PAGE_SWP_SOFT_DIRTY (_AT(pteval_t, 0)) #endif +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP +#define _PAGE_UFFD_WP (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP) +#define _PAGE_SWP_UFFD_WP _PAGE_USER +#else +#define _PAGE_UFFD_WP (_AT(pteval_t, 0)) +#define _PAGE_SWP_UFFD_WP (_AT(pteval_t, 0)) +#endif + #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) @@ -118,7 +127,8 @@ */ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC) + _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ + _PAGE_UFFD_WP) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index e2e2bef07dd2..329b8c8ca703 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -10,6 +10,7 @@ #include #include #include +#include #if 5 - defined(__PAGETABLE_P4D_FOLDED) - defined(__PAGETABLE_PUD_FOLDED) - \ defined(__PAGETABLE_PMD_FOLDED) != CONFIG_PGTABLE_LEVELS diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h new file mode 100644 index 000000000000..643d1bf559c2 --- /dev/null +++ b/include/asm-generic/pgtable_uffd.h @@ -0,0 +1,51 @@ +#ifndef _ASM_GENERIC_PGTABLE_UFFD_H +#define _ASM_GENERIC_PGTABLE_UFFD_H + +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP +static __always_inline int pte_uffd_wp(pte_t pte) +{ + return 0; +} + +static __always_inline int pmd_uffd_wp(pmd_t pmd) +{ + return 0; +} + +static __always_inline pte_t pte_mkuffd_wp(pte_t pte) +{ + return pte; +} + +static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd) +{ + return pmd; +} + +static __always_inline pte_t pte_clear_uffd_wp(pte_t pte) +{ + return pte; +} + +static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd) +{ + return pmd; +} + +static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte) +{ + return pte; +} + +static __always_inline int 
pte_swp_uffd_wp(pte_t pte) +{ + return 0; +} + +static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte) +{ + return pte; +} +#endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ + +#endif /* _ASM_GENERIC_PGTABLE_UFFD_H */ diff --git a/init/Kconfig b/init/Kconfig index 1c12059e0f7e..2bc0e47689d8 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1556,6 +1556,11 @@ config ADVISE_SYSCALLS applications use these syscalls, you can disable this option to save space. +config HAVE_ARCH_USERFAULTFD_WP + bool + help + Arch has userfaultfd write protection support + config MEMBARRIER bool "Enable membarrier() system call" if EXPERT default y From 55adf4de30346e0ec6b988cb9b885a5dddc954af Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:05:37 -0700 Subject: [PATCH 036/158] userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpers Implement helpers methods to invoke userfaultfd wp faults more selectively: not only when a wp fault triggers on a vma with vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set in the pagetable too. Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/userfaultfd_k.h | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 5dc247af0f2e..7b91b76aac58 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -14,6 +14,8 @@ #include /* linux/include/uapi/linux/userfaultfd.h */ #include +#include +#include /* * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining @@ -57,6 +59,18 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma) return vma->vm_flags & VM_UFFD_WP; } +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma, + pte_t pte) +{ + return userfaultfd_wp(vma) && pte_uffd_wp(pte); +} + +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma, + pmd_t pmd) +{ + return userfaultfd_wp(vma) && pmd_uffd_wp(pmd); +} + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP); @@ -106,6 +120,19 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma) return false; } +static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma, + pte_t pte) +{ + return false; +} + +static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma, + pmd_t pmd) +{ + return false; +} + + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return false; From 72981e0e7b609c741d7764cc920c8fec00920bd5 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:05:41 -0700 Subject: [PATCH 037/158] userfaultfd: wp: add UFFDIO_COPY_MODE_WP This allows UFFDIO_COPY to map pages write-protected. 
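As a sketch of the intended use (illustrative only; "uffd", "src", "dst" and "len" are assumed to have been set up by the caller, and the range registered with UFFDIO_REGISTER_MODE_WP):

	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = len,
		/* map the page write-protected on the fly */
		.mode = UFFDIO_COPY_MODE_WP,
	};

	if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
		err(1, "UFFDIO_COPY");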
[peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and commit messages] Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 5 +++-- include/linux/userfaultfd_k.h | 2 +- include/uapi/linux/userfaultfd.h | 13 ++++++------ mm/userfaultfd.c | 36 ++++++++++++++++++++++---------- 4 files changed, 36 insertions(+), 20 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 703c1c3faa6e..c49bef505775 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1724,11 +1724,12 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx, ret = -EINVAL; if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src) goto out; - if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE) + if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP)) goto out; if (mmget_not_zero(ctx->mm)) { ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src, - uffdio_copy.len, &ctx->mmap_changing); + uffdio_copy.len, &ctx->mmap_changing, + uffdio_copy.mode); mmput(ctx->mm); } else { return -ESRCH; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 7b91b76aac58..dcd33172b728 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -36,7 +36,7 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, - bool *mmap_changing); + bool *mmap_changing, __u64 mode); extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 48f1a7c2f1f0..340f23bc251d 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -203,13 +203,14 @@ struct uffdio_copy { __u64 dst; __u64 src; __u64 len; - /* - * There will be a wrprotection flag later that allows to map - * pages wrprotected on the fly. And such a flag will be - * available if the wrprotection ioctl are implemented for the - * range according to the uffdio_register.ioctls. - */ #define UFFDIO_COPY_MODE_DONTWAKE ((__u64)1<<0) + /* + * UFFDIO_COPY_MODE_WP will map the page write protected on + * the fly. UFFDIO_COPY_MODE_WP is available only if the + * write protected ioctl is implemented for the range + * according to the uffdio_register.ioctls. 
+ */ +#define UFFDIO_COPY_MODE_WP ((__u64)1<<1) __u64 mode; /* diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index bd96855f3961..05dbbcafdcc0 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -53,7 +53,8 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, - struct page **pagep) + struct page **pagep, + bool wp_copy) { struct mem_cgroup *memcg; pte_t _dst_pte, *dst_pte; @@ -99,9 +100,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false)) goto out_release; - _dst_pte = mk_pte(page, dst_vma->vm_page_prot); - if (dst_vma->vm_flags & VM_WRITE) - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); + _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot)); + if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy) + _dst_pte = pte_mkwrite(_dst_pte); dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); if (dst_vma->vm_file) { @@ -415,7 +416,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, unsigned long dst_addr, unsigned long src_addr, struct page **page, - bool zeropage) + bool zeropage, + bool wp_copy) { ssize_t err; @@ -432,11 +434,13 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, if (!(dst_vma->vm_flags & VM_SHARED)) { if (!zeropage) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, page); + dst_addr, src_addr, page, + wp_copy); else err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, dst_addr); } else { + VM_WARN_ON_ONCE(wp_copy); if (!zeropage) err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, @@ -454,7 +458,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, unsigned long src_start, unsigned long len, bool zeropage, - bool *mmap_changing) + bool *mmap_changing, + __u64 mode) { struct vm_area_struct *dst_vma; ssize_t err; @@ -462,6 +467,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, unsigned long src_addr, dst_addr; long copied; struct page *page; + bool wp_copy; /* * Sanitize the command parameters: @@ -507,6 +513,14 @@ retry: dst_vma->vm_flags & VM_SHARED)) goto out_unlock; + /* + * validate 'mode' now that we know the dst_vma: don't allow + * a wrprotect copy if the userfaultfd didn't register as WP. 
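+	 * (Registering with UFFDIO_REGISTER_MODE_WP is what sets
+	 * VM_UFFD_WP on the vma, so the check below rejects a
+	 * UFFDIO_COPY_MODE_WP request on ranges that never asked for
+	 * write-protect tracking.)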
+ */ + wp_copy = mode & UFFDIO_COPY_MODE_WP; + if (wp_copy && !(dst_vma->vm_flags & VM_UFFD_WP)) + goto out_unlock; + /* * If this is a HUGETLB vma, pass off to appropriate routine */ @@ -562,7 +576,7 @@ retry: BUG_ON(pmd_trans_huge(*dst_pmd)); err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, - src_addr, &page, zeropage); + src_addr, &page, zeropage, wp_copy); cond_resched(); if (unlikely(err == -ENOENT)) { @@ -609,14 +623,14 @@ out: ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, - bool *mmap_changing) + bool *mmap_changing, __u64 mode) { return __mcopy_atomic(dst_mm, dst_start, src_start, len, false, - mmap_changing); + mmap_changing, mode); } ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start, unsigned long len, bool *mmap_changing) { - return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing); + return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0); } From 58705444c45b3ca987b03bd9beb41bbbe41ae439 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:05:45 -0700 Subject: [PATCH 038/158] mm: merge parameters for change_protection() change_protection() is used by both the NUMA balancing and the mprotect() code, and there is one parameter for each of these callers (dirty_accountable and prot_numa). Further, these parameters are passed along the call chain: - change_protection_range() - change_p4d_range() - change_pud_range() - change_pmd_range() - ... Now we introduce a flags argument for change_protection() and all these helpers to replace those parameters, so we avoid threading several parameters through every level. More importantly, it will greatly simplify the work if we want to introduce any new parameters to change_protection(). In the follow-up patches, a new parameter for userfaultfd write protection will be introduced. No functional change at all.
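For illustration (not part of the patch itself), here is how the two existing call sites translate to the new cp_flags bitmap; a minimal sketch distilled from the mempolicy.c and mprotect.c hunks below:

	/* before: one int argument per caller */
	change_protection(vma, start, end, newprot, dirty_accountable, 0);
	change_protection(vma, addr, end, PAGE_NONE, 0, 1);

	/* after: a single cp_flags bitmap carries both */
	change_protection(vma, start, end, newprot,
			  dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
	change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA);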
Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 2 +- include/linux/mm.h | 14 +++++++++++++- mm/huge_memory.c | 3 ++- mm/mempolicy.c | 2 +- mm/mprotect.c | 29 ++++++++++++++++------------- 5 files changed, 33 insertions(+), 17 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f2df2247026a..cfbb0a87c5f0 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -46,7 +46,7 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, pmd_t *old_pmd, pmd_t *new_pmd); extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, pgprot_t newprot, - int prot_numa); + unsigned long cp_flags); vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn, pgprot_t pgprot, bool write); diff --git a/include/linux/mm.h b/include/linux/mm.h index be49e371e4b5..c7d87ff5027b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1771,9 +1771,21 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, bool need_rmap_locks); + +/* + * Flags used by change_protection(). For now we make it a bitmap so + * that we can pass in multiple flags just like parameters. However + * for now all the callers are only use one of the flags at the same + * time. + */ +/* Whether we should allow dirty bit accounting */ +#define MM_CP_DIRTY_ACCT (1UL << 0) +/* Whether this protection change is for NUMA hints */ +#define MM_CP_PROT_NUMA (1UL << 1) + extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot, - int dirty_accountable, int prot_numa); + unsigned long cp_flags); extern int mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, unsigned long start, unsigned long end, unsigned long newflags); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c1e7c71db1e6..dc12249af6df 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1979,13 +1979,14 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, * - HPAGE_PMD_NR is protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, - unsigned long addr, pgprot_t newprot, int prot_numa) + unsigned long addr, pgprot_t newprot, unsigned long cp_flags) { struct mm_struct *mm = vma->vm_mm; spinlock_t *ptl; pmd_t entry; bool preserve_write; int ret; + bool prot_numa = cp_flags & MM_CP_PROT_NUMA; ptl = __pmd_trans_huge_lock(pmd, vma); if (!ptl) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 037e5f548118..145be04b7108 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -627,7 +627,7 @@ unsigned long change_prot_numa(struct vm_area_struct *vma, { int nr_updated; - nr_updated = change_protection(vma, addr, end, PAGE_NONE, 0, 1); + nr_updated = change_protection(vma, addr, end, PAGE_NONE, MM_CP_PROT_NUMA); if (nr_updated) count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated); diff --git a/mm/mprotect.c b/mm/mprotect.c index 0fee14b39416..046e0889e65f 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -37,12 +37,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, - int dirty_accountable, int 
prot_numa) + unsigned long cp_flags) { pte_t *pte, oldpte; spinlock_t *ptl; unsigned long pages = 0; int target_node = NUMA_NO_NODE; + bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT; + bool prot_numa = cp_flags & MM_CP_PROT_NUMA; /* * Can be called with only the mmap_sem for reading by @@ -188,7 +190,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd) static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud, unsigned long addr, unsigned long end, - pgprot_t newprot, int dirty_accountable, int prot_numa) + pgprot_t newprot, unsigned long cp_flags) { pmd_t *pmd; unsigned long next; @@ -229,7 +231,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, __split_huge_pmd(vma, pmd, addr, false, NULL); } else { int nr_ptes = change_huge_pmd(vma, pmd, addr, - newprot, prot_numa); + newprot, cp_flags); if (nr_ptes) { if (nr_ptes == HPAGE_PMD_NR) { @@ -244,7 +246,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, /* fall through, the trans huge pmd just split */ } this_pages = change_pte_range(vma, pmd, addr, next, newprot, - dirty_accountable, prot_numa); + cp_flags); pages += this_pages; next: cond_resched(); @@ -260,7 +262,7 @@ next: static inline unsigned long change_pud_range(struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, - pgprot_t newprot, int dirty_accountable, int prot_numa) + pgprot_t newprot, unsigned long cp_flags) { pud_t *pud; unsigned long next; @@ -272,7 +274,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, if (pud_none_or_clear_bad(pud)) continue; pages += change_pmd_range(vma, pud, addr, next, newprot, - dirty_accountable, prot_numa); + cp_flags); } while (pud++, addr = next, addr != end); return pages; @@ -280,7 +282,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, static inline unsigned long change_p4d_range(struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr, unsigned long end, - pgprot_t newprot, int dirty_accountable, int prot_numa) + pgprot_t newprot, unsigned long cp_flags) { p4d_t *p4d; unsigned long next; @@ -292,7 +294,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma, if (p4d_none_or_clear_bad(p4d)) continue; pages += change_pud_range(vma, p4d, addr, next, newprot, - dirty_accountable, prot_numa); + cp_flags); } while (p4d++, addr = next, addr != end); return pages; @@ -300,7 +302,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma, static unsigned long change_protection_range(struct vm_area_struct *vma, unsigned long addr, unsigned long end, pgprot_t newprot, - int dirty_accountable, int prot_numa) + unsigned long cp_flags) { struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; @@ -317,7 +319,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, if (pgd_none_or_clear_bad(pgd)) continue; pages += change_p4d_range(vma, pgd, addr, next, newprot, - dirty_accountable, prot_numa); + cp_flags); } while (pgd++, addr = next, addr != end); /* Only flush the TLB if we actually modified any entries: */ @@ -330,14 +332,15 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot, - int dirty_accountable, int prot_numa) + unsigned long cp_flags) { unsigned long pages; if (is_vm_hugetlb_page(vma)) pages = hugetlb_change_protection(vma, start, end, newprot); else - pages = 
change_protection_range(vma, start, end, newprot, dirty_accountable, prot_numa); + pages = change_protection_range(vma, start, end, newprot, + cp_flags); return pages; } @@ -459,7 +462,7 @@ success: vma_set_page_prot(vma); change_protection(vma, start, end, vma->vm_page_prot, - dirty_accountable, 0); + dirty_accountable ? MM_CP_DIRTY_ACCT : 0); /* * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major From 292924b260247483a58916f6d3550d8c92f32f55 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:05:49 -0700 Subject: [PATCH 039/158] userfaultfd: wp: apply _PAGE_UFFD_WP bit Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for change_protection() when used with uffd-wp and make sure the two new flags are exclusively used. Then, - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW when a range of memory is write protected by uffd - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover _PAGE_RW when write protection is resolved from userspace And use this new interface in mwriteprotect_range() to replace the old MM_CP_DIRTY_ACCT. Do this change for both PTEs and huge PMDs. Then we can start to identify which PTE/PMD is write protected for general reasons (e.g., COW or soft dirty tracking), and which is for userfaultfd-wp. Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we can be even more strict when detecting uffd-wp page faults in either do_wp_page() or wp_huge_pmd(). Now that we have _PAGE_UFFD_WP, a special case is when a page is both protected by the general COW logic and also userfault-wp. Here the userfault-wp will have higher priority and will be handled first. Only after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle the general COW. These are the steps of what will happen with such a page: 1. The CPU accesses the write protected shared page (so it is protected by both general COW and uffd-wp) and is blocked by uffd-wp first, because do_wp_page() handles uffd-wp first, so it has higher priority than general COW. 2. The uffd service thread receives the request and does UFFDIO_WRITEPROTECT to remove the uffd-wp bit on the PTE/PMD. However, here we still keep the write bit cleared, and notify the blocked CPU. 3. The blocked CPU resumes the page fault process with a fault retry; during the retry it will notice that the uffd-wp bit is gone this time but the page is still write protected by general COW, so it goes through the COW path in the fault handler, copies the page, applies the write bit where necessary, and retries again. 4. The CPU will be able to access this page with the write bit set. Suggested-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Brian Geffon Cc: Pavel Emelyanov Cc: Mike Kravetz Cc: David Hildenbrand Cc: Martin Cracauer Cc: Mel Gorman Cc: Bobby Powers Cc: Mike Rapoport Cc: "Kirill A . Shutemov" Cc: Maya Gokhale Cc: Johannes Weiner Cc: Marty McFadden Cc: Denis Plotnikov Cc: Hugh Dickins Cc: "Dr . 
David Alan Gilbert" Cc: Jerome Glisse Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/mm.h | 5 +++++ mm/huge_memory.c | 18 +++++++++++++++++- mm/memory.c | 4 ++-- mm/mprotect.c | 17 +++++++++++++++++ mm/userfaultfd.c | 8 ++++++-- 5 files changed, 47 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index c7d87ff5027b..e2f938c5a9d8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1782,6 +1782,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, #define MM_CP_DIRTY_ACCT (1UL << 0) /* Whether this protection change is for NUMA hints */ #define MM_CP_PROT_NUMA (1UL << 1) +/* Whether this change is for write protecting */ +#define MM_CP_UFFD_WP (1UL << 2) /* do wp */ +#define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */ +#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \ + MM_CP_UFFD_WP_RESOLVE) extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, unsigned long end, pgprot_t newprot, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index dc12249af6df..425339491677 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1987,6 +1987,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, bool preserve_write; int ret; bool prot_numa = cp_flags & MM_CP_PROT_NUMA; + bool uffd_wp = cp_flags & MM_CP_UFFD_WP; + bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; ptl = __pmd_trans_huge_lock(pmd, vma); if (!ptl) @@ -2053,6 +2055,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, entry = pmd_modify(entry, newprot); if (preserve_write) entry = pmd_mk_savedwrite(entry); + if (uffd_wp) { + entry = pmd_wrprotect(entry); + entry = pmd_mkuffd_wp(entry); + } else if (uffd_wp_resolve) { + /* + * Leave the write bit to be handled by PF interrupt + * handler, then things like COW could be properly + * handled. 
+ */ + entry = pmd_clear_uffd_wp(entry); + } ret = HPAGE_PMD_NR; set_pmd_at(mm, addr, pmd, entry); BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry)); @@ -2201,7 +2214,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, struct page *page; pgtable_t pgtable; pmd_t old_pmd, _pmd; - bool young, write, soft_dirty, pmd_migration = false; + bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; unsigned long addr; int i; @@ -2283,6 +2296,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, write = pmd_write(old_pmd); young = pmd_young(old_pmd); soft_dirty = pmd_soft_dirty(old_pmd); + uffd_wp = pmd_uffd_wp(old_pmd); } VM_BUG_ON_PAGE(!page_count(page), page); page_ref_add(page, HPAGE_PMD_NR - 1); @@ -2316,6 +2330,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pte_mkold(entry); if (soft_dirty) entry = pte_mksoft_dirty(entry); + if (uffd_wp) + entry = pte_mkuffd_wp(entry); } pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); diff --git a/mm/memory.c b/mm/memory.c index 46aa79600ed8..f35821b43c1b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2752,7 +2752,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - if (userfaultfd_wp(vma)) { + if (userfaultfd_pte_wp(vma, *vmf->pte)) { pte_unmap_unlock(vmf->pte, vmf->ptl); return handle_userfault(vmf, VM_UFFD_WP); } @@ -3954,7 +3954,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) { if (vma_is_anonymous(vmf->vma)) { - if (userfaultfd_wp(vmf->vma)) + if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd)) return handle_userfault(vmf, VM_UFFD_WP); return do_huge_pmd_wp_page(vmf, orig_pmd); } diff --git a/mm/mprotect.c b/mm/mprotect.c index 046e0889e65f..e4fa41a24bec 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -45,6 +45,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, int target_node = NUMA_NO_NODE; bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT; bool prot_numa = cp_flags & MM_CP_PROT_NUMA; + bool uffd_wp = cp_flags & MM_CP_UFFD_WP; + bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; /* * Can be called with only the mmap_sem for reading by @@ -116,6 +118,19 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, if (preserve_write) ptent = pte_mk_savedwrite(ptent); + if (uffd_wp) { + ptent = pte_wrprotect(ptent); + ptent = pte_mkuffd_wp(ptent); + } else if (uffd_wp_resolve) { + /* + * Leave the write bit to be handled + * by PF interrupt handler, then + * things like COW could be properly + * handled. 
+ */ + ptent = pte_clear_uffd_wp(ptent); + } + /* Avoid taking write faults for known dirty pages */ if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || @@ -336,6 +351,8 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start, { unsigned long pages; + BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) == MM_CP_UFFD_WP_ALL); + if (is_vm_hugetlb_page(vma)) pages = hugetlb_change_protection(vma, start, end, newprot); else diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 05dbbcafdcc0..7d6ab05be019 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -101,8 +101,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, goto out_release; _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot)); - if ((dst_vma->vm_flags & VM_WRITE) && !wp_copy) - _dst_pte = pte_mkwrite(_dst_pte); + if (dst_vma->vm_flags & VM_WRITE) { + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else + _dst_pte = pte_mkwrite(_dst_pte); + } dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); if (dst_vma->vm_file) { From b569a1760782f3da03ff718d61f74163dea599ff Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:05:53 -0700 Subject: [PATCH 040/158] userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork UFFD_EVENT_FORK support for uffd-wp should be already there, except that we should clean the uffd-wp bit if uffd fork event is not enabled. Detect that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked by VM_UFFD_WP. Do this for both small PTEs and huge PMDs. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 8 ++++++++ mm/memory.c | 8 ++++++++ 2 files changed, 16 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 425339491677..8164787cd51f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1044,6 +1044,14 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = -EAGAIN; pmd = *src_pmd; + /* + * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA + * does not have the VM_UFFD_WP, which means that the uffd + * fork event is not enabled. + */ + if (!(vma->vm_flags & VM_UFFD_WP)) + pmd = pmd_clear_uffd_wp(pmd); + #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (unlikely(is_swap_pmd(pmd))) { swp_entry_t entry = pmd_to_swp_entry(pmd); diff --git a/mm/memory.c b/mm/memory.c index f35821b43c1b..f8b1969669b7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -785,6 +785,14 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte = pte_mkclean(pte); pte = pte_mkold(pte); + /* + * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA + * does not have the VM_UFFD_WP, which means that the uffd + * fork event is not enabled. 
+ */ + if (!(vm_flags & VM_UFFD_WP)) + pte = pte_clear_uffd_wp(pte); + page = vm_normal_page(vma, addr, pte); if (page) { get_page(page); From 2e3d5dc508cf001c4fb2d15515ebe6f30df88f76 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:05:57 -0700 Subject: [PATCH 041/158] userfaultfd: wp: add pmd_swp_*uffd_wp() helpers Add the missing helpers for uffd-wp operations with pmd swap/migration entries. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com Signed-off-by: Linus Torvalds --- arch/x86/include/asm/pgtable.h | 15 +++++++++++++++ include/asm-generic/pgtable_uffd.h | 15 +++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index c37e1649fb7e..28838d790191 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -1424,6 +1424,21 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte) { return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP); } + +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP); +} + +static inline int pmd_swp_uffd_wp(pmd_t pmd) +{ + return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP; +} + +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd) +{ + return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP); +} #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ #define PKRU_AD_BIT 0x1 diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h index 643d1bf559c2..828966d4c281 100644 --- a/include/asm-generic/pgtable_uffd.h +++ b/include/asm-generic/pgtable_uffd.h @@ -46,6 +46,21 @@ static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte) { return pte; } + +static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd) +{ + return pmd; +} + +static inline int pmd_swp_uffd_wp(pmd_t pmd) +{ + return 0; +} + +static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd) +{ + return pmd; +} #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */ #endif /* _ASM_GENERIC_PGTABLE_UFFD_H */ From f45ec5ff16a75f96dac8c89862d75f1d8739efd4 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:01 -0700 Subject: [PATCH 042/158] userfaultfd: wp: support swap and page migration For both swap and page migration, we use bit 2 of the entry to identify whether the entry is uffd write-protected. It plays a role similar to the existing soft dirty bit in swap entries, but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to a data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for swap entries.
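To make the round trip concrete, here is a condensed sketch of the two halves, distilled from the rmap.c and memory.c hunks below rather than copied literally:

	/* swap out: carry the uffd-wp bit into the swap pte */
	swp_pte = swp_entry_to_pte(entry);
	if (pte_uffd_wp(pteval))
		swp_pte = pte_swp_mkuffd_wp(swp_pte);

	/* swap in: restore the bit and clear _PAGE_RW so writes trap again */
	if (pte_swp_uffd_wp(vmf->orig_pte)) {
		pte = pte_mkuffd_wp(pte);
		pte = pte_wrprotect(pte);
	}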
Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/swapops.h | 2 ++ mm/huge_memory.c | 3 +++ mm/memory.c | 8 ++++++++ mm/migrate.c | 6 ++++++ mm/mprotect.c | 28 +++++++++++++++++----------- mm/rmap.c | 6 ++++++ 6 files changed, 42 insertions(+), 11 deletions(-) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 877fd239b6ff..9a6f06de183b 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -68,6 +68,8 @@ static inline swp_entry_t pte_to_swp_entry(pte_t pte) if (pte_swp_soft_dirty(pte)) pte = pte_swp_clear_soft_dirty(pte); + if (pte_swp_uffd_wp(pte)) + pte = pte_swp_clear_uffd_wp(pte); arch_entry = __pte_to_swp_entry(pte); return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8164787cd51f..6ecd1045113b 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2297,6 +2297,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, write = is_write_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); + uffd_wp = pmd_swp_uffd_wp(old_pmd); } else { page = pmd_page(old_pmd); if (pmd_dirty(old_pmd)) @@ -2329,6 +2330,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = swp_entry_to_pte(swp_entry); if (soft_dirty) entry = pte_swp_mksoft_dirty(entry); + if (uffd_wp) + entry = pte_swp_mkuffd_wp(entry); } else { entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot)); entry = maybe_mkwrite(entry, vma); diff --git a/mm/memory.c b/mm/memory.c index f8b1969669b7..8ac9af73e9d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -733,6 +733,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(*src_pte)) pte = pte_swp_mksoft_dirty(pte); + if (pte_swp_uffd_wp(*src_pte)) + pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } } else if (is_device_private_entry(entry)) { @@ -762,6 +764,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, is_cow_mapping(vm_flags)) { make_device_private_entry_read(&entry); pte = swp_entry_to_pte(entry); + if (pte_swp_uffd_wp(*src_pte)) + pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } } @@ -3098,6 +3102,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) flush_icache_page(vma, page); if (pte_swp_soft_dirty(vmf->orig_pte)) pte = pte_mksoft_dirty(pte); + if (pte_swp_uffd_wp(vmf->orig_pte)) { + pte = pte_mkuffd_wp(pte); + pte = pte_wrprotect(pte); + } set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte); vmf->orig_pte = pte; diff --git a/mm/migrate.c b/mm/migrate.c index c1412e04975e..7160c1556f79 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, entry = pte_to_swp_entry(*pvmw.pte); if (is_write_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); + else if (pte_swp_uffd_wp(*pvmw.pte)) + pte = pte_mkuffd_wp(pte); if 
(unlikely(is_zone_device_page(new))) { if (is_device_private_page(new)) { entry = make_device_private_entry(new, pte_write(pte)); pte = swp_entry_to_pte(entry); + if (pte_swp_uffd_wp(*pvmw.pte)) + pte = pte_mkuffd_wp(pte); } } @@ -2338,6 +2342,8 @@ again: swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pte)) swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pte)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); set_pte_at(mm, addr, ptep, swp_pte); /* diff --git a/mm/mprotect.c b/mm/mprotect.c index e4fa41a24bec..1d823b050329 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -139,11 +139,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, } ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); pages++; - } else if (IS_ENABLED(CONFIG_MIGRATION)) { + } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); pte_t newpte; if (is_write_migration_entry(entry)) { - pte_t newpte; /* * A protection check is difficult so * just be safe and disable write @@ -152,22 +152,28 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, newpte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(oldpte)) newpte = pte_swp_mksoft_dirty(newpte); - set_pte_at(vma->vm_mm, addr, pte, newpte); - - pages++; - } - - if (is_write_device_private_entry(entry)) { - pte_t newpte; - + if (pte_swp_uffd_wp(oldpte)) + newpte = pte_swp_mkuffd_wp(newpte); + } else if (is_write_device_private_entry(entry)) { /* * We do not preserve soft-dirtiness. See * copy_one_pte() for explanation. */ make_device_private_entry_read(&entry); newpte = swp_entry_to_pte(entry); - set_pte_at(vma->vm_mm, addr, pte, newpte); + if (pte_swp_uffd_wp(oldpte)) + newpte = pte_swp_mkuffd_wp(newpte); + } else { + newpte = oldpte; + } + if (uffd_wp) + newpte = pte_swp_mkuffd_wp(newpte); + else if (uffd_wp_resolve) + newpte = pte_swp_clear_uffd_wp(newpte); + + if (!pte_same(oldpte, newpte)) { + set_pte_at(vma->vm_mm, addr, pte, newpte); pages++; } } diff --git a/mm/rmap.c b/mm/rmap.c index 374a9bfdbffa..ed8889bf4ede 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1502,6 +1502,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); /* * No need to invalidate here it will synchronize on @@ -1601,6 +1603,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); /* * No need to invalidate here it will synchronize on @@ -1667,6 +1671,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); set_pte_at(mm, address, pvmw.pte, swp_pte); /* Invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, From e1e267c7928fe387e5e1cffeafb0de2d0473663a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:04 -0700 Subject: [PATCH 043/158] khugepaged: skip collapse if uffd-wp detected Don't collapse the huge PMD if there are any userfault write protected small PTEs.
The problem is that the write protection is in small page granularity and there is no way to keep all this write protection information if the small pages are going to be merged into a huge PMD. The same thing needs to be considered for swap entries and migration entries. So do the check as well, disregarding khugepaged_max_ptes_swap. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/trace/events/huge_memory.h | 1 + mm/khugepaged.c | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+) diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h index d82a0f4e824d..70e32ff096ec 100644 --- a/include/trace/events/huge_memory.h +++ b/include/trace/events/huge_memory.h @@ -13,6 +13,7 @@ EM( SCAN_PMD_NULL, "pmd_null") \ EM( SCAN_EXCEED_NONE_PTE, "exceed_none_pte") \ EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \ + EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \ EM( SCAN_PAGE_RO, "no_writable_page") \ EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \ EM( SCAN_PAGE_NULL, "page_null") \ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 3afc1e2d7a55..99d77ffb79c2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -29,6 +29,7 @@ enum scan_result { SCAN_PMD_NULL, SCAN_EXCEED_NONE_PTE, SCAN_PTE_NON_PRESENT, + SCAN_PTE_UFFD_WP, SCAN_PAGE_RO, SCAN_LACK_REFERENCED_PAGE, SCAN_PAGE_NULL, @@ -1137,6 +1138,15 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, pte_t pteval = *_pte; if (is_swap_pte(pteval)) { if (++unmapped <= khugepaged_max_ptes_swap) { + /* + * Always be strict with uffd-wp + * enabled swap entries. Please see + * comment below for pte_uffd_wp(). + */ + if (pte_swp_uffd_wp(pteval)) { + result = SCAN_PTE_UFFD_WP; + goto out_unmap; + } continue; } else { result = SCAN_EXCEED_SWAP_PTE; @@ -1156,6 +1166,19 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, result = SCAN_PTE_NON_PRESENT; goto out_unmap; } + if (pte_uffd_wp(pteval)) { + /* + * Don't collapse the page if any of the small + * PTEs are armed with uffd write protection. + * Here we can also mark the new huge pmd as + * write protected if any of the small ones is + * marked but that could bring unknown + * userfault messages that fall outside of + * the registered range. So, just be simple. + */ + result = SCAN_PTE_UFFD_WP; + goto out_unmap; + } if (pte_write(pteval)) writable = true; From ffd05793963a44bd119311df3c02b191982574ee Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Mon, 6 Apr 2020 20:06:09 -0700 Subject: [PATCH 044/158] userfaultfd: wp: support write protection for userfault vma range Add an API to enable/disable write protection on a vma range. Unlike mprotect, this doesn't split/merge vmas.
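For illustration, the expected kernel-side caller (the UFFDIO_WRITEPROTECT ioctl handler added later in this series) boils down to a pair of calls along these lines; start and len stand in for the user-supplied range:

	/* arm write protection on [start, start + len) */
	err = mwriteprotect_range(ctx->mm, start, len, true,
				  &ctx->mmap_changing);

	/* later, resolve it again so writes can proceed */
	err = mwriteprotect_range(ctx->mm, start, len, false,
				  &ctx->mmap_changing);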
[peterx@redhat.com: - use the helper to find VMA; - return -ENOENT if not found to match mcopy case; - use the new MM_CP_UFFD_WP* flags for change_protection - check against mmap_changing for failures - replace find_dst_vma with vma_find_uffd] Signed-off-by: Shaohua Li Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Mel Gorman Cc: Hugh Dickins Cc: Johannes Weiner Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mike Kravetz Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/linux/userfaultfd_k.h | 3 ++ mm/userfaultfd.c | 54 +++++++++++++++++++++++++++++++++++ 2 files changed, 57 insertions(+) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index dcd33172b728..a8e5f3ea9bb2 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -41,6 +41,9 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long len, bool *mmap_changing); +extern int mwriteprotect_range(struct mm_struct *dst_mm, + unsigned long start, unsigned long len, + bool enable_wp, bool *mmap_changing); /* mm helpers */ static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma, diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 7d6ab05be019..512576e171ce 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -638,3 +638,57 @@ ssize_t mfill_zeropage(struct mm_struct *dst_mm, unsigned long start, { return __mcopy_atomic(dst_mm, start, 0, len, true, mmap_changing, 0); } + +int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start, + unsigned long len, bool enable_wp, bool *mmap_changing) +{ + struct vm_area_struct *dst_vma; + pgprot_t newprot; + int err; + + /* + * Sanitize the command parameters: + */ + BUG_ON(start & ~PAGE_MASK); + BUG_ON(len & ~PAGE_MASK); + + /* Does the address range wrap, or is the span zero-sized? */ + BUG_ON(start + len <= start); + + down_read(&dst_mm->mmap_sem); + + /* + * If memory mappings are changing because of non-cooperative + * operation (e.g. mremap) running in parallel, bail out and + * request the user to retry later + */ + err = -EAGAIN; + if (mmap_changing && READ_ONCE(*mmap_changing)) + goto out_unlock; + + err = -ENOENT; + dst_vma = find_dst_vma(dst_mm, start, len); + /* + * Make sure the vma is not shared, that the dst range is + * both valid and fully within a single existing vma. + */ + if (!dst_vma || (dst_vma->vm_flags & VM_SHARED)) + goto out_unlock; + if (!userfaultfd_wp(dst_vma)) + goto out_unlock; + if (!vma_is_anonymous(dst_vma)) + goto out_unlock; + + if (enable_wp) + newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE)); + else + newprot = vm_get_page_prot(dst_vma->vm_flags); + + change_protection(dst_vma, start, start + len, newprot, + enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE); + + err = 0; +out_unlock: + up_read(&dst_mm->mmap_sem); + return err; +} From 63b2d4174c4ad1f40b48d7138e71bcb564c1fe03 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Mon, 6 Apr 2020 20:06:12 -0700 Subject: [PATCH 045/158] userfaultfd: wp: add the writeprotect API to userfaultfd ioctl Introduce the new uffd-wp APIs for userspace. 
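A minimal userspace sketch of the two steps described below, assuming uffd is a userfaultfd(2) descriptor that has already completed the UFFDIO_API handshake and addr/len are placeholders (error handling omitted):

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);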
Firstly, we allow UFFDIO_REGISTER with write protection tracking using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the userspace program can not only resolve missing page faults but also track page data changes along the way. Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page level write protection tracking. Note that the memory region needs to be registered with UFFDIO_REGISTER_MODE_WP before that. [peterx@redhat.com: write up the commit message] [peterx@redhat.com: remove useless block, write commit message, check against VM_MAYWRITE rather than VM_WRITE when register] Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 82 +++++++++++++++++++++++++------- include/uapi/linux/userfaultfd.h | 23 +++++++++ 2 files changed, 89 insertions(+), 16 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index c49bef505775..59e9e399fddb 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -314,8 +314,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, if (!pmd_present(_pmd)) goto out; - if (pmd_trans_huge(_pmd)) + if (pmd_trans_huge(_pmd)) { + if (!pmd_write(_pmd) && (reason & VM_UFFD_WP)) + ret = true; goto out; + } /* * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it @@ -328,6 +331,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, */ if (pte_none(*pte)) ret = true; + if (!pte_write(*pte) && (reason & VM_UFFD_WP)) + ret = true; pte_unmap(pte); out: @@ -1287,10 +1292,13 @@ static __always_inline int validate_range(struct mm_struct *mm, return 0; } -static inline bool vma_can_userfault(struct vm_area_struct *vma) +static inline bool vma_can_userfault(struct vm_area_struct *vma, + unsigned long vm_flags) { - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || - vma_is_shmem(vma); + /* FIXME: add WP support to hugetlbfs and shmem */ + return vma_is_anonymous(vma) || + ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) && + !(vm_flags & VM_UFFD_WP)); } static int userfaultfd_register(struct userfaultfd_ctx *ctx, @@ -1322,15 +1330,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, vm_flags = 0; if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING) vm_flags |= VM_UFFD_MISSING; - if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) vm_flags |= VM_UFFD_WP; - /* - * FIXME: remove the below error constraint by - * implementing the wprotect tracking mode. 
- */ - ret = -EINVAL; - goto out; - } ret = validate_range(mm, &uffdio_register.range.start, uffdio_register.range.len); @@ -1380,7 +1381,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, /* check not compatible vmas */ ret = -EINVAL; - if (!vma_can_userfault(cur)) + if (!vma_can_userfault(cur, vm_flags)) goto out_unlock; /* @@ -1408,6 +1409,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, if (end & (vma_hpagesize - 1)) goto out_unlock; } + if ((vm_flags & VM_UFFD_WP) && !(cur->vm_flags & VM_MAYWRITE)) + goto out_unlock; /* * Check that this vma isn't already owned by a @@ -1437,7 +1440,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, do { cond_resched(); - BUG_ON(!vma_can_userfault(vma)); + BUG_ON(!vma_can_userfault(vma, vm_flags)); BUG_ON(vma->vm_userfaultfd_ctx.ctx && vma->vm_userfaultfd_ctx.ctx != ctx); WARN_ON(!(vma->vm_flags & VM_MAYWRITE)); @@ -1575,7 +1578,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, * provides for more strict behavior to notice * unregistration errors. */ - if (!vma_can_userfault(cur)) + if (!vma_can_userfault(cur, cur->vm_flags)) goto out_unlock; found = true; @@ -1589,7 +1592,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx, do { cond_resched(); - BUG_ON(!vma_can_userfault(vma)); + BUG_ON(!vma_can_userfault(vma, vma->vm_flags)); /* * Nothing to do: this vma is already registered into this @@ -1802,6 +1805,50 @@ out: return ret; } +static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, + unsigned long arg) +{ + int ret; + struct uffdio_writeprotect uffdio_wp; + struct uffdio_writeprotect __user *user_uffdio_wp; + struct userfaultfd_wake_range range; + + if (READ_ONCE(ctx->mmap_changing)) + return -EAGAIN; + + user_uffdio_wp = (struct uffdio_writeprotect __user *) arg; + + if (copy_from_user(&uffdio_wp, user_uffdio_wp, + sizeof(struct uffdio_writeprotect))) + return -EFAULT; + + ret = validate_range(ctx->mm, &uffdio_wp.range.start, + uffdio_wp.range.len); + if (ret) + return ret; + + if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE | + UFFDIO_WRITEPROTECT_MODE_WP)) + return -EINVAL; + if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) && + (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) + return -EINVAL; + + ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, + uffdio_wp.range.len, uffdio_wp.mode & + UFFDIO_WRITEPROTECT_MODE_WP, + &ctx->mmap_changing); + if (ret) + return ret; + + if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) { + range.start = uffdio_wp.range.start; + range.len = uffdio_wp.range.len; + wake_userfault(ctx, &range); + } + return ret; +} + static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -1883,6 +1930,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd, case UFFDIO_ZEROPAGE: ret = userfaultfd_zeropage(ctx, arg); break; + case UFFDIO_WRITEPROTECT: + ret = userfaultfd_writeprotect(ctx, arg); + break; } return ret; } diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 340f23bc251d..95c4a160e5f8 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -52,6 +52,7 @@ #define _UFFDIO_WAKE (0x02) #define _UFFDIO_COPY (0x03) #define _UFFDIO_ZEROPAGE (0x04) +#define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_API (0x3F) /* userfaultfd ioctl ids */ @@ -68,6 +69,8 @@ struct uffdio_copy) #define UFFDIO_ZEROPAGE _IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \ struct uffdio_zeropage) +#define UFFDIO_WRITEPROTECT _IOWR(UFFDIO, 
_UFFDIO_WRITEPROTECT, \ + struct uffdio_writeprotect) /* read() structure */ struct uffd_msg { @@ -232,4 +235,24 @@ struct uffdio_zeropage { __s64 zeropage; }; +struct uffdio_writeprotect { + struct uffdio_range range; +/* + * UFFDIO_WRITEPROTECT_MODE_WP: set the flag to write protect a range, + * unset the flag to undo protection of a range which was previously + * write protected. + * + * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up + * any wait thread after the operation succeeds. + * + * NOTE: Write protecting a region (WP=1) is unrelated to page faults, + * therefore DONTWAKE flag is meaningless with WP=1. Removing write + * protection (WP=0) in response to a page fault wakes the faulting + * task unless DONTWAKE is set. + */ +#define UFFDIO_WRITEPROTECT_MODE_WP ((__u64)1<<0) +#define UFFDIO_WRITEPROTECT_MODE_DONTWAKE ((__u64)1<<1) + __u64 mode; +}; + #endif /* _LINUX_USERFAULTFD_H */ From e06f1e1dd4998ffc9da37f580703b55a93fc4de4 Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Mon, 6 Apr 2020 20:06:16 -0700 Subject: [PATCH 046/158] userfaultfd: wp: enabled write protection in userfaultfd API Now it's safe to enable write protection in userfaultfd API Signed-off-by: Shaohua Li Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Mel Gorman Cc: Hugh Dickins Cc: Johannes Weiner Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mike Kravetz Cc: Pavel Emelyanov Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com Signed-off-by: Linus Torvalds --- include/uapi/linux/userfaultfd.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 95c4a160e5f8..e7e98bde221f 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -19,7 +19,8 @@ * means the userland is reading). */ #define UFFD_API ((__u64)0xAA) -#define UFFD_API_FEATURES (UFFD_FEATURE_EVENT_FORK | \ +#define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \ + UFFD_FEATURE_EVENT_FORK | \ UFFD_FEATURE_EVENT_REMAP | \ UFFD_FEATURE_EVENT_REMOVE | \ UFFD_FEATURE_EVENT_UNMAP | \ @@ -34,7 +35,8 @@ #define UFFD_API_RANGE_IOCTLS \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY | \ - (__u64)1 << _UFFDIO_ZEROPAGE) + (__u64)1 << _UFFDIO_ZEROPAGE | \ + (__u64)1 << _UFFDIO_WRITEPROTECT) #define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ (__u64)1 << _UFFDIO_COPY) From 23080e2783ba45208795faad124eab0e0bdae85a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:20 -0700 Subject: [PATCH 047/158] userfaultfd: wp: don't wake up when doing write protect It does not make sense to try to wake up any waiting thread when we're write-protecting a memory region. Only wake up when resolving a write protected page fault. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . 
Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 59e9e399fddb..6e878d23547c 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1812,6 +1812,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, struct uffdio_writeprotect uffdio_wp; struct uffdio_writeprotect __user *user_uffdio_wp; struct userfaultfd_wake_range range; + bool mode_wp, mode_dontwake; if (READ_ONCE(ctx->mmap_changing)) return -EAGAIN; @@ -1830,18 +1831,20 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE | UFFDIO_WRITEPROTECT_MODE_WP)) return -EINVAL; - if ((uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP) && - (uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) + + mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP; + mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE; + + if (mode_wp && mode_dontwake) return -EINVAL; ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, - uffdio_wp.range.len, uffdio_wp.mode & - UFFDIO_WRITEPROTECT_MODE_WP, + uffdio_wp.range.len, mode_wp, &ctx->mmap_changing); if (ret) return ret; - if (!(uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE)) { + if (!mode_wp && !mode_dontwake) { range.start = uffdio_wp.range.start; range.len = uffdio_wp.range.len; wake_userfault(ctx, &range); From 57e5d4f278b9522646b49a3a97ebf5f2b8f9d4cf Mon Sep 17 00:00:00 2001 From: Martin Cracauer Date: Mon, 6 Apr 2020 20:06:24 -0700 Subject: [PATCH 048/158] userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update Add documentation about the write protection support. [peterx@redhat.com: rewrite in rst format; fixups here and there] Signed-off-by: Martin Cracauer Signed-off-by: Andrea Arcangeli Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Jerome Glisse Reviewed-by: Mike Rapoport Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-17-peterx@redhat.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/userfaultfd.rst | 51 ++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 5048cf661a8a..c30176e67900 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -108,6 +108,57 @@ UFFDIO_COPY. They're atomic as in guaranteeing that nothing can see an half copied page since it'll keep userfaulting until the copy has finished. +Notes: + +- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then + you must provide some kind of page in your thread after reading from + the uffd. You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE. + The normal behavior of the OS automatically providing a zero page on + an annonymous mmaping is not in place. + +- None of the page-delivering ioctls default to the range that you + registered with. 
You must fill in all fields for the appropriate + ioctl struct including the range. + +- You get the address of the access that triggered the missing page + event out of a struct uffd_msg that you read in the thread from the + uffd. You can supply as many pages as you want with UFFDIO_COPY or + UFFDIO_ZEROPAGE. Keep in mind that unless you used DONTWAKE then + the first of any of those IOCTLs wakes up the faulting thread. + +- Be sure to test for all errors including (pollfd[0].revents & + POLLERR). This can happen, e.g. when ranges supplied were + incorrect. + +Write Protect Notifications +--------------------------- + +This is equivalent to (but faster than) using mprotect and a SIGSEGV +signal handler. + +Firstly you need to register a range with UFFDIO_REGISTER_MODE_WP. +Instead of using mprotect(2) you use ioctl(uffd, UFFDIO_WRITEPROTECT, +struct *uffdio_writeprotect) while mode = UFFDIO_WRITEPROTECT_MODE_WP +in the struct passed in. The range does not default to and does not +have to be identical to the range you registered with. You can write +protect as many ranges as you like (inside the registered range). +Then, in the thread reading from uffd the struct will have +msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you send +ioctl(uffd, UFFDIO_WRITEPROTECT, struct *uffdio_writeprotect) again +while pagefault.mode does not have UFFDIO_WRITEPROTECT_MODE_WP set. +This wakes up the thread which will continue to run with writes. This +allows you to do the bookkeeping about the write in the uffd reading +thread before the ioctl. + +If you registered with both UFFDIO_REGISTER_MODE_MISSING and +UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in +which you supply a page and undo write protect. Note that there is a +difference between writes into a WP area and into a !WP area. The +former will have UFFD_PAGEFAULT_FLAG_WP set, the latter +UFFD_PAGEFAULT_FLAG_WRITE. The latter did not fail on protection but +you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was +used. + QEMU/KVM ======== From 14819305e09fe4fda546f0dfa12134c8e5366616 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:29 -0700 Subject: [PATCH 049/158] userfaultfd: wp: declare _UFFDIO_WRITEPROTECT conditionally Only declare _UFFDIO_WRITEPROTECT if the user specified UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the user registers regions with shmem/hugetlbfs we won't expose the new ioctl to them. Even with complete anonymous memory range, we'll only expose the new WP ioctl bit if the register mode has MODE_WP. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 6e878d23547c..e39fdec8a0b0 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1495,14 +1495,24 @@ out_unlock: up_write(&mm->mmap_sem); mmput(mm); if (!ret) { + __u64 ioctls_out; + + ioctls_out = basic_ioctls ? 
UFFD_API_RANGE_IOCTLS_BASIC : + UFFD_API_RANGE_IOCTLS; + + /* + * Declare the WP ioctl only if the WP mode is + * specified and all checks passed with the range + */ + if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_WP)) + ioctls_out &= ~((__u64)1 << _UFFDIO_WRITEPROTECT); + /* * Now that we scanned all vmas we can already tell * userland which ioctls methods are guaranteed to * succeed on this range. */ - if (put_user(basic_ioctls ? UFFD_API_RANGE_IOCTLS_BASIC : - UFFD_API_RANGE_IOCTLS, - &user_uffdio_register->ioctls)) + if (put_user(ioctls_out, &user_uffdio_register->ioctls)) ret = -EFAULT; } out: From 5c8aed6c1b95c3c6de68bd2814611d5d54da5057 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:32 -0700 Subject: [PATCH 050/158] userfaultfd: selftests: refactor statistics Introduce uffd_stats structure for statistics of the self test, at the same time refactor the code to always pass in the uffd_stats for either read() or poll() typed fault handling threads instead of using two different ways to return the statistic results. No functional change. With the new structure, it's very easy to introduce new statistics. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Reviewed-by: Mike Rapoport Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A . Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-19-peterx@redhat.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 76 +++++++++++++++--------- 1 file changed, 49 insertions(+), 27 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index d3362777a425..3911a9ccb0bb 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -86,6 +86,12 @@ static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; pthread_attr_t attr; +/* Userfaultfd test statistics */ +struct uffd_stats { + int cpu; + unsigned long missing_faults; +}; + /* pthread_mutex_t starts at page offset 0 */ #define area_mutex(___area, ___nr) \ ((pthread_mutex_t *) ((___area) + (___nr)*page_size)) @@ -125,6 +131,17 @@ static void usage(void) exit(1); } +static void uffd_stats_reset(struct uffd_stats *uffd_stats, + unsigned long n_cpus) +{ + int i; + + for (i = 0; i < n_cpus; i++) { + uffd_stats[i].cpu = i; + uffd_stats[i].missing_faults = 0; + } +} + static int anon_release_pages(char *rel_area) { int ret = 0; @@ -467,8 +484,8 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg) return 0; } -/* Return 1 if page fault handled by us; otherwise 0 */ -static int uffd_handle_page_fault(struct uffd_msg *msg) +static void uffd_handle_page_fault(struct uffd_msg *msg, + struct uffd_stats *stats) { unsigned long offset; @@ -483,18 +500,19 @@ static int uffd_handle_page_fault(struct uffd_msg *msg) offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; offset &= ~(page_size-1); - return copy_page(uffd, offset); + if (copy_page(uffd, offset)) + stats->missing_faults++; } static void *uffd_poll_thread(void *arg) { - unsigned long cpu = (unsigned long) arg; + struct uffd_stats *stats = (struct uffd_stats *)arg; + unsigned long cpu = stats->cpu; struct pollfd pollfd[2]; struct uffd_msg msg; struct uffdio_register 
uffd_reg; int ret; char tmp_chr; - unsigned long userfaults = 0; pollfd[0].fd = uffd; pollfd[0].events = POLLIN; @@ -524,7 +542,7 @@ static void *uffd_poll_thread(void *arg) msg.event), exit(1); break; case UFFD_EVENT_PAGEFAULT: - userfaults += uffd_handle_page_fault(&msg); + uffd_handle_page_fault(&msg, stats); break; case UFFD_EVENT_FORK: close(uffd); @@ -543,28 +561,27 @@ static void *uffd_poll_thread(void *arg) break; } } - return (void *)userfaults; + + return NULL; } pthread_mutex_t uffd_read_mutex = PTHREAD_MUTEX_INITIALIZER; static void *uffd_read_thread(void *arg) { - unsigned long *this_cpu_userfaults; + struct uffd_stats *stats = (struct uffd_stats *)arg; struct uffd_msg msg; - this_cpu_userfaults = (unsigned long *) arg; - *this_cpu_userfaults = 0; - pthread_mutex_unlock(&uffd_read_mutex); /* from here cancellation is ok */ for (;;) { if (uffd_read_msg(uffd, &msg)) continue; - (*this_cpu_userfaults) += uffd_handle_page_fault(&msg); + uffd_handle_page_fault(&msg, stats); } - return (void *)NULL; + + return NULL; } static void *background_thread(void *arg) @@ -580,13 +597,12 @@ static void *background_thread(void *arg) return NULL; } -static int stress(unsigned long *userfaults) +static int stress(struct uffd_stats *uffd_stats) { unsigned long cpu; pthread_t locking_threads[nr_cpus]; pthread_t uffd_threads[nr_cpus]; pthread_t background_threads[nr_cpus]; - void **_userfaults = (void **) userfaults; finished = 0; for (cpu = 0; cpu < nr_cpus; cpu++) { @@ -595,12 +611,13 @@ static int stress(unsigned long *userfaults) return 1; if (bounces & BOUNCE_POLL) { if (pthread_create(&uffd_threads[cpu], &attr, - uffd_poll_thread, (void *)cpu)) + uffd_poll_thread, + (void *)&uffd_stats[cpu])) return 1; } else { if (pthread_create(&uffd_threads[cpu], &attr, uffd_read_thread, - &_userfaults[cpu])) + (void *)&uffd_stats[cpu])) return 1; pthread_mutex_lock(&uffd_read_mutex); } @@ -637,7 +654,8 @@ static int stress(unsigned long *userfaults) fprintf(stderr, "pipefd write error\n"); return 1; } - if (pthread_join(uffd_threads[cpu], &_userfaults[cpu])) + if (pthread_join(uffd_threads[cpu], + (void *)&uffd_stats[cpu])) return 1; } else { if (pthread_cancel(uffd_threads[cpu])) @@ -908,11 +926,11 @@ static int userfaultfd_events_test(void) { struct uffdio_register uffdio_register; unsigned long expected_ioctls; - unsigned long userfaults; pthread_t uffd_mon; int err, features; pid_t pid; char c; + struct uffd_stats stats = { 0 }; printf("testing events (fork, remap, remove): "); fflush(stdout); @@ -939,7 +957,7 @@ static int userfaultfd_events_test(void) "unexpected missing ioctl for anon memory\n"), exit(1); - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL)) + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) perror("uffd_poll_thread create"), exit(1); pid = fork(); @@ -955,13 +973,13 @@ static int userfaultfd_events_test(void) if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) perror("pipe write"), exit(1); - if (pthread_join(uffd_mon, (void **)&userfaults)) + if (pthread_join(uffd_mon, NULL)) return 1; close(uffd); - printf("userfaults: %ld\n", userfaults); + printf("userfaults: %ld\n", stats.missing_faults); - return userfaults != nr_pages; + return stats.missing_faults != nr_pages; } static int userfaultfd_sig_test(void) @@ -973,6 +991,7 @@ static int userfaultfd_sig_test(void) int err, features; pid_t pid; char c; + struct uffd_stats stats = { 0 }; printf("testing signal delivery: "); fflush(stdout); @@ -1004,7 +1023,7 @@ static int userfaultfd_sig_test(void) if 
(uffd_test_ops->release_pages(area_dst)) return 1; - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, NULL)) + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) perror("uffd_poll_thread create"), exit(1); pid = fork(); @@ -1030,6 +1049,7 @@ static int userfaultfd_sig_test(void) close(uffd); return userfaults != 0; } + static int userfaultfd_stress(void) { void *area; @@ -1038,7 +1058,7 @@ static int userfaultfd_stress(void) struct uffdio_register uffdio_register; unsigned long cpu; int err; - unsigned long userfaults[nr_cpus]; + struct uffd_stats uffd_stats[nr_cpus]; uffd_test_ops->allocate_area((void **)&area_src); if (!area_src) @@ -1167,8 +1187,10 @@ static int userfaultfd_stress(void) if (uffd_test_ops->release_pages(area_dst)) return 1; + uffd_stats_reset(uffd_stats, nr_cpus); + /* bounce pass */ - if (stress(userfaults)) + if (stress(uffd_stats)) return 1; /* unregister */ @@ -1211,7 +1233,7 @@ static int userfaultfd_stress(void) printf("userfaults:"); for (cpu = 0; cpu < nr_cpus; cpu++) - printf(" %lu", userfaults[cpu]); + printf(" %lu", uffd_stats[cpu].missing_faults); printf("\n"); } From 9b12488a7711b9aa2d0915f6949a8ad2069eb072 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 6 Apr 2020 20:06:36 -0700 Subject: [PATCH 051/158] userfaultfd: selftests: add write-protect test Add uffd tests for write protection. Instead of introducing new tests for it, let's simply squash the uffd-wp tests into the existing uffd-missing test cases. Changes are: (1) Bouncing tests We do the write-protection in two ways during the bouncing test: - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then we'll make sure that, for each bounce process, every single page will fault at least twice: once for MISSING, once for WP. - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory: To further torture the explicit page protection procedures of uffd-wp, we split each bounce procedure into two halves (in the background thread): the first half will be MISSING+WP for each page as explained above. After the first half, we write protect the faulted region in the background thread to make sure at least half of the pages will be write protected again, which is the first test of the new UFFDIO_WRITEPROTECT call. Then we continue with the 2nd half, which will contain both MISSING and WP faulting tests for the 2nd half and WP-only faults from the 1st half. (2) Event/Signal test Mostly the previous tests, but now doing MISSING+WP for each page. For the sigbus-mode test we'll need to provide a standalone path to handle the write protection faults. For all tests, also keep statistics for uffd-wp pages. Signed-off-by: Peter Xu Signed-off-by: Andrew Morton Cc: Andrea Arcangeli Cc: Bobby Powers Cc: Brian Geffon Cc: David Hildenbrand Cc: Denis Plotnikov Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Johannes Weiner Cc: "Kirill A .
Shutemov" Cc: Martin Cracauer Cc: Marty McFadden Cc: Maya Gokhale Cc: Mel Gorman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Pavel Emelyanov Cc: Rik van Riel Cc: Shaohua Li Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 157 +++++++++++++++++++---- 1 file changed, 133 insertions(+), 24 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 3911a9ccb0bb..61e5cfeb1350 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -54,6 +54,7 @@ #include #include #include +#include #include "../kselftest.h" @@ -76,6 +77,8 @@ static int test_type; #define ALARM_INTERVAL_SECS 10 static volatile bool test_uffdio_copy_eexist = true; static volatile bool test_uffdio_zeropage_eexist = true; +/* Whether to test uffd write-protection */ +static bool test_uffdio_wp = false; static bool map_shared; static int huge_fd; @@ -90,6 +93,7 @@ pthread_attr_t attr; struct uffd_stats { int cpu; unsigned long missing_faults; + unsigned long wp_faults; }; /* pthread_mutex_t starts at page offset 0 */ @@ -139,9 +143,29 @@ static void uffd_stats_reset(struct uffd_stats *uffd_stats, for (i = 0; i < n_cpus; i++) { uffd_stats[i].cpu = i; uffd_stats[i].missing_faults = 0; + uffd_stats[i].wp_faults = 0; } } +static void uffd_stats_report(struct uffd_stats *stats, int n_cpus) +{ + int i; + unsigned long long miss_total = 0, wp_total = 0; + + for (i = 0; i < n_cpus; i++) { + miss_total += stats[i].missing_faults; + wp_total += stats[i].wp_faults; + } + + printf("userfaults: %llu missing (", miss_total); + for (i = 0; i < n_cpus; i++) + printf("%lu+", stats[i].missing_faults); + printf("\b), %llu wp (", wp_total); + for (i = 0; i < n_cpus; i++) + printf("%lu+", stats[i].wp_faults); + printf("\b)\n"); +} + static int anon_release_pages(char *rel_area) { int ret = 0; @@ -262,10 +286,15 @@ struct uffd_test_ops { void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset); }; -#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \ +#define SHMEM_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \ (1 << _UFFDIO_COPY) | \ (1 << _UFFDIO_ZEROPAGE)) +#define ANON_EXPECTED_IOCTLS ((1 << _UFFDIO_WAKE) | \ + (1 << _UFFDIO_COPY) | \ + (1 << _UFFDIO_ZEROPAGE) | \ + (1 << _UFFDIO_WRITEPROTECT)) + static struct uffd_test_ops anon_uffd_test_ops = { .expected_ioctls = ANON_EXPECTED_IOCTLS, .allocate_area = anon_allocate_area, @@ -274,7 +303,7 @@ static struct uffd_test_ops anon_uffd_test_ops = { }; static struct uffd_test_ops shmem_uffd_test_ops = { - .expected_ioctls = ANON_EXPECTED_IOCTLS, + .expected_ioctls = SHMEM_EXPECTED_IOCTLS, .allocate_area = shmem_allocate_area, .release_pages = shmem_release_pages, .alias_mapping = noop_alias_mapping, @@ -298,6 +327,21 @@ static int my_bcmp(char *str1, char *str2, size_t n) return 0; } +static void wp_range(int ufd, __u64 start, __u64 len, bool wp) +{ + struct uffdio_writeprotect prms = { 0 }; + + /* Write protection page faults */ + prms.range.start = start; + prms.range.len = len; + /* Undo write-protect, do wakeup after that */ + prms.mode = wp ? 
UFFDIO_WRITEPROTECT_MODE_WP : 0; + + if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) + fprintf(stderr, "clear WP failed for address 0x%Lx\n", + start), exit(1); +} + static void *locking_thread(void *arg) { unsigned long cpu = (unsigned long) arg; @@ -436,7 +480,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry) uffdio_copy.dst = (unsigned long) area_dst + offset; uffdio_copy.src = (unsigned long) area_src + offset; uffdio_copy.len = page_size; - uffdio_copy.mode = 0; + if (test_uffdio_wp) + uffdio_copy.mode = UFFDIO_COPY_MODE_WP; + else + uffdio_copy.mode = 0; uffdio_copy.copy = 0; if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) { /* real retval in ufdio_copy.copy */ @@ -493,15 +540,21 @@ static void uffd_handle_page_fault(struct uffd_msg *msg, fprintf(stderr, "unexpected msg event %u\n", msg->event), exit(1); - if (bounces & BOUNCE_VERIFY && - msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) - fprintf(stderr, "unexpected write fault\n"), exit(1); + if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) { + wp_range(uffd, msg->arg.pagefault.address, page_size, false); + stats->wp_faults++; + } else { + /* Missing page faults */ + if (bounces & BOUNCE_VERIFY && + msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) + fprintf(stderr, "unexpected write fault\n"), exit(1); - offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; - offset &= ~(page_size-1); + offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; + offset &= ~(page_size-1); - if (copy_page(uffd, offset)) - stats->missing_faults++; + if (copy_page(uffd, offset)) + stats->missing_faults++; + } } static void *uffd_poll_thread(void *arg) @@ -587,11 +640,30 @@ static void *uffd_read_thread(void *arg) static void *background_thread(void *arg) { unsigned long cpu = (unsigned long) arg; - unsigned long page_nr; + unsigned long page_nr, start_nr, mid_nr, end_nr; - for (page_nr = cpu * nr_pages_per_cpu; - page_nr < (cpu+1) * nr_pages_per_cpu; - page_nr++) + start_nr = cpu * nr_pages_per_cpu; + end_nr = (cpu+1) * nr_pages_per_cpu; + mid_nr = (start_nr + end_nr) / 2; + + /* Copy the first half of the pages */ + for (page_nr = start_nr; page_nr < mid_nr; page_nr++) + copy_page_retry(uffd, page_nr * page_size); + + /* + * If we need to test uffd-wp, set it up now. 
Then we'll have + * at least the first half of the pages mapped already which + * can be write-protected for testing + */ + if (test_uffdio_wp) + wp_range(uffd, (unsigned long)area_dst + start_nr * page_size, + nr_pages_per_cpu * page_size, true); + + /* + * Continue the 2nd half of the page copying, handling write + * protection faults if any + */ + for (page_nr = mid_nr; page_nr < end_nr; page_nr++) copy_page_retry(uffd, page_nr * page_size); return NULL; @@ -753,17 +825,31 @@ static int faulting_process(int signal_test) } for (nr = 0; nr < split_nr_pages; nr++) { + int steps = 1; + unsigned long offset = nr * page_size; + if (signal_test) { if (sigsetjmp(*sigbuf, 1) != 0) { - if (nr == lastnr) { + if (steps == 1 && nr == lastnr) { fprintf(stderr, "Signal repeated\n"); return 1; } lastnr = nr; if (signal_test == 1) { - if (copy_page(uffd, nr * page_size)) - signalled++; + if (steps == 1) { + /* This is a MISSING request */ + steps++; + if (copy_page(uffd, offset)) + signalled++; + } else { + /* This is a WP request */ + assert(steps == 2); + wp_range(uffd, + (__u64)area_dst + + offset, + page_size, false); + } } else { signalled++; continue; @@ -776,8 +862,13 @@ static int faulting_process(int signal_test) fprintf(stderr, "nr %lu memory corruption %Lu %Lu\n", nr, count, - count_verify[nr]), exit(1); - } + count_verify[nr]); + } + /* + * Trigger write protection if there is by writting + * the same value back. + */ + *area_count(area_dst, nr) = count; } if (signal_test) @@ -799,6 +890,11 @@ static int faulting_process(int signal_test) nr, count, count_verify[nr]), exit(1); } + /* + * Trigger write protection if there is by writting + * the same value back. + */ + *area_count(area_dst, nr) = count; } if (uffd_test_ops->release_pages(area_dst)) @@ -902,6 +998,8 @@ static int userfaultfd_zeropage_test(void) uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; + if (test_uffdio_wp) + uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) fprintf(stderr, "register failure\n"), exit(1); @@ -947,6 +1045,8 @@ static int userfaultfd_events_test(void) uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; + if (test_uffdio_wp) + uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) fprintf(stderr, "register failure\n"), exit(1); @@ -977,7 +1077,8 @@ static int userfaultfd_events_test(void) return 1; close(uffd); - printf("userfaults: %ld\n", stats.missing_faults); + + uffd_stats_report(&stats, 1); return stats.missing_faults != nr_pages; } @@ -1007,6 +1108,8 @@ static int userfaultfd_sig_test(void) uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; + if (test_uffdio_wp) + uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) fprintf(stderr, "register failure\n"), exit(1); @@ -1139,6 +1242,8 @@ static int userfaultfd_stress(void) uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; + if (test_uffdio_wp) + uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { fprintf(stderr, "register failure\n"); 
return 1; @@ -1193,6 +1298,11 @@ static int userfaultfd_stress(void) if (stress(uffd_stats)) return 1; + /* Clear all the write protections if there is any */ + if (test_uffdio_wp) + wp_range(uffd, (unsigned long)area_dst, + nr_pages * page_size, false); + /* unregister */ if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) { fprintf(stderr, "unregister failure\n"); @@ -1231,10 +1341,7 @@ static int userfaultfd_stress(void) area_src_alias = area_dst_alias; area_dst_alias = tmp_area; - printf("userfaults:"); - for (cpu = 0; cpu < nr_cpus; cpu++) - printf(" %lu", uffd_stats[cpu].missing_faults); - printf("\n"); + uffd_stats_report(uffd_stats, nr_cpus); } if (err) @@ -1274,6 +1381,8 @@ static void set_test_type(const char *type) if (!strcmp(type, "anon")) { test_type = TEST_ANON; uffd_test_ops = &anon_uffd_test_ops; + /* Only enable write-protect test for anonymous test */ + test_uffdio_wp = true; } else if (!strcmp(type, "hugetlb")) { test_type = TEST_HUGETLB; uffd_test_ops = &hugetlb_uffd_test_ops; From 68c3a6ac65f675b4b783635787fa0ed896f5b3d5 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:06:40 -0700 Subject: [PATCH 052/158] drivers/base/memory.c: drop section_count Patch series "mm: drop superfluous section checks when onlining/offlining". Let's drop some superfluous section checks on the onlining/offlining path. This patch (of 3): Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes") we have a generic check in offline_pages() that disallows offlining memory blocks with holes. Memory blocks with missing sections are just another variant of this type of block. We can stop checking (and especially storing) present sections. A proper error message is now printed to explain why offlining failed. section_count was initially introduced in commit 07681215975e ("Driver core: Add section count to memory_block struct") in order to detect when it is okay to remove a memory block. It was used in commit 26bbe7ef6d5c ("drivers/base/memory.c: prohibit offlining of memory blocks with missing sections") to disallow offlining memory blocks with missing sections. As we refactored creation/removal of memory devices and have a proper check for holes in place, we can drop the section_count. This also removes a leftover comment regarding the mem_sysfs_mutex, which was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the mem_sysfs_mutex"). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Cc: Greg Kroah-Hartman Cc: "Rafael J.
Wysocki" Cc: Michal Hocko Cc: Dan Williams Cc: Pavel Tatashin Cc: Anshuman Khandual Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 17 +++-------------- include/linux/memory.h | 1 - 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 4086718f6876..086997212dbb 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -267,10 +267,6 @@ static int memory_subsys_offline(struct device *dev) if (mem->state == MEM_OFFLINE) return 0; - /* Can't offline block with non-present sections */ - if (mem->section_count != sections_per_block) - return -EINVAL; - return memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE); } @@ -627,7 +623,7 @@ static int init_memory_block(struct memory_block **memory, static int add_memory_block(unsigned long base_section_nr) { - int ret, section_count = 0; + int section_count = 0; struct memory_block *mem; unsigned long nr; @@ -638,12 +634,8 @@ static int add_memory_block(unsigned long base_section_nr) if (section_count == 0) return 0; - ret = init_memory_block(&mem, base_memory_block_id(base_section_nr), - MEM_ONLINE); - if (ret) - return ret; - mem->section_count = section_count; - return 0; + return init_memory_block(&mem, base_memory_block_id(base_section_nr), + MEM_ONLINE); } static void unregister_memory(struct memory_block *memory) @@ -679,7 +671,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size) ret = init_memory_block(&mem, block_id, MEM_OFFLINE); if (ret) break; - mem->section_count = sections_per_block; } if (ret) { end_block_id = block_id; @@ -688,7 +679,6 @@ int create_memory_block_devices(unsigned long start, unsigned long size) mem = find_memory_block_by_id(block_id); if (WARN_ON_ONCE(!mem)) continue; - mem->section_count = 0; unregister_memory(mem); } } @@ -717,7 +707,6 @@ void remove_memory_block_devices(unsigned long start, unsigned long size) mem = find_memory_block_by_id(block_id); if (WARN_ON_ONCE(!mem)) continue; - mem->section_count = 0; unregister_memory_block_under_nodes(mem); unregister_memory(mem); } diff --git a/include/linux/memory.h b/include/linux/memory.h index 0b8d791b6669..439a89e758d8 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -26,7 +26,6 @@ struct memory_block { unsigned long start_section_nr; unsigned long state; /* serialized by the dev->lock */ - int section_count; /* serialized by mem_sysfs_mutex */ int online_type; /* for passing data to online routine */ int phys_device; /* to which fru does this belong? */ struct device dev; From fada9ae3edeb4175e0f0ac9b369333806dcffbaa Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:06:43 -0700 Subject: [PATCH 053/158] drivers/base/memory.c: drop pages_correctly_probed() pages_correctly_probed() is a leftover from ancient times. It dates back to commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove functions"), where Pg_reserved checks were added as a sfety net: /* * The probe routines leave the pages reserved, just * as the bootmem code does. Make sure they're still * that way. */ The checks were refactored quite a bit over the years, especially in commit b77eab7079d9 ("mm/memory_hotplug: optimize probe routine"), where checks for present, valid, and online sections were added. Hotplugged memory is added via add_memory(), which will create the full memmap for the hotplugged memory, and mark all sections valid and present. 
Only full memory blocks are onlined/offlined, so we also cannot have an inconsistency in that regard (especially, memory blocks with some sections being online and some being offline). 1. Boot memory always starts online. Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes") we disallow offlining any memory with holes. Therefore, we never online memory with holes. Present and validity checks are superfluous. 2. Only complete memory blocks are onlined/offlined (and especially, the state - online or offline - is stored for whole memory blocks). Besides the core, only arch/powerpc/platforms/powernv/memtrace.c manually calls offline_pages() and fiddles with memory block states. But it also only offlines complete memory blocks. 3. To make any of these conditions trigger, something would have to be terribly messed up in the core. (e.g., online/offline only some sections of a memory block). 4. Memory unplug properly makes sure that all sysfs attributes were removed (and therefore, that all threads left the sysfs handlers). We don't have to worry about zombie devices at this point. 5. The valid_section_nr(section_nr) check is actually dead code, as it would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)). No wonder we haven't seen any of these errors in a long time (or even ever, according to my search). Let's just get rid of them. Now, all checks that could hinder onlining and offlining are completely contained in online_pages()/offline_pages(). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: Andrew Morton Cc: Michal Hocko Cc: Dan Williams Cc: Pavel Tatashin Cc: Anshuman Khandual Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 42 ------------------------------------------ 1 file changed, 42 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 086997212dbb..96c80dfaac90 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -144,45 +144,6 @@ int memory_notify(unsigned long val, void *v) return blocking_notifier_call_chain(&memory_chain, val, v); } -/* - * The probe routines leave the pages uninitialized, just as the bootmem code - * does. Make sure we do not access them, but instead use only information from - * within sections. - */ -static bool pages_correctly_probed(unsigned long start_pfn) -{ - unsigned long section_nr = pfn_to_section_nr(start_pfn); - unsigned long section_nr_end = section_nr + sections_per_block; - unsigned long pfn = start_pfn; - - /* - * memmap between sections is not contiguous except with - * SPARSEMEM_VMEMMAP.
We lookup the page once per section - * and assume memmap is contiguous within each section - */ - for (; section_nr < section_nr_end; section_nr++) { - if (WARN_ON_ONCE(!pfn_valid(pfn))) - return false; - - if (!present_section_nr(section_nr)) { - pr_warn("section %ld pfn[%lx, %lx) not present\n", - section_nr, pfn, pfn + PAGES_PER_SECTION); - return false; - } else if (!valid_section_nr(section_nr)) { - pr_warn("section %ld pfn[%lx, %lx) no valid memmap\n", - section_nr, pfn, pfn + PAGES_PER_SECTION); - return false; - } else if (online_section_nr(section_nr)) { - pr_warn("section %ld pfn[%lx, %lx) is already online\n", - section_nr, pfn, pfn + PAGES_PER_SECTION); - return false; - } - pfn += PAGES_PER_SECTION; - } - - return true; -} - /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. @@ -199,9 +160,6 @@ memory_block_action(unsigned long start_section_nr, unsigned long action, switch (action) { case MEM_ONLINE: - if (!pages_correctly_probed(start_pfn)) - return -EBUSY; - ret = online_pages(start_pfn, nr_pages, online_type, nid); break; case MEM_OFFLINE: From dccacf8def2b718528754867b58b37734b59b0d7 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:06:47 -0700 Subject: [PATCH 054/158] mm/page_ext.c: drop pfn_present() check when onlining Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes") we disallow offlining any memory with holes. As all boot memory is online and hotplugged memory cannot contain holes, we never online memory with holes. This present check can be dropped. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Cc: Michal Hocko Cc: Anshuman Khandual Cc: Dan Williams Cc: Greg Kroah-Hartman Cc: Pavel Tatashin Cc: "Rafael J. Wysocki" Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com Signed-off-by: Linus Torvalds --- mm/page_ext.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/mm/page_ext.c b/mm/page_ext.c index 08ded037f89f..a3616f7a0e9e 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -303,11 +303,8 @@ static int __meminit online_page_ext(unsigned long start_pfn, VM_BUG_ON(!node_state(nid, N_ONLINE)); } - for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) { - if (!pfn_in_present_section(pfn)) - continue; + for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) fail = init_section_page_ext(pfn, nid); - } if (!fail) return 0; From f3cd4c865b8ad61ff6dc61feba759a98ca06f039 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:06:50 -0700 Subject: [PATCH 055/158] mm/memory_hotplug.c: only respect mem= parameter during boot stage In commit 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") a global variable max_mem_size was added to store the value parsed from 'mem= ', then checked when a memory region is added. This truly stops those DIMMs from being added into system memory during boot-time. However, it also limits the later memory hotplug functionality. Any DIMM can't be hotplugged any more if its region is beyond the max_mem_size. We will get errors like: [ 216.387164] acpi PNP0C80:02: add_memory failed [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error [ 216.392187] acpi PNP0C80:02: Enumeration failure This causes an issue in a known use case where 'mem=' is added to the hypervisor. The memory that lies after the 'mem=' boundary will be assigned to KVM guests.
After commit 357b4da50a62 was merged, memory can't be extended dynamically if system memory on the hypervisor is not sufficient. So fix it by applying the 'mem=' restriction only when memory is added during boot; otherwise, skip the restriction. Also add this use case to the documentation of the 'mem=' kernel parameter. Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: Juergen Gross Acked-by: Michal Hocko Cc: Ingo Molnar Cc: William Kucharski Cc: David Hildenbrand Cc: Balbir Singh Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 13 +++++++++++-- mm/memory_hotplug.c | 8 +++++++- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 2d31d811a52d..86aae1fa099a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2573,13 +2573,22 @@ For details see: Documentation/admin-guide/hw-vuln/mds.rst mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory - Amount of memory to be used when the kernel is not able - to see the whole system memory or for test. + Amount of memory to be used in cases as follows: + + 1 for test; + 2 when the kernel is not able to see the whole system memory; + 3 memory that lies after 'mem=' boundary is excluded from + the hypervisor, then assigned to KVM guests. + [X86] Work as limiting max address. Use together with memmap= to avoid physical address space collisions. Without memmap= PCI devices could be placed at addresses belonging to unused RAM. + Note that this only takes effects during boot time since + in above case 3, memory may need be hot added after boot + if system memory of hypervisor is not sufficient. + mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel memory. diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 005eab3411e5..d6f813cc12c1 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -105,7 +105,13 @@ static struct resource *register_memory_resource(u64 start, u64 size) unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; char *resource_name = "System RAM"; - if (start + size > max_mem_size) + /* + * Make sure value parsed from 'mem=' only restricts memory adding + * while booting, so that memory hotplug won't be impacted. Please + * refer to document of 'mem=' in kernel-parameters.txt for more + * details. + */ + if (start + size > max_mem_size && system_state < SYSTEM_RUNNING) return ERR_PTR(-E2BIG); /* From a11b9419243b9c5c7c5778d623926d9a12f68d15 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:06:53 -0700 Subject: [PATCH 056/158] mm/memory_hotplug.c: simplify calculation of number of pages in __remove_pages() In commit 52fb87c81f11 ("mm/memory_hotplug: cleanup __remove_pages()"), we cleaned up __remove_pages(), and introduced a shorter variant to calculate the number of pages to the next section boundary. Turns out we can make this calculation easier to read. We always want to have the number of pages (> 0) to the next section boundary, starting from the current pfn. We'll clean up __remove_pages() in a follow-up patch and directly make use of this computation.
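For reference, a small userspace sketch (not kernel code; the section size of 32768 pages, i.e. 128 MiB with 4 KiB pages, is an assumed example value) checking that the new expression agrees with the old one for the number of pages up to the next section boundary:

	#include <assert.h>
	#include <stdio.h>

	#define PAGES_PER_SECTION 32768UL /* assumed: 2^15 */
	#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION - 1))
	#define SECTION_ALIGN_UP(pfn) \
		(((pfn) + PAGES_PER_SECTION - 1) & PAGE_SECTION_MASK)

	int main(void)
	{
		unsigned long pfn;

		for (pfn = 0; pfn < 4 * PAGES_PER_SECTION; pfn++) {
			/* old expression from 52fb87c81f11 */
			unsigned long old = -(pfn | PAGE_SECTION_MASK);
			/* new, easier to read expression */
			unsigned long new = SECTION_ALIGN_UP(pfn + 1) - pfn;

			assert(old == new);
			assert(new >= 1 && new <= PAGES_PER_SECTION);
		}
		printf("both expressions agree\n");
		return 0;
	}

Both expressions yield a count in [1, PAGES_PER_SECTION]: the distance from the current pfn to the next section boundary.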
Suggested-by: Segher Boessenkool Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Wei Yang Cc: Oscar Salvador Cc: Michal Hocko Cc: Dan Williams Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d6f813cc12c1..df748b783418 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -534,7 +534,8 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages, for (; pfn < end_pfn; pfn += cur_nr_pages) { cond_resched(); /* Select all remaining pages up to the next section boundary */ - cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK)); + cur_nr_pages = min(end_pfn - pfn, + SECTION_ALIGN_UP(pfn + 1) - pfn); __remove_section(pfn, cur_nr_pages, map_offset, altmap); map_offset = 0; } From 6cdd0b30a920b35d901c8ca6b82e9ca4f44f54d6 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:06:56 -0700 Subject: [PATCH 057/158] mm/memory_hotplug.c: cleanup __add_pages() Let's drop the basically unused section stuff and simplify. The logic now matches the logic in __remove_pages(). Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Reviewed-by: Wei Yang Cc: Segher Boessenkool Cc: Oscar Salvador Cc: Michal Hocko Cc: Dan Williams Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 18 +++++++----------- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index df748b783418..04cb87ec39d0 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -307,8 +307,9 @@ static int check_hotplug_memory_addressable(unsigned long pfn, int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, struct mhp_restrictions *restrictions) { + const unsigned long end_pfn = pfn + nr_pages; + unsigned long cur_nr_pages; int err; - unsigned long nr, start_sec, end_sec; struct vmem_altmap *altmap = restrictions->altmap; err = check_hotplug_memory_addressable(pfn, nr_pages); @@ -331,18 +332,13 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, if (err) return err; - start_sec = pfn_to_section_nr(pfn); - end_sec = pfn_to_section_nr(pfn + nr_pages - 1); - for (nr = start_sec; nr <= end_sec; nr++) { - unsigned long pfns; - - pfns = min(nr_pages, PAGES_PER_SECTION - - (pfn & ~PAGE_SECTION_MASK)); - err = sparse_add_section(nid, pfn, pfns, altmap); + for (; pfn < end_pfn; pfn += cur_nr_pages) { + /* Select all remaining pages up to the next section boundary */ + cur_nr_pages = min(end_pfn - pfn, + SECTION_ALIGN_UP(pfn + 1) - pfn); + err = sparse_add_section(nid, pfn, cur_nr_pages, altmap); if (err) break; - pfn += pfns; - nr_pages -= pfns; cond_resched(); } vmemmap_populate_print_last(); From 5d87255cadde243763ca22b35e01312550114167 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:07:00 -0700 Subject: [PATCH 058/158] mm/sparse.c: introduce new function fill_subsection_map() Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4. Memory sub-section hotplug was added to fix the issue that nvdimm could be mapped at non-section aligned starting address. A subsection map is added into struct mem_section_usage to implement it. However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. 
It means the subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For the classic sparse, the subsection map is meaningless and confusing. About the classic sparse which doesn't support subsection hotplug, Dan said it's more because the effort and maintenance burden outweighs the benefit. Besides, the current 64 bit ARCHes all enable SPARSEMEM_VMEMMAP_ENABLE by default. This patch (of 5): Factor out the code that fills the subsection map from section_activate() into fill_subsection_map(); this makes section_activate() cleaner and easier to follow. Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: David Hildenbrand Acked-by: Pankaj Gupta Cc: Dan Williams Cc: Michal Hocko Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/sparse.c | 32 +++++++++++++++++++++----------- 1 file changed, 21 insertions(+), 11 deletions(-) diff --git a/mm/sparse.c b/mm/sparse.c index f1af4d4ee80b..51965d56cd39 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -777,24 +777,15 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->section_mem_map = (unsigned long)NULL; } -static struct page * __meminit section_activate(int nid, unsigned long pfn, - unsigned long nr_pages, struct vmem_altmap *altmap) +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) { - DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; struct mem_section *ms = __pfn_to_section(pfn); - struct mem_section_usage *usage = NULL; + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; unsigned long *subsection_map; - struct page *memmap; int rc = 0; subsection_mask_set(map, pfn, nr_pages); - if (!ms->usage) { - usage = kzalloc(mem_section_usage_size(), GFP_KERNEL); - if (!usage) - return ERR_PTR(-ENOMEM); - ms->usage = usage; - } subsection_map = &ms->usage->subsection_map[0]; if (bitmap_empty(map, SUBSECTIONS_PER_SECTION)) @@ -805,6 +796,25 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn, bitmap_or(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); + return rc; +} + +static struct page * __meminit section_activate(int nid, unsigned long pfn, + unsigned long nr_pages, struct vmem_altmap *altmap) +{ + struct mem_section *ms = __pfn_to_section(pfn); + struct mem_section_usage *usage = NULL; + struct page *memmap; + int rc = 0; + + if (!ms->usage) { + usage = kzalloc(mem_section_usage_size(), GFP_KERNEL); + if (!usage) + return ERR_PTR(-ENOMEM); + ms->usage = usage; + } + + rc = fill_subsection_map(pfn, nr_pages); if (rc) { if (usage) ms->usage = NULL; From 37bc15020a96035cb157be5864b958672fe02d7c Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:07:03 -0700 Subject: [PATCH 059/158] mm/sparse.c: introduce a new function clear_subsection_map() Factor out the code which clears the subsection map of one memory region from section_deactivate() into clear_subsection_map(). Also add a helper function is_subsection_map_empty() to check whether the current subsection map is empty or not.
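As an illustration, a standalone sketch of the factored-out bitmap logic (assumed toy values: one 64-bit map per section and 512 pages per subsection; not the kernel implementation):

	#include <stdint.h>
	#include <stdio.h>

	#define PAGES_PER_SUBSECTION 512UL /* assumed: 2 MiB / 4 KiB */

	/* bitmap covering the subsections backing [pfn, pfn + nr_pages) */
	static uint64_t subsection_mask(unsigned long pfn, unsigned long nr_pages)
	{
		unsigned long first = (pfn / PAGES_PER_SUBSECTION) % 64;
		unsigned long nr = nr_pages / PAGES_PER_SUBSECTION;

		return (nr >= 64 ? ~0ULL : (1ULL << nr) - 1) << first;
	}

	/* refuse unless every requested subsection is currently active */
	static int clear_subsection_map(uint64_t *map, unsigned long pfn,
					unsigned long nr_pages)
	{
		uint64_t mask = subsection_mask(pfn, nr_pages);

		if ((*map & mask) != mask)
			return -1; /* already deactivated */
		*map ^= mask;      /* all masked bits set, so XOR clears them */
		return 0;
	}

	static int is_subsection_map_empty(uint64_t map)
	{
		return map == 0;
	}

	int main(void)
	{
		uint64_t map = subsection_mask(0, 2 * PAGES_PER_SUBSECTION);

		printf("clear: %d\n",
		       clear_subsection_map(&map, 0, 2 * PAGES_PER_SUBSECTION));
		printf("again: %d\n",
		       clear_subsection_map(&map, 0, 2 * PAGES_PER_SUBSECTION));
		printf("empty: %d\n", is_subsection_map_empty(map));
		return 0;
	}

The first call succeeds and empties the map; the second fails, mirroring the "section already deactivated" warning in the real code.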
Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Pankaj Gupta Cc: Dan Williams Cc: Michal Hocko Cc: Wei Yang Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/sparse.c | 31 +++++++++++++++++++++++-------- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/mm/sparse.c b/mm/sparse.c index 51965d56cd39..01204c3b4649 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -705,15 +705,11 @@ static void free_map_bootmem(struct page *memmap) } #endif /* CONFIG_SPARSEMEM_VMEMMAP */ -static void section_deactivate(unsigned long pfn, unsigned long nr_pages, - struct vmem_altmap *altmap) +static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) { DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; struct mem_section *ms = __pfn_to_section(pfn); - bool section_is_early = early_section(ms); - struct page *memmap = NULL; - bool empty; unsigned long *subsection_map = ms->usage ? &ms->usage->subsection_map[0] : NULL; @@ -724,8 +720,28 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION), "section already deactivated (%#lx + %ld)\n", pfn, nr_pages)) - return; + return -EINVAL; + bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); + return 0; +} + +static bool is_subsection_map_empty(struct mem_section *ms) +{ + return bitmap_empty(&ms->usage->subsection_map[0], + SUBSECTIONS_PER_SECTION); +} + +static void section_deactivate(unsigned long pfn, unsigned long nr_pages, + struct vmem_altmap *altmap) +{ + struct mem_section *ms = __pfn_to_section(pfn); + bool section_is_early = early_section(ms); + struct page *memmap = NULL; + bool empty; + + if (clear_subsection_map(pfn, nr_pages)) + return; /* * There are 3 cases to handle across two configurations * (SPARSEMEM_VMEMMAP={y,n}): @@ -743,8 +759,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, * * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified */ - bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); - empty = bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION); + empty = is_subsection_map_empty(ms); if (empty) { unsigned long section_nr = pfn_to_section_nr(pfn); From 0a9f9f62316606ee827fa3318e95a1c489d9acf5 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:07:06 -0700 Subject: [PATCH 060/158] mm/sparse.c: only use subsection map in VMEMMAP case Currently, to support subsection aligned memory region adding for pmem, subsection map is added to track which subsection is present. However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the classic sparse, it's meaningless. Even worse, it may confuse people when checking code related to the classic sparse. About the classic sparse which doesn't support subsection hotplug, Dan said it's more because the effort and maintenance burden outweighs the benefit. Besides, the current 64 bit ARCHes all enable SPARSEMEM_VMEMMAP_ENABLE by default. Combining the above reasons, no need to provide subsection map and the relevant handling for the classic sparse. Let's remove them. 
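The shape of the change, as a compilable toy (assumed names and reduced bodies; flip the define to mimic CONFIG_SPARSEMEM_VMEMMAP=y):

	#include <stdio.h>

	/* #define CONFIG_SPARSEMEM_VMEMMAP 1 */

	#ifdef CONFIG_SPARSEMEM_VMEMMAP
	static int is_subsection_map_empty(unsigned long map)
	{
		return map == 0; /* real code checks the per-section bitmap */
	}
	#else
	static int is_subsection_map_empty(unsigned long map)
	{
		(void)map;
		return 1; /* classic sparse: always tear down the whole section */
	}
	#endif

	int main(void)
	{
		/* callers such as section_deactivate() stay ifdef-free */
		printf("empty: %d\n", is_subsection_map_empty(0x3));
		return 0;
	}

With the option off, the stub always reports the map as empty, so deactivation proceeds at whole-section granularity, which matches what classic sparse supports.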
Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Cc: Dan Williams Cc: Michal Hocko Cc: Pankaj Gupta Cc: Wei Yang Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 ++ mm/sparse.c | 25 +++++++++++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 42b77d3b68e8..f3f264826423 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1143,7 +1143,9 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec) #define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK) struct mem_section_usage { +#ifdef CONFIG_SPARSEMEM_VMEMMAP DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION); +#endif /* See declaration of similar field in struct zone */ unsigned long pageblock_flags[0]; }; diff --git a/mm/sparse.c b/mm/sparse.c index 01204c3b4649..095ecf5bb6d3 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -209,6 +209,7 @@ static inline unsigned long first_present_section_nr(void) return next_present_section_nr(-1); } +#ifdef CONFIG_SPARSEMEM_VMEMMAP static void subsection_mask_set(unsigned long *map, unsigned long pfn, unsigned long nr_pages) { @@ -243,6 +244,11 @@ void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages) nr_pages -= pfns; } } +#else +void __init subsection_map_init(unsigned long pfn, unsigned long nr_pages) +{ +} +#endif /* Record a memory area against a node. */ void __init memory_present(int nid, unsigned long start, unsigned long end) @@ -705,6 +711,7 @@ static void free_map_bootmem(struct page *memmap) } #endif /* CONFIG_SPARSEMEM_VMEMMAP */ +#ifdef CONFIG_SPARSEMEM_VMEMMAP static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) { DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; @@ -731,6 +738,17 @@ static bool is_subsection_map_empty(struct mem_section *ms) return bitmap_empty(&ms->usage->subsection_map[0], SUBSECTIONS_PER_SECTION); } +#else +static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) +{ + return 0; +} + +static bool is_subsection_map_empty(struct mem_section *ms) +{ + return true; +} +#endif static void section_deactivate(unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) @@ -792,6 +810,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->section_mem_map = (unsigned long)NULL; } +#ifdef CONFIG_SPARSEMEM_VMEMMAP static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) { struct mem_section *ms = __pfn_to_section(pfn); @@ -813,6 +832,12 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) return rc; } +#else +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) +{ + return 0; +} +#endif static struct page * __meminit section_activate(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) From 95a5a34dfe22be11bc4af58854585e90124b1db0 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:07:09 -0700 Subject: [PATCH 061/158] mm/sparse.c: add note about only VMEMMAP supporting sub-section hotplug Also note that check_pfn_span() gates the proper alignment and size of a hot-added memory region. And also move the code comments from inside section_deactivate() to above it. The code comments are reasonable for the whole function, and the move makes the code cleaner.
Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Cc: Dan Williams Cc: Pankaj Gupta Cc: Wei Yang Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/sparse.c | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-) diff --git a/mm/sparse.c b/mm/sparse.c index 095ecf5bb6d3..9d43fde1f630 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -750,6 +750,22 @@ static bool is_subsection_map_empty(struct mem_section *ms) } #endif +/* + * To deactivate a memory region, there are 3 cases to handle across + * two configurations (SPARSEMEM_VMEMMAP={y,n}): + * + * 1. deactivation of a partial hot-added section (only possible in + * the SPARSEMEM_VMEMMAP=y case). + * a) section was present at memory init. + * b) section was hot-added post memory init. + * 2. deactivation of a complete hot-added section. + * 3. deactivation of a complete section from memory init. + * + * For 1, when subsection_map does not empty we will not be freeing the + * usage map, but still need to free the vmemmap range. + * + * For 2 and 3, the SPARSEMEM_VMEMMAP={y,n} cases are unified + */ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) { @@ -760,23 +776,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, if (clear_subsection_map(pfn, nr_pages)) return; - /* - * There are 3 cases to handle across two configurations - * (SPARSEMEM_VMEMMAP={y,n}): - * - * 1/ deactivation of a partial hot-added section (only possible - * in the SPARSEMEM_VMEMMAP=y case). - * a/ section was present at memory init - * b/ section was hot-added post memory init - * 2/ deactivation of a complete hot-added section - * 3/ deactivation of a complete section from memory init - * - * For 1/, when subsection_map does not empty we will not be - * freeing the usage map, but still need to free the vmemmap - * range. - * - * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified - */ + empty = is_subsection_map_empty(ms); if (empty) { unsigned long section_nr = pfn_to_section_nr(pfn); @@ -890,6 +890,10 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn, * * This is only intended for hotplug. * + * Note that only VMEMMAP supports sub-section aligned hotplug, + * the proper alignment and size are gated by check_pfn_span(). + * + * * Return: * * 0 - On success. * * -EEXIST - Section has been present. From 6ecb0fc61290e16217a8f6165225b53b193b337c Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Mon, 6 Apr 2020 20:07:13 -0700 Subject: [PATCH 062/158] mm/sparse.c: move subsection_map related functions together No functional change. 
[bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope] Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv Signed-off-by: Baoquan He Signed-off-by: Andrew Morton Cc: Michal Hocko Cc: David Hildenbrand Cc: Wei Yang Cc: Dan Williams Cc: Pankaj Gupta Cc: Stephen Rothwell Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com Signed-off-by: Linus Torvalds --- mm/sparse.c | 114 +++++++++++++++++++++++++--------------------------- 1 file changed, 55 insertions(+), 59 deletions(-) diff --git a/mm/sparse.c b/mm/sparse.c index 9d43fde1f630..1aee5a481571 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -666,6 +666,55 @@ static void free_map_bootmem(struct page *memmap) vmemmap_free(start, end, NULL); } + +static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) +{ + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; + struct mem_section *ms = __pfn_to_section(pfn); + unsigned long *subsection_map = ms->usage + ? &ms->usage->subsection_map[0] : NULL; + + subsection_mask_set(map, pfn, nr_pages); + if (subsection_map) + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION); + + if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION), + "section already deactivated (%#lx + %ld)\n", + pfn, nr_pages)) + return -EINVAL; + + bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); + return 0; +} + +static bool is_subsection_map_empty(struct mem_section *ms) +{ + return bitmap_empty(&ms->usage->subsection_map[0], + SUBSECTIONS_PER_SECTION); +} + +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) +{ + struct mem_section *ms = __pfn_to_section(pfn); + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; + unsigned long *subsection_map; + int rc = 0; + + subsection_mask_set(map, pfn, nr_pages); + + subsection_map = &ms->usage->subsection_map[0]; + + if (bitmap_empty(map, SUBSECTIONS_PER_SECTION)) + rc = -EINVAL; + else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION)) + rc = -EEXIST; + else + bitmap_or(subsection_map, map, subsection_map, + SUBSECTIONS_PER_SECTION); + + return rc; +} #else struct page * __meminit populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap) @@ -709,36 +758,7 @@ static void free_map_bootmem(struct page *memmap) put_page_bootmem(page); } } -#endif /* CONFIG_SPARSEMEM_VMEMMAP */ -#ifdef CONFIG_SPARSEMEM_VMEMMAP -static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) -{ - DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; - DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 }; - struct mem_section *ms = __pfn_to_section(pfn); - unsigned long *subsection_map = ms->usage - ? 
&ms->usage->subsection_map[0] : NULL; - - subsection_mask_set(map, pfn, nr_pages); - if (subsection_map) - bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION); - - if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION), - "section already deactivated (%#lx + %ld)\n", - pfn, nr_pages)) - return -EINVAL; - - bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION); - return 0; -} - -static bool is_subsection_map_empty(struct mem_section *ms) -{ - return bitmap_empty(&ms->usage->subsection_map[0], - SUBSECTIONS_PER_SECTION); -} -#else static int clear_subsection_map(unsigned long pfn, unsigned long nr_pages) { return 0; @@ -748,7 +768,12 @@ static bool is_subsection_map_empty(struct mem_section *ms) { return true; } -#endif + +static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) +{ + return 0; +} +#endif /* CONFIG_SPARSEMEM_VMEMMAP */ /* * To deactivate a memory region, there are 3 cases to handle across @@ -810,35 +835,6 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages, ms->section_mem_map = (unsigned long)NULL; } -#ifdef CONFIG_SPARSEMEM_VMEMMAP -static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) -{ - struct mem_section *ms = __pfn_to_section(pfn); - DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 }; - unsigned long *subsection_map; - int rc = 0; - - subsection_mask_set(map, pfn, nr_pages); - - subsection_map = &ms->usage->subsection_map[0]; - - if (bitmap_empty(map, SUBSECTIONS_PER_SECTION)) - rc = -EINVAL; - else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION)) - rc = -EEXIST; - else - bitmap_or(subsection_map, map, subsection_map, - SUBSECTIONS_PER_SECTION); - - return rc; -} -#else -static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages) -{ - return 0; -} -#endif - static struct page * __meminit section_activate(int nid, unsigned long pfn, unsigned long nr_pages, struct vmem_altmap *altmap) { From 956f8b445061667c3545baa24778f890d1d522f4 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:16 -0700 Subject: [PATCH 063/158] drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINE Patch series "mm/memory_hotplug: allow to specify a default online_type", v3. Distributions nowadays use udev rules ([1] [2]) to specify if and how to online hotplugged memory. The rules seem to get more complex with many special cases. Due to the various special cases, CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug is handled via udev rules. Every time we hotplug memory, the udev rule will come to the same conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of memory in separate memory blocks and wait for memory to get onlined by user space before continuing to add more memory blocks (to not add memory faster than it is getting onlined). This of course slows down the whole memory hotplug process. 
To make the job of distributions easier and to avoid udev rules that get more and more complicated, let's extend the mechanism provided by - /sys/devices/system/memory/auto_online_blocks - "memhp_default_state=" on the kernel cmdline to be able to specify also "online_movable" as well as "online_kernel" === Example /usr/libexec/config-memhotplug === #!/bin/bash VIRT=`systemd-detect-virt --vm` ARCH=`uname -p` sense_virtio_mem() { if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l` if [ $DEVICES != "0" ]; then return 0 fi fi return 1 } if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then echo "Memory hotplug configuration support missing in the kernel" exit 1 fi if grep "memhp_default_state=" /proc/cmdline > /dev/null; then echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)" exit 1 fi if [ $VIRT == "microsoft" ]; then echo "Detected Hyper-V on $ARCH" # Hyper-V wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif sense_virtio_mem; then echo "Detected virtio-mem on $ARCH" # virtio-mem wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then echo "Detected $ARCH" # standby memory should not be onlined automatically ONLINE_TYPE="offline" elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then echo "Detected" $ARCH # PPC64 onlines all hotplugged memory right from the kernel ONLINE_TYPE="offline" elif [ $VIRT == "none" ]; then echo "Detected bare-metal on $ARCH" # Bare metal users expect hotplugged memory to be unpluggable. We assume # that ZONE imbalances on such enterprise servers cannot happen and is # properly documented ONLINE_TYPE="online_movable" else # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE # imbalances won't happen echo "Detected $VIRT on $ARCH" # Usually, ballooning is used in virtual environments, so memory should go to # ZONE_NORMAL. However, sometimes "movable_node" is relevant. ONLINE_TYPE="online" fi echo "Selected online_type:" $ONLINE_TYPE # Configure what to do with memory that will be hotplugged in the future echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks if [ $?
!= "0" ]; then echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)" # A backup udev rule should handle old kernels if necessary exit 1 fi # Process all already pluggedd blocks (e.g., DIMMs, but also Hyper-V or virtio-mem) if [ $ONLINE_TYPE != "offline" ]; then for MEMORY in /sys/devices/system/memory/memory*; do STATE=`cat $MEMORY/state` if [ $STATE == "offline" ]; then echo $ONLINE_TYPE > $MEMORY/state fi done fi === Example /usr/lib/systemd/system/config-memhotplug.service === [Unit] Description=Configure memory hotplug behavior DefaultDependencies=no Conflicts=shutdown.target Before=sysinit.target shutdown.target After=systemd-modules-load.service ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks [Service] ExecStart=/usr/libexec/config-memhotplug Type=oneshot TimeoutSec=0 RemainAfterExit=yes [Install] WantedBy=sysinit.target === Example modification to the 40-redhat.rules [2] === : diff --git a/40-redhat.rules b/40-redhat.rules-new : index 2c690e5..168fd03 100644 : --- a/40-redhat.rules : +++ b/40-redhat.rules-new : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online} : # Memory hotadd request : SUBSYSTEM!="memory", GOTO="memory_hotplug_end" : ACTION!="add", GOTO="memory_hotplug_end" : +# memory hotplug behavior configured : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end" : + : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end" : : ENV{.state}="online" === [1] https://github.com/lnykryn/systemd-rhel/pull/281 [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules This patch (of 8): The name is misleading and it's not really clear what is "kept". Let's just name it like the online_type name we expose to user space ("online"). Add some documentation to the types. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Pankaj Gupta Cc: Greg Kroah-Hartman Cc: Michal Hocko Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Wei Yang Cc: Vitaly Kuznetsov Cc: Yumei Huang Cc: Igor Mammedov Cc: Eduardo Habkost Cc: Benjamin Herrenschmidt Cc: Haiyang Zhang Cc: K. Y. Srinivasan Cc: Michael Ellerman (powerpc) Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Wei Liu Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 9 +++++---- include/linux/memory_hotplug.h | 6 +++++- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 96c80dfaac90..156b89b14fcc 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -208,7 +208,7 @@ static int memory_subsys_online(struct device *dev) * attribute and need to set the online_type. 
*/ if (mem->online_type < 0) - mem->online_type = MMOP_ONLINE_KEEP; + mem->online_type = MMOP_ONLINE; ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); @@ -243,7 +243,7 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr, else if (sysfs_streq(buf, "online_movable")) online_type = MMOP_ONLINE_MOVABLE; else if (sysfs_streq(buf, "online")) - online_type = MMOP_ONLINE_KEEP; + online_type = MMOP_ONLINE; else if (sysfs_streq(buf, "offline")) online_type = MMOP_OFFLINE; else { @@ -254,7 +254,7 @@ static ssize_t state_store(struct device *dev, struct device_attribute *attr, switch (online_type) { case MMOP_ONLINE_KERNEL: case MMOP_ONLINE_MOVABLE: - case MMOP_ONLINE_KEEP: + case MMOP_ONLINE: /* mem->online_type is protected by device_hotplug_lock */ mem->online_type = online_type; ret = device_online(&mem->dev); @@ -334,7 +334,8 @@ static ssize_t valid_zones_show(struct device *dev, } nid = mem->nid; - default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, nr_pages); + default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn, + nr_pages); strcat(buf, default_zone->name); print_allowed_zone(buf, nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL, diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index f4d59155f3d4..261dbf010d5d 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -47,9 +47,13 @@ enum { /* Types for control the zone type of onlined and offlined memory */ enum { + /* Offline the memory. */ MMOP_OFFLINE = -1, - MMOP_ONLINE_KEEP, + /* Online the memory. Zone depends, see default_zone_for_pfn(). */ + MMOP_ONLINE, + /* Online the memory to ZONE_NORMAL. */ MMOP_ONLINE_KERNEL, + /* Online the memory to ZONE_MOVABLE. */ MMOP_ONLINE_MOVABLE, }; From efc978ad0e05ed6401c7854811750bf55b67f4b9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:20 -0700 Subject: [PATCH 064/158] drivers/base/memory: map MMOP_OFFLINE to 0 Historically, we used the value -1. Just treat 0 as the special case now. Clarify a comment (which was wrong, when we come via device_online() the first time, the online_type would have been 0 / MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the last user of the manual "-1", which didn't use the enum value. This is a preparation to use the online_type as an array index. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Acked-by: Pankaj Gupta Cc: Greg Kroah-Hartman Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Wei Yang Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. Srinivasan" Cc: Michael Ellerman Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 11 ++++------- include/linux/memory_hotplug.h | 2 +- 2 files changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 156b89b14fcc..f65f3d53dc64 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -203,17 +203,14 @@ static int memory_subsys_online(struct device *dev) return 0; /* - * If we are called from state_store(), online_type will be - * set >= 0 Otherwise we were called from the device online - * attribute and need to set the online_type. 
+ * When called via device_online() without configuring the online_type, + * we want to default to MMOP_ONLINE. */ - if (mem->online_type < 0) + if (mem->online_type == MMOP_OFFLINE) mem->online_type = MMOP_ONLINE; ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE); - - /* clear online_type */ - mem->online_type = -1; + mem->online_type = MMOP_OFFLINE; return ret; } diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 261dbf010d5d..c2e06ed5e0e9 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -48,7 +48,7 @@ enum { /* Types for control the zone type of onlined and offlined memory */ enum { /* Offline the memory. */ - MMOP_OFFLINE = -1, + MMOP_OFFLINE = 0, /* Online the memory. Zone depends, see default_zone_for_pfn(). */ MMOP_ONLINE, /* Online the memory to ZONE_NORMAL. */ From 4dc8207bfd45799525f882e1039e63e9438d605e Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:24 -0700 Subject: [PATCH 065/158] drivers/base/memory: store mapping between MMOP_* and string in an array Let's use a simple array which we can reuse soon. While at it, move the string->mmop conversion out of the device hotplug lock. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Acked-by: Pankaj Gupta Cc: Greg Kroah-Hartman Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Wei Yang Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. Srinivasan" Cc: Michael Ellerman Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 38 +++++++++++++++++++++++--------------- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f65f3d53dc64..1c90bdf60d85 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -27,6 +27,24 @@ #define MEMORY_CLASS_NAME "memory" +static const char *const online_type_to_str[] = { + [MMOP_OFFLINE] = "offline", + [MMOP_ONLINE] = "online", + [MMOP_ONLINE_KERNEL] = "online_kernel", + [MMOP_ONLINE_MOVABLE] = "online_movable", +}; + +static int memhp_online_type_from_str(const char *str) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(online_type_to_str); i++) { + if (sysfs_streq(str, online_type_to_str[i])) + return i; + } + return -EINVAL; +} + #define to_memory_block(dev) container_of(dev, struct memory_block, dev) static int sections_per_block; @@ -228,26 +246,17 @@ static int memory_subsys_offline(struct device *dev) static ssize_t state_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { + const int online_type = memhp_online_type_from_str(buf); struct memory_block *mem = to_memory_block(dev); - int ret, online_type; + int ret; + + if (online_type < 0) + return -EINVAL; ret = lock_device_hotplug_sysfs(); if (ret) return ret; - if (sysfs_streq(buf, "online_kernel")) - online_type = MMOP_ONLINE_KERNEL; - else if (sysfs_streq(buf, "online_movable")) - online_type = MMOP_ONLINE_MOVABLE; - else if (sysfs_streq(buf, "online")) - online_type = MMOP_ONLINE; - else if (sysfs_streq(buf, "offline")) - online_type = MMOP_OFFLINE; - else { - ret = -EINVAL; - goto err; - } - switch (online_type) { case MMOP_ONLINE_KERNEL: case MMOP_ONLINE_MOVABLE: @@ -263,7 +272,6 @@ static ssize_t state_store(struct device *dev, struct device_attribute 
*attr, ret = -EINVAL; /* should never happen */ } -err: unlock_device_hotplug(); if (ret < 0) From ed7f9fec8c8f8227ebd1fb69feda60ce4a7df61f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:28 -0700 Subject: [PATCH 066/158] powernv/memtrace: always online added memory blocks Let's always try to online the re-added memory blocks. In case add_memory() already onlined the added memory blocks, the first device_online() call will fail and stop processing the remaining memory blocks. This avoids manually having to check memhp_auto_online. Note: PPC always onlines all hotplugged memory directly from the kernel as well - something that is handled by user space on other architectures. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Greg Kroah-Hartman Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Eduardo Habkost Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. Srinivasan" Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com Signed-off-by: Linus Torvalds --- arch/powerpc/platforms/powernv/memtrace.c | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c index d6d64f8718e6..13b369d2cc45 100644 --- a/arch/powerpc/platforms/powernv/memtrace.c +++ b/arch/powerpc/platforms/powernv/memtrace.c @@ -231,16 +231,10 @@ static int memtrace_online(void) continue; } - /* - * If kernel isn't compiled with the auto online option - * we need to online the memory ourselves. - */ - if (!memhp_auto_online) { - lock_device_hotplug(); - walk_memory_blocks(ent->start, ent->size, NULL, - online_mem_block); - unlock_device_hotplug(); - } + lock_device_hotplug(); + walk_memory_blocks(ent->start, ent->size, NULL, + online_mem_block); + unlock_device_hotplug(); /* * Memory was added successfully so clean up references to it From bc58ebd506c369c26337cf6b1a400af1a25c989c Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:32 -0700 Subject: [PATCH 067/158] hv_balloon: don't check for memhp_auto_online manually We get the MEM_ONLINE notifier call if memory is added right from the kernel via add_memory() or later from user space. Let's get rid of the "ha_waiting" flag - the wait event has an inbuilt mechanism (->done) for that. Initialize the wait event only once and reinitialize before adding memory. Unconditionally call complete() and wait_for_completion_timeout(). If there are no waiters, complete() will only increment ->done - which will be reset by reinit_completion(). If complete() has already been called, wait_for_completion_timeout() will not wait. There is still the chance for a small race between concurrent reinit_completion() and complete(). If complete() wins, we would not wait - which is tolerable (and the race exists in current code as well). Note: We only wait for "some" memory to get onlined, which seems to be good enough for now. [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David] Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Vitaly Kuznetsov Reviewed-by: Baoquan He Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Wei Liu Cc: Oscar Salvador Cc: "Rafael J. 
Wysocki" Cc: Wei Yang Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Greg Kroah-Hartman Cc: Igor Mammedov Cc: Michael Ellerman Cc: Paul Mackerras Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/hv/hv_balloon.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index a02ce43d778d..32e3bc0aa665 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -533,7 +533,6 @@ struct hv_dynmem_device { * State to synchronize hot-add. */ struct completion ol_waitevent; - bool ha_waiting; /* * This thread handles hot-add * requests from the host as well as notifying @@ -634,10 +633,7 @@ static int hv_memory_notifier(struct notifier_block *nb, unsigned long val, switch (val) { case MEM_ONLINE: case MEM_CANCEL_ONLINE: - if (dm_device.ha_waiting) { - dm_device.ha_waiting = false; - complete(&dm_device.ol_waitevent); - } + complete(&dm_device.ol_waitevent); break; case MEM_OFFLINE: @@ -726,8 +722,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, has->covered_end_pfn += processed_pfn; spin_unlock_irqrestore(&dm_device.ha_lock, flags); - init_completion(&dm_device.ol_waitevent); - dm_device.ha_waiting = !memhp_auto_online; + reinit_completion(&dm_device.ol_waitevent); nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn)); ret = add_memory(nid, PFN_PHYS((start_pfn)), @@ -753,15 +748,14 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size, } /* - * Wait for the memory block to be onlined when memory onlining - * is done outside of kernel (memhp_auto_online). Since the hot - * add has succeeded, it is ok to proceed even if the pages in - * the hot added region have not been "onlined" within the - * allowed time. + * Wait for memory to get onlined. If the kernel onlined the + * memory when adding it, this will return directly. Otherwise, + * it will wait for user space to online the memory. This helps + * to avoid adding memory faster than it is getting onlined. As + * adding succeeded, it is ok to proceed even if the memory was + * not onlined in time. */ - if (dm_device.ha_waiting) - wait_for_completion_timeout(&dm_device.ol_waitevent, - 5*HZ); + wait_for_completion_timeout(&dm_device.ol_waitevent, 5 * HZ); post_status(&dm_device); } } @@ -1706,6 +1700,7 @@ static int balloon_probe(struct hv_device *dev, #ifdef CONFIG_MEMORY_HOTPLUG set_online_page_callback(&hv_online_page); + init_completion(&dm_device.ol_waitevent); register_memory_notifier(&hv_memory_nb); #endif From 5a04af1322f0124c6e89870ca4e69d5d0f00b4f8 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:36 -0700 Subject: [PATCH 068/158] mm/memory_hotplug: unexport memhp_auto_online All in-tree users except the mm-core are gone. Let's drop the export. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Acked-by: Pankaj Gupta Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Greg Kroah-Hartman Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. 
Srinivasan" Cc: Michael Ellerman Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 04cb87ec39d0..9691cbb4383e 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -71,7 +71,6 @@ bool memhp_auto_online; #else bool memhp_auto_online = true; #endif -EXPORT_SYMBOL_GPL(memhp_auto_online); static int __init setup_memhp_default_state(char *str) { From 862919e568356cc36288a11b42cd88ec3a7100e9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:40 -0700 Subject: [PATCH 069/158] mm/memory_hotplug: convert memhp_auto_online to store an online_type ... and rename it to memhp_default_online_type. This is a preparation for more detailed default online behavior. Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Acked-by: Pankaj Gupta Cc: Greg Kroah-Hartman Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Wei Yang Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. Srinivasan" Cc: Michael Ellerman Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 10 ++++------ include/linux/memory_hotplug.h | 3 ++- mm/memory_hotplug.c | 11 ++++++----- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 1c90bdf60d85..7d2f829d00d7 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -378,10 +378,8 @@ static DEVICE_ATTR_RO(block_size_bytes); static ssize_t auto_online_blocks_show(struct device *dev, struct device_attribute *attr, char *buf) { - if (memhp_auto_online) - return sprintf(buf, "online\n"); - else - return sprintf(buf, "offline\n"); + return sprintf(buf, "%s\n", + online_type_to_str[memhp_default_online_type]); } static ssize_t auto_online_blocks_store(struct device *dev, @@ -389,9 +387,9 @@ static ssize_t auto_online_blocks_store(struct device *dev, const char *buf, size_t count) { if (sysfs_streq(buf, "online")) - memhp_auto_online = true; + memhp_default_online_type = MMOP_ONLINE; else if (sysfs_streq(buf, "offline")) - memhp_auto_online = false; + memhp_default_online_type = MMOP_OFFLINE; else return -EINVAL; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index c2e06ed5e0e9..c6e090b34c4b 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -117,7 +117,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions); extern u64 max_mem_size; -extern bool memhp_auto_online; +/* Default online_type (MMOP_*) when new memory blocks are added. 
*/ +extern int memhp_default_online_type; /* If movable_node boot option specified */ extern bool movable_node_enabled; static inline bool movable_node_is_enabled(void) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 9691cbb4383e..9436f7e6257a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -67,17 +67,17 @@ void put_online_mems(void) bool movable_node_enabled = false; #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE -bool memhp_auto_online; +int memhp_default_online_type = MMOP_OFFLINE; #else -bool memhp_auto_online = true; +int memhp_default_online_type = MMOP_ONLINE; #endif static int __init setup_memhp_default_state(char *str) { if (!strcmp(str, "online")) - memhp_auto_online = true; + memhp_default_online_type = MMOP_ONLINE; else if (!strcmp(str, "offline")) - memhp_auto_online = false; + memhp_default_online_type = MMOP_OFFLINE; return 1; } @@ -990,6 +990,7 @@ static int check_hotplug_memory_range(u64 start, u64 size) static int online_memory_block(struct memory_block *mem, void *arg) { + mem->online_type = memhp_default_online_type; return device_online(&mem->dev); } @@ -1062,7 +1063,7 @@ int __ref add_memory_resource(int nid, struct resource *res) mem_hotplug_done(); /* online pages if requested */ - if (memhp_auto_online) + if (memhp_default_online_type != MMOP_OFFLINE) walk_memory_blocks(start, size, NULL, online_memory_block); return ret; From 5f47adf762b78cae97de58d9ff01d2d44db09467 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Mon, 6 Apr 2020 20:07:44 -0700 Subject: [PATCH 070/158] mm/memory_hotplug: allow to specify a default online_type For now, distributions implement advanced udev rules to essentially - Don't online any hotplugged memory (s390x) - Online all memory to ZONE_NORMAL (e.g., most virt environments like hyperv) - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken care of (e.g., bare metal, special virt environments) In summary: All memory is usually onlined the same way, however, the kernel always has to ask user space to come up with the same answer. E.g., Hyper-V always waits for a memory block to get onlined before continuing, otherwise it might end up adding memory faster than onlining it, which can result in strange OOM situations. This waiting slows down adding of a bigger amount of memory. Let's allow to specify a default online_type, not just "online" and "offline". This allows distributions to configure the default online_type when booting up and be done with it. We can now specify "offline", "online", "online_movable" and "online_kernel" via - "memhp_default_state=" on the kernel cmdline - /sys/devices/system/memory/auto_online_blocks just like we are able to specify for a single memory block via /sys/devices/system/memory/memoryX/state Signed-off-by: David Hildenbrand Signed-off-by: Andrew Morton Reviewed-by: Wei Yang Reviewed-by: Baoquan He Acked-by: Michal Hocko Acked-by: Pankaj Gupta Cc: Greg Kroah-Hartman Cc: Oscar Salvador Cc: "Rafael J. Wysocki" Cc: Wei Yang Cc: Benjamin Herrenschmidt Cc: Eduardo Habkost Cc: Haiyang Zhang Cc: Igor Mammedov Cc: "K. Y. 
Srinivasan" Cc: Michael Ellerman Cc: Paul Mackerras Cc: Stephen Hemminger Cc: Vitaly Kuznetsov Cc: Wei Liu Cc: Yumei Huang Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 11 +++++------ include/linux/memory_hotplug.h | 2 ++ mm/memory_hotplug.c | 8 ++++---- 3 files changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 7d2f829d00d7..dbec3a05590a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -34,7 +34,7 @@ static const char *const online_type_to_str[] = { [MMOP_ONLINE_MOVABLE] = "online_movable", }; -static int memhp_online_type_from_str(const char *str) +int memhp_online_type_from_str(const char *str) { int i; @@ -386,13 +386,12 @@ static ssize_t auto_online_blocks_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - if (sysfs_streq(buf, "online")) - memhp_default_online_type = MMOP_ONLINE; - else if (sysfs_streq(buf, "offline")) - memhp_default_online_type = MMOP_OFFLINE; - else + const int online_type = memhp_online_type_from_str(buf); + + if (online_type < 0) return -EINVAL; + memhp_default_online_type = online_type; return count; } diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index c6e090b34c4b..ef55115320fb 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -117,6 +117,8 @@ extern int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions); extern u64 max_mem_size; +extern int memhp_online_type_from_str(const char *str); + /* Default online_type (MMOP_*) when new memory blocks are added. */ extern int memhp_default_online_type; /* If movable_node boot option specified */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 9436f7e6257a..2fb78c5ebaf3 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -74,10 +74,10 @@ int memhp_default_online_type = MMOP_ONLINE; static int __init setup_memhp_default_state(char *str) { - if (!strcmp(str, "online")) - memhp_default_online_type = MMOP_ONLINE; - else if (!strcmp(str, "offline")) - memhp_default_online_type = MMOP_OFFLINE; + const int online_type = memhp_online_type_from_str(str); + + if (online_type >= 0) + memhp_default_online_type = online_type; return 1; } From 104049017b774036fa971722a202a17e6585a200 Mon Sep 17 00:00:00 2001 From: chenqiwu Date: Mon, 6 Apr 2020 20:07:48 -0700 Subject: [PATCH 071/158] mm/memory_hotplug.c: use __pfn_to_section() instead of open-coding Use __pfn_to_section() API instead of open-coding for better code readability. 
Signed-off-by: chenqiwu Signed-off-by: Andrew Morton Acked-by: David Hildenbrand Link: http://lkml.kernel.org/r/1584345134-16671-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2fb78c5ebaf3..635e8e286598 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -495,7 +495,7 @@ static void __remove_section(unsigned long pfn, unsigned long nr_pages, unsigned long map_offset, struct vmem_altmap *altmap) { - struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn)); + struct mem_section *ms = __pfn_to_section(pfn); if (WARN_ON_ONCE(!valid_section(ms))) return; From 27d80fa24326b7b33c8ee7527843776e5df808a7 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:07:51 -0700 Subject: [PATCH 072/158] mm/shmem.c: distribute switch variables for initialization Variables declared in a switch statement before any case statements cannot be automatically initialized with compiler instrumentation (as they are not part of any execution flow). With GCC's proposed automatic stack variable initialization feature, this triggers a warning (and they don't get initialized). Clang's automatic stack variable initialization (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't initialize such variables[1]. Note that these warnings (or silent skipping) happen before the dead-store elimination optimization phase, so even when the automatic initializations are later elided in favor of direct initializations, the warnings remain. To avoid these problems, move such variables into the "case" where they're used or lift them up into the main function body. mm/shmem.c: In function `shmem_getpage_gfp': mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable] 1816 | loff_t i_size; | ^~~~~~ [1] https://bugs.llvm.org/show_bug.cgi?id=44916 Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Cc: Hugh Dickins Cc: Alexander Potapenko Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org Signed-off-by: Linus Torvalds --- mm/shmem.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 2c255f383608..97448a46a4dc 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1812,17 +1812,20 @@ repeat: if (shmem_huge == SHMEM_HUGE_FORCE) goto alloc_huge; switch (sbinfo->huge) { - loff_t i_size; - pgoff_t off; case SHMEM_HUGE_NEVER: goto alloc_nohuge; - case SHMEM_HUGE_WITHIN_SIZE: + case SHMEM_HUGE_WITHIN_SIZE: { + loff_t i_size; + pgoff_t off; + off = round_up(index, HPAGE_PMD_NR); i_size = round_up(i_size_read(inode), PAGE_SIZE); if (i_size >= HPAGE_PMD_SIZE && i_size >> PAGE_SHIFT >= off) goto alloc_huge; - /* fallthrough */ + + fallthrough; + } case SHMEM_HUGE_ADVISE: if (sgp_huge == SGP_HUGE) goto alloc_huge; From 343c3d7f0927e000427fae5e361aa560f83dd5b5 Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Mon, 6 Apr 2020 20:07:54 -0700 Subject: [PATCH 073/158] mm/shmem.c: clean code by removing unnecessary assignment Previously, 0 was assigned to the variable 'error', but the variable was never read before being reassigned later. So the assignment can be removed.
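The dead-store pattern is easiest to see in a minimal standalone program (illustrative only; do_init() is a hypothetical stand-in for security_inode_init_security(), and -95 stands in for -EOPNOTSUPP):

    #include <stdio.h>

    static int do_init(void)
    {
            return -95;                     /* pretend -EOPNOTSUPP */
    }

    int main(void)
    {
            int error;                      /* "int error = 0;" here would be a dead store */

            error = do_init();
            if (error && error != -95) {    /* only other errors are fatal */
                    fprintf(stderr, "fatal: %d\n", error);
                    return 1;
            }
            printf("tolerated error %d, continuing\n", error);
            return 0;
    }

Restructuring the condition so that the tolerated error is filtered in the if itself lets the "error = 0" initializer disappear entirely, which is what the diff below does.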
Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Matthew Wilcox (Oracle) Acked-by: Pankaj Gupta Cc: Hugh Dickins Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/shmem.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 97448a46a4dc..e23fea40767e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3120,12 +3120,9 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s error = security_inode_init_security(inode, dir, &dentry->d_name, shmem_initxattrs, NULL); - if (error) { - if (error != -EOPNOTSUPP) { - iput(inode); - return error; - } - error = 0; + if (error && error != -EOPNOTSUPP) { + iput(inode); + return error; } inode->i_size = len-1; From 71725ed10c40696dc6bdccf8e225815dcef24dba Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Mon, 6 Apr 2020 20:07:57 -0700 Subject: [PATCH 074/158] mm: huge tmpfs: try to split_huge_page() when punching hole Yang Shi writes: Currently, when truncating a shmem file, if the range is partly in a THP (start or end is in the middle of THP), the pages actually will just get cleared rather than being freed, unless the range covers the whole THP. Even though all the subpages are truncated (randomly or sequentially), the THP may still be kept in page cache. This might be fine for some usecases which prefer preserving THP, but balloon inflation is handled in base page size. So when using shmem THP as memory backend, QEMU inflation actually doesn't work as expected since it doesn't free memory. But the inflation usecase really needs to get the memory freed. (Anonymous THP will also not get freed right away, but will be freed eventually when all subpages are unmapped: whereas shmem THP still stays in page cache.) Split THP right away when doing partial hole punch, and if split fails just clear the page so that read of the punched area will return zeroes. Hugh Dickins adds: Our earlier "team of pages" huge tmpfs implementation worked in the way that Yang Shi proposes; and we have been using this patch to continue to split the huge page when hole-punched or truncated, since converting over to the compound page implementation. Although huge tmpfs gives out huge pages when available, if the user specifically asks to truncate or punch a hole (perhaps to free memory, perhaps to reduce the memcg charge), then the filesystem should do so as best it can, splitting the huge page. That is not always possible: any additional reference to the huge page prevents split_huge_page() from succeeding, so the result can be flaky. But in practice it works successfully enough that we've not seen any problem from that. Add shmem_punch_compound() to encapsulate the decision of when a split is needed, and doing the split if so. Using this simplifies the flow in shmem_undo_range(); and the first (trylock) pass does not need to do any page clearing on failure, because the second pass will either succeed or do that clearing. Following the example of zero_user_segment() when clearing a partial page, add flush_dcache_page() and set_page_dirty() when clearing a hole - though I'm not certain that either is needed. But: split_huge_page() would be sure to fail if shmem_undo_range()'s pagevec holds further references to the huge page. The easiest way to fix that is for find_get_entries() to return early, as soon as it has put one compound head or tail into the pagevec. 
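In toy form, that convention is just "stop filling the batch at the first compound entry" (a self-contained model with a stand-in struct page, not the kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    struct page { long index; bool compound; };

    /* Fill a batch, but stop once a compound page has been included, so
     * the caller never accumulates extra references that would make a
     * later split fail. */
    static int find_entries(struct page *pages, int npages,
                            struct page **batch, int max)
    {
            int n = 0;

            for (int i = 0; i < npages && n < max; i++) {
                    batch[n++] = &pages[i];
                    if (pages[i].compound)
                            break;
            }
            return n;
    }

    int main(void)
    {
            struct page pages[] = {
                    { 0, false }, { 1, false }, { 2, true }, { 3, false },
            };
            struct page *batch[8];
            int n = find_entries(pages, 4, batch, 8);

            printf("batch stopped after %d entries (last index %ld)\n",
                   n, batch[n - 1]->index);
            return 0;
    }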
At first this felt like a hack; but on examination, this convention better suits all its callers - or will do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping() and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec speedup by checking for compound pages there. Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Cc: Yang Shi Cc: Alexander Duyck Cc: "Michael S. Tsirkin" Cc: David Hildenbrand Cc: "Kirill A. Shutemov" Cc: Matthew Wilcox Cc: Andrea Arcangeli Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils Signed-off-by: Linus Torvalds --- mm/filemap.c | 14 +++++++- mm/shmem.c | 98 +++++++++++++++++++++++----------------------------- mm/swap.c | 4 +++ 3 files changed, 60 insertions(+), 56 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 0fbdc8e30dd2..23a051a7ef0f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1693,6 +1693,11 @@ EXPORT_SYMBOL(pagecache_get_page); * Any shadow entries of evicted pages, or swap entries from * shmem/tmpfs, are included in the returned array. * + * If it finds a Transparent Huge Page, head or tail, find_get_entries() + * stops at that page: the caller is likely to have a better way to handle + * the compound page as a whole, and then skip its extent, than repeatedly + * calling find_get_entries() to return all its tails. + * * Return: the number of pages and shadow entries which were found. */ unsigned find_get_entries(struct address_space *mapping, @@ -1724,8 +1729,15 @@ unsigned find_get_entries(struct address_space *mapping, /* Has the page moved or been split? */ if (unlikely(page != xas_reload(&xas))) goto put_page; - page = find_subpage(page, xas.xa_index); + /* + * Terminate early on finding a THP, to allow the caller to + * handle it all at once; but continue if this is hugetlbfs. + */ + if (PageTransHuge(page) && !PageHuge(page)) { + page = find_subpage(page, xas.xa_index); + nr_entries = ret + 1; + } export: indices[ret] = xas.xa_index; entries[ret] = page; diff --git a/mm/shmem.c b/mm/shmem.c index e23fea40767e..8d0aba1cc155 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -788,6 +788,32 @@ void shmem_unlock_mapping(struct address_space *mapping) } } +/* + * Check whether a hole-punch or truncation needs to split a huge page, + * returning true if no split was required, or the split has been successful. + * + * Eviction (or truncation to 0 size) should never need to split a huge page; + * but in rare cases might do so, if shmem_undo_range() failed to trylock on + * head, and then succeeded to trylock on tail. + * + * A split can only succeed when there are no additional references on the + * huge page: so the split below relies upon find_get_entries() having stopped + * when it found a subpage of the huge page, without getting further references. + */ +static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end) +{ + if (!PageTransCompound(page)) + return true; + + /* Just proceed to delete a huge page wholly within the range punched */ + if (PageHead(page) && + page->index >= start && page->index + HPAGE_PMD_NR <= end) + return true; + + /* Try to split huge page, so we can truly punch the hole or truncate */ + return split_huge_page(page) >= 0; +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
@@ -838,31 +864,11 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, if (!trylock_page(page)) continue; - if (PageTransTail(page)) { - /* Middle of THP: zero out the page */ - clear_highpage(page); - unlock_page(page); - continue; - } else if (PageTransHuge(page)) { - if (index == round_down(end, HPAGE_PMD_NR)) { - /* - * Range ends in the middle of THP: - * zero out the page - */ - clear_highpage(page); - unlock_page(page); - continue; - } - index += HPAGE_PMD_NR - 1; - i += HPAGE_PMD_NR - 1; - } - - if (!unfalloc || !PageUptodate(page)) { - VM_BUG_ON_PAGE(PageTail(page), page); - if (page_mapping(page) == mapping) { - VM_BUG_ON_PAGE(PageWriteback(page), page); + if ((!unfalloc || !PageUptodate(page)) && + page_mapping(page) == mapping) { + VM_BUG_ON_PAGE(PageWriteback(page), page); + if (shmem_punch_compound(page, start, end)) truncate_inode_page(mapping, page); - } } unlock_page(page); } @@ -936,43 +942,25 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, lock_page(page); - if (PageTransTail(page)) { - /* Middle of THP: zero out the page */ - clear_highpage(page); - unlock_page(page); - /* - * Partial thp truncate due 'start' in middle - * of THP: don't need to look on these pages - * again on !pvec.nr restart. - */ - if (index != round_down(end, HPAGE_PMD_NR)) - start++; - continue; - } else if (PageTransHuge(page)) { - if (index == round_down(end, HPAGE_PMD_NR)) { - /* - * Range ends in the middle of THP: - * zero out the page - */ - clear_highpage(page); - unlock_page(page); - continue; - } - index += HPAGE_PMD_NR - 1; - i += HPAGE_PMD_NR - 1; - } - if (!unfalloc || !PageUptodate(page)) { - VM_BUG_ON_PAGE(PageTail(page), page); - if (page_mapping(page) == mapping) { - VM_BUG_ON_PAGE(PageWriteback(page), page); - truncate_inode_page(mapping, page); - } else { + if (page_mapping(page) != mapping) { /* Page was replaced by swap: retry */ unlock_page(page); index--; break; } + VM_BUG_ON_PAGE(PageWriteback(page), page); + if (shmem_punch_compound(page, start, end)) + truncate_inode_page(mapping, page); + else { + /* Wipe the page and don't get stuck */ + clear_highpage(page); + flush_dcache_page(page); + set_page_dirty(page); + if (index < + round_up(start, HPAGE_PMD_NR)) + start = index + 1; + } } unlock_page(page); } diff --git a/mm/swap.c b/mm/swap.c index 18505990c3b1..bf9a79fed62d 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1004,6 +1004,10 @@ void __pagevec_lru_add(struct pagevec *pvec) * ascending indexes. There may be holes in the indices due to * not-present entries. * + * Only one subpage of a Transparent Huge Page is returned in one call: + * allowing truncate_inode_pages_range() to evict the whole THP without + * cycling through a pagevec of extra references. + * * pagevec_lookup_entries() returns the number of entries which were * found. */ From 4708f31885a0d3e41a03c39501df697e73406f6a Mon Sep 17 00:00:00 2001 From: Palmer Dabbelt Date: Mon, 6 Apr 2020 20:08:00 -0700 Subject: [PATCH 075/158] mm: prevent a warning when casting void* -> enum I recently built the RISC-V port with LLVM trunk, which has introduced a new warning when casting from a pointer to an enum of a smaller size. This patch simply casts to a long in the middle to stop the warning. I'd be surprised if this is the only one in the kernel, but it's the only one I saw.
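A minimal standalone reproduction of the pattern (hypothetical enum and callback, not the rmap code):

    #include <stdio.h>

    enum mode { MODE_A = 1, MODE_B = 2 };

    /* Callbacks often smuggle small integers through a void * cookie. */
    static void handler(void *arg)
    {
            /* Casting void * straight to the enum can trip clang's
             * -Wvoid-pointer-to-enum-cast; going via long is silent
             * (on the kernel's targets, long is pointer-sized). */
            enum mode m = (enum mode)(long)arg;

            printf("mode = %d\n", m);
    }

    int main(void)
    {
            handler((void *)(long)MODE_B);
            return 0;
    }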
Signed-off-by: Palmer Dabbelt Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com Signed-off-by: Linus Torvalds --- mm/rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/rmap.c b/mm/rmap.c index ed8889bf4ede..f79a206b271a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1372,7 +1372,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, struct page *subpage; bool ret = true; struct mmu_notifier_range range; - enum ttu_flags flags = (enum ttu_flags)arg; + enum ttu_flags flags = (enum ttu_flags)(long)arg; /* munlock has nothing to gain from examining un-locked vmas */ if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED)) From bb8b93b5b65130138d3c80f3136af721863f561a Mon Sep 17 00:00:00 2001 From: "Maciej S. Szmigiero" Date: Mon, 6 Apr 2020 20:08:03 -0700 Subject: [PATCH 076/158] mm/zswap: allow setting default status, compressor and allocator in Kconfig The compressed cache for swap pages (zswap) currently needs from 1 to 3 extra kernel command line parameters in order to make it work: it has to be enabled by adding a "zswap.enabled=1" command line parameter and if one wants a different compressor or pool allocator than the default lzo / zbud combination then these choices also need to be specified on the kernel command line in additional parameters. Using a different compressor and allocator for zswap is actually pretty common as guides often recommend using the lz4 / z3fold pair instead of the default one. In such a case it is also necessary to remember to enable the appropriate compression algorithm and pool allocator in the kernel config manually. Let's avoid the need for adding these kernel command line parameters and automatically pull in the dependencies for the selected compressor algorithm and pool allocator by adding appropriate default switches to Kconfig. The default values for these options match what the code was using previously as its defaults. Signed-off-by: Maciej S. Szmigiero Signed-off-by: Andrew Morton Reviewed-by: Vitaly Wool Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name Signed-off-by: Linus Torvalds --- Documentation/vm/zswap.rst | 20 ++++--- mm/Kconfig | 118 ++++++++++++++++++++++++++++++++++++- mm/zswap.c | 24 ++++---- 3 files changed, 141 insertions(+), 21 deletions(-) diff --git a/Documentation/vm/zswap.rst b/Documentation/vm/zswap.rst index 61f6185188cd..f8c6a79d7c70 100644 --- a/Documentation/vm/zswap.rst +++ b/Documentation/vm/zswap.rst @@ -35,9 +35,11 @@ Zswap evicts pages from compressed cache on an LRU basis to the backing swap device when the compressed pool reaches its size limit. This requirement had been identified in prior community discussions. -Zswap is disabled by default but can be enabled at boot time by setting -the ``enabled`` attribute to 1 at boot time. ie: ``zswap.enabled=1``. Zswap -can also be enabled and disabled at runtime using the sysfs interface. +Whether Zswap is enabled at the boot time depends on whether +the ``CONFIG_ZSWAP_DEFAULT_ON`` Kconfig option is enabled or not. +This setting can then be overridden by providing the kernel command line +``zswap.enabled=`` option, for example ``zswap.enabled=0``. +Zswap can also be enabled and disabled at runtime using the sysfs interface. An example command to enable zswap at runtime, assuming sysfs is mounted at ``/sys``, is:: @@ -64,9 +66,10 @@ allocation in zpool is not directly accessible by address.
Rather, a handle is returned by the allocation routine and that handle must be mapped before being accessed. The compressed memory pool grows on demand and shrinks as compressed pages are freed. The pool is not preallocated. By default, a zpool -of type zbud is created, but it can be selected at boot time by -setting the ``zpool`` attribute, e.g. ``zswap.zpool=zbud``. It can -also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: +of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, +but it can be overridden at boot time by setting the ``zpool`` attribute, +e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs +``zpool`` attribute, e.g.:: echo zbud > /sys/module/zswap/parameters/zpool @@ -97,8 +100,9 @@ controlled policy: * max_pool_percent - The maximum percentage of memory that the compressed pool can occupy. -The default compressor is lzo, but it can be selected at boot time by -setting the ``compressor`` attribute, e.g. ``zswap.compressor=lzo``. +The default compressor is selected in ``CONFIG_ZSWAP_COMPRESSOR_DEFAULT`` +Kconfig option, but it can be overridden at boot time by setting the +``compressor`` attribute, e.g. ``zswap.compressor=lzo``. It can also be changed at runtime using the sysfs "compressor" attribute, e.g.:: diff --git a/mm/Kconfig b/mm/Kconfig index d286dc54458c..691021492e78 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -533,7 +533,6 @@ config MEM_SOFT_DIRTY config ZSWAP bool "Compressed cache for swap pages (EXPERIMENTAL)" depends on FRONTSWAP && CRYPTO=y - select CRYPTO_LZO select ZPOOL help A lightweight compressed cache for swap pages. It takes @@ -549,6 +548,123 @@ config ZSWAP they have not be fully explored on the large set of potential configurations and workloads that exist. +choice + prompt "Compressed cache for swap pages default compressor" + depends on ZSWAP + default ZSWAP_COMPRESSOR_DEFAULT_LZO + help + Selects the default compression algorithm for the compressed cache + for swap pages. + + For an overview what kind of performance can be expected from + a particular compression algorithm please refer to the benchmarks + available at the following LWN page: + https://lwn.net/Articles/751795/ + + If in doubt, select 'LZO'. + + The selection made here can be overridden by using the kernel + command line 'zswap.compressor=' option. + +config ZSWAP_COMPRESSOR_DEFAULT_DEFLATE + bool "Deflate" + select CRYPTO_DEFLATE + help + Use the Deflate algorithm as the default compression algorithm. + +config ZSWAP_COMPRESSOR_DEFAULT_LZO + bool "LZO" + select CRYPTO_LZO + help + Use the LZO algorithm as the default compression algorithm. + +config ZSWAP_COMPRESSOR_DEFAULT_842 + bool "842" + select CRYPTO_842 + help + Use the 842 algorithm as the default compression algorithm. + +config ZSWAP_COMPRESSOR_DEFAULT_LZ4 + bool "LZ4" + select CRYPTO_LZ4 + help + Use the LZ4 algorithm as the default compression algorithm. + +config ZSWAP_COMPRESSOR_DEFAULT_LZ4HC + bool "LZ4HC" + select CRYPTO_LZ4HC + help + Use the LZ4HC algorithm as the default compression algorithm. + +config ZSWAP_COMPRESSOR_DEFAULT_ZSTD + bool "zstd" + select CRYPTO_ZSTD + help + Use the zstd algorithm as the default compression algorithm. 
+endchoice + +config ZSWAP_COMPRESSOR_DEFAULT + string + depends on ZSWAP + default "deflate" if ZSWAP_COMPRESSOR_DEFAULT_DEFLATE + default "lzo" if ZSWAP_COMPRESSOR_DEFAULT_LZO + default "842" if ZSWAP_COMPRESSOR_DEFAULT_842 + default "lz4" if ZSWAP_COMPRESSOR_DEFAULT_LZ4 + default "lz4hc" if ZSWAP_COMPRESSOR_DEFAULT_LZ4HC + default "zstd" if ZSWAP_COMPRESSOR_DEFAULT_ZSTD + default "" + +choice + prompt "Compressed cache for swap pages default allocator" + depends on ZSWAP + default ZSWAP_ZPOOL_DEFAULT_ZBUD + help + Selects the default allocator for the compressed cache for + swap pages. + The default is 'zbud' for compatibility, however please do + read the description of each of the allocators below before + making a right choice. + + The selection made here can be overridden by using the kernel + command line 'zswap.zpool=' option. + +config ZSWAP_ZPOOL_DEFAULT_ZBUD + bool "zbud" + select ZBUD + help + Use the zbud allocator as the default allocator. + +config ZSWAP_ZPOOL_DEFAULT_Z3FOLD + bool "z3fold" + select Z3FOLD + help + Use the z3fold allocator as the default allocator. + +config ZSWAP_ZPOOL_DEFAULT_ZSMALLOC + bool "zsmalloc" + select ZSMALLOC + help + Use the zsmalloc allocator as the default allocator. +endchoice + +config ZSWAP_ZPOOL_DEFAULT + string + depends on ZSWAP + default "zbud" if ZSWAP_ZPOOL_DEFAULT_ZBUD + default "z3fold" if ZSWAP_ZPOOL_DEFAULT_Z3FOLD + default "zsmalloc" if ZSWAP_ZPOOL_DEFAULT_ZSMALLOC + default "" + +config ZSWAP_DEFAULT_ON + bool "Enable the compressed cache for swap pages by default" + depends on ZSWAP + help + If selected, the compressed cache for swap pages will be enabled + at boot, otherwise it will be disabled. + + The selection made here can be overridden by using the kernel + command line 'zswap.enabled=' option. 
+ config ZPOOL tristate "Common API for compressed memory storage" help diff --git a/mm/zswap.c b/mm/zswap.c index 55094e63b72d..fbb782924ccc 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -77,8 +77,8 @@ static bool zswap_pool_reached_full; #define ZSWAP_PARAM_UNSET "" -/* Enable/disable zswap (disabled by default) */ -static bool zswap_enabled; +/* Enable/disable zswap */ +static bool zswap_enabled = IS_ENABLED(CONFIG_ZSWAP_DEFAULT_ON); static int zswap_enabled_param_set(const char *, const struct kernel_param *); static struct kernel_param_ops zswap_enabled_param_ops = { @@ -88,8 +88,7 @@ static struct kernel_param_ops zswap_enabled_param_ops = { module_param_cb(enabled, &zswap_enabled_param_ops, &zswap_enabled, 0644); /* Crypto compressor to use */ -#define ZSWAP_COMPRESSOR_DEFAULT "lzo" -static char *zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT; +static char *zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; static int zswap_compressor_param_set(const char *, const struct kernel_param *); static struct kernel_param_ops zswap_compressor_param_ops = { @@ -101,8 +100,7 @@ module_param_cb(compressor, &zswap_compressor_param_ops, &zswap_compressor, 0644); /* Compressed storage zpool to use */ -#define ZSWAP_ZPOOL_DEFAULT "zbud" -static char *zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT; +static char *zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT; static int zswap_zpool_param_set(const char *, const struct kernel_param *); static struct kernel_param_ops zswap_zpool_param_ops = { .set = zswap_zpool_param_set, @@ -599,11 +597,12 @@ static __init struct zswap_pool *__zswap_pool_create_fallback(void) bool has_comp, has_zpool; has_comp = crypto_has_comp(zswap_compressor, 0, 0); - if (!has_comp && strcmp(zswap_compressor, ZSWAP_COMPRESSOR_DEFAULT)) { + if (!has_comp && strcmp(zswap_compressor, + CONFIG_ZSWAP_COMPRESSOR_DEFAULT)) { pr_err("compressor %s not available, using default %s\n", - zswap_compressor, ZSWAP_COMPRESSOR_DEFAULT); + zswap_compressor, CONFIG_ZSWAP_COMPRESSOR_DEFAULT); param_free_charp(&zswap_compressor); - zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT; + zswap_compressor = CONFIG_ZSWAP_COMPRESSOR_DEFAULT; has_comp = crypto_has_comp(zswap_compressor, 0, 0); } if (!has_comp) { @@ -614,11 +613,12 @@ static __init struct zswap_pool *__zswap_pool_create_fallback(void) } has_zpool = zpool_has_pool(zswap_zpool_type); - if (!has_zpool && strcmp(zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT)) { + if (!has_zpool && strcmp(zswap_zpool_type, + CONFIG_ZSWAP_ZPOOL_DEFAULT)) { pr_err("zpool %s not available, using default %s\n", - zswap_zpool_type, ZSWAP_ZPOOL_DEFAULT); + zswap_zpool_type, CONFIG_ZSWAP_ZPOOL_DEFAULT); param_free_charp(&zswap_zpool_type); - zswap_zpool_type = ZSWAP_ZPOOL_DEFAULT; + zswap_zpool_type = CONFIG_ZSWAP_ZPOOL_DEFAULT; has_zpool = zpool_has_pool(zswap_zpool_type); } if (!has_zpool) { From 77337edee7598d82fb5acf66cb91a5b3f0c46add Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:06 -0700 Subject: [PATCH 077/158] mm/compaction: add missing annotation for compact_lock_irqsave Sparse reports a warning at compact_lock_irqsave() warning: context imbalance in compact_lock_irqsave() - wrong count at exit The root cause is the missing annotation at compact_lock_irqsave() Add the missing __acquires(lock) annotation. 
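For readers who haven't met sparse's context tracking: the annotations tell sparse how a function changes the "lock held" count on entry and exit, so acquire/release counts balance across function boundaries. A standalone sketch mirroring the kernel's definitions under __CHECKER__ (the userspace mutex is purely for illustration):

    #include <pthread.h>
    #include <stdio.h>

    #ifdef __CHECKER__                      /* defined when sparse runs */
    # define __acquires(x)  __attribute__((context(x, 0, 1)))
    # define __releases(x)  __attribute__((context(x, 1, 0)))
    #else
    # define __acquires(x)
    # define __releases(x)
    #endif

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Exits with "lock" held: context goes 0 -> 1. */
    static void take_lock(void) __acquires(lock)
    {
            pthread_mutex_lock(&lock);
    }

    /* Must be entered with "lock" held: context goes 1 -> 0. */
    static void drop_lock(void) __releases(lock)
    {
            pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
            take_lock();
            printf("critical section\n");
            drop_lock();
            return 0;
    }

Without the annotation, sparse sees a function that changes the lock state without declaring it, and emits the "context imbalance" warnings fixed throughout this series.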
Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200214204741.94112-6-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/compaction.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/compaction.c b/mm/compaction.c index 46af63eb8212..46f0fcc93081 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -481,6 +481,7 @@ static bool test_and_set_skip(struct compact_control *cc, struct page *page, */ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags, struct compact_control *cc) + __acquires(lock) { /* Track if the lock is contended in async mode */ if (cc->mode == MIGRATE_ASYNC && !cc->contended) { From 1b2a1e7bb9ce9936d6839c1023737ccff7889dee Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:09 -0700 Subject: [PATCH 078/158] mm/hugetlb: add missing annotation for gather_surplus_pages() Sparse reports a warning at gather_surplus_pages() warning: context imbalance in hugetlb_cow() - unexpected unlock The root cause is the missing annotation at gather_surplus_pages() Add the missing __must_hold(&hugetlb_lock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Reviewed-by: Mike Kravetz Link: http://lkml.kernel.org/r/20200214204741.94112-7-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f9ea1e5197b4..f5fb53fdfa02 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2010,6 +2010,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, * of size 'delta'. */ static int gather_surplus_pages(struct hstate *h, int delta) + __must_hold(&hugetlb_lock) { struct list_head surplus_list; struct page *page, *tmp; From 959a7e136d52bff429c774a4ed451b094706116b Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:12 -0700 Subject: [PATCH 079/158] mm/mempolicy: add missing annotation for queue_pages_pmd() Sparse reports a warning at queue_pages_pmd() context imbalance in queue_pages_pmd() - unexpected unlock The root cause is the missing annotation at queue_pages_pmd() Add the missing __releases(ptl) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200214204741.94112-8-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 145be04b7108..c4118bb508ec 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -442,6 +442,7 @@ static inline bool queue_pages_required(struct page *page, */ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr, unsigned long end, struct mm_walk *walk) + __releases(ptl) { int ret = 0; struct page *page; From 31364c2e168b48c8596d4d43bad9bff70e21976b Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:15 -0700 Subject: [PATCH 080/158] mm/slub: add missing annotation for get_map() Sparse reports a warning at get_map() warning: context imbalance in get_map() - wrong count at exit The root cause is the missing annotation at get_map() Add the missing __acquires(&object_map_lock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200214204741.94112-9-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/slub.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/slub.c b/mm/slub.c index 3098e0cf2899..c6d603648c79 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -449,6 +449,7 @@ static DEFINE_SPINLOCK(object_map_lock); * not vanish from under us. */ static unsigned long *get_map(struct kmem_cache *s, struct page *page) + __acquires(&object_map_lock) { void *p; void *addr = page_address(page); From 81aba9e06ba8cc1e2494258bc4b48875af522f32 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:18 -0700 Subject: [PATCH 081/158] mm/slub: add missing annotation for put_map() Sparse reports a warning at put_map() warning: context imbalance in put_map() - unexpected unlock The root cause is the missing annotation at put_map() Add the missing __releases(&object_map_lock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200214204741.94112-10-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index c6d603648c79..332d4b459a90 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -466,7 +466,7 @@ static unsigned long *get_map(struct kmem_cache *s, struct page *page) return object_map; } -static void put_map(unsigned long *map) +static void put_map(unsigned long *map) __releases(&object_map_lock) { VM_BUG_ON(map != object_map); lockdep_assert_held(&object_map_lock); From cfc451cfdf1dde36eada652a4b16d9b1524c5932 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:21 -0700 Subject: [PATCH 082/158] mm/zsmalloc: add missing annotation for migrate_read_lock() Sparse reports a warning at migrate_read_lock() warning: context imbalance in migrate_read_lock() - wrong count at exit The root cause is the missing annotation at migrate_read_lock() Add the missing __acquires(&zspage->lock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Acked-by: Minchan Kim Link: http://lkml.kernel.org/r/20200214204741.94112-11-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 22d17ecfe7df..da70817b4ed8 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1833,7 +1833,7 @@ static void migrate_lock_init(struct zspage *zspage) rwlock_init(&zspage->lock); } -static void migrate_read_lock(struct zspage *zspage) +static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock) { read_lock(&zspage->lock); } From 8a374cccee8cdfa2902bb7a07a10671ffc1a72c1 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:24 -0700 Subject: [PATCH 083/158] mm/zsmalloc: add missing annotation for migrate_read_unlock() Sparse reports a warning at migrate_read_unlock() warning: context imbalance in migrate_read_unlock() - unexpected unlock The root cause is the missing annotation at migrate_read_unlock() Add the missing __releases(&zspage->lock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Acked-by: Minchan Kim Link: http://lkml.kernel.org/r/20200214204741.94112-12-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index da70817b4ed8..2eab424c8c67 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1838,7 +1838,7 @@ static void migrate_read_lock(struct zspage *zspage) __acquires(&zspage->lock) read_lock(&zspage->lock); } -static void migrate_read_unlock(struct zspage *zspage) +static void migrate_read_unlock(struct zspage *zspage) __releases(&zspage->lock) { read_unlock(&zspage->lock); } From 70c7ec95bece1bd01e25fbc4f17c9262445de417 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:27 -0700 Subject: [PATCH 084/158] mm/zsmalloc: add missing annotation for pin_tag() Sparse reports a warning at pin_tag() warning: context imbalance in pin_tag() - wrong count at exit The root cause is the missing annotation at pin_tag() Add the missing __acquires(bitlock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Acked-by: Minchan Kim Link: http://lkml.kernel.org/r/20200214204741.94112-13-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2eab424c8c67..7bac76ae11b3 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -891,7 +891,7 @@ static inline int trypin_tag(unsigned long handle) return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle); } -static void pin_tag(unsigned long handle) +static void pin_tag(unsigned long handle) __acquires(bitlock) { bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle); } From bc22b18b1f805dd785c408060f21aecb4c500ea5 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:30 -0700 Subject: [PATCH 085/158] mm/zsmalloc: add missing annotation for unpin_tag() Sparse reports a warning at unpin_tag() warning: context imbalance in unpin_tag() - unexpected unlock The root cause is the missing annotation at unpin_tag() Add the missing __releases(bitlock) annotation Signed-off-by: Jules Irenge Signed-off-by: Andrew Morton Acked-by: Minchan Kim Link: http://lkml.kernel.org/r/20200214204741.94112-14-jbi.octave@gmail.com Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 7bac76ae11b3..2aa2d524a343 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -896,7 +896,7 @@ static void pin_tag(unsigned long handle) __acquires(bitlock) bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle); } -static void unpin_tag(unsigned long handle) +static void unpin_tag(unsigned long handle) __releases(bitlock) { bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle); } From 552657b7b3343851916fde7e4fd6bfb6516d2bcb Mon Sep 17 00:00:00 2001 From: chenqiwu Date: Mon, 6 Apr 2020 20:08:33 -0700 Subject: [PATCH 086/158] mm: fix ambiguous comments for better code readability The @pfn parameter of remap_pfn_range() passed by the caller is actually a page-frame number converted from the corresponding physical address of kernel memory; the original comment is ambiguous and may mislead users. Meanwhile, there is a confusing typo, "VMM", in the comment of vm_area_struct. Fixing both makes the code more readable. Signed-off-by: chenqiwu Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds --- include/linux/mm_types.h | 4 ++-- mm/memory.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index dd555e6d23f3..4aba6c0c2ba8 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -289,8 +289,8 @@ struct vm_userfaultfd_ctx {}; #endif /* CONFIG_USERFAULTFD */ /* - * This struct defines a memory VMM memory area. There is one of these - * per VM-area/task. A VM area is any part of the process virtual memory + * This struct describes a virtual memory area. There is one of these + * per VM-area/task.
A VM area is any part of the process virtual memory * space that has a special rule for the page-fault handlers (ie a shared * library, the executable area etc). */ diff --git a/mm/memory.c b/mm/memory.c index 8ac9af73e9d2..19874d133a66 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1952,7 +1952,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd, * @vma: user vma to map to * @addr: target user address to start at * @pfn: page frame number of kernel physical memory address - * @size: size of map area + * @size: size of mapping area * @prot: page protection flags for this mapping * * Note: this is only safe if the mm semaphore is held when called. From e46b893dd113253c2ff032aed22d9df404c5511d Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Mon, 6 Apr 2020 20:08:36 -0700 Subject: [PATCH 087/158] mm/mm_init.c: clean code. Use BUILD_BUG_ON when comparing compile time constant MAX_ZONELISTS is a compile time constant, so it should be compared using BUILD_BUG_ON not BUG_ON. Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Reviewed-by: Wei Yang Link: http://lkml.kernel.org/r/20200228224617.11343-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/mm_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index 5c918388de99..7da6991d9435 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -37,7 +37,7 @@ void __init mminit_verify_zonelist(void) struct zonelist *zonelist; int i, listid, zoneid; - BUG_ON(MAX_ZONELISTS > 2); + BUILD_BUG_ON(MAX_ZONELISTS > 2); for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) { /* Identify the zone and nodelist */ From e4a9bc58969abc695a6ebb06d801a99c1bafc001 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:08:39 -0700 Subject: [PATCH 088/158] mm: use fallthrough; Convert the various /* fallthrough */ comments to the pseudo-keyword fallthrough; Done via script: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/ Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Reviewed-by: Gustavo A. R. Silva Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com Signed-off-by: Linus Torvalds --- mm/gup.c | 2 +- mm/hugetlb_cgroup.c | 6 +++--- mm/ksm.c | 3 +-- mm/list_lru.c | 2 +- mm/memcontrol.c | 2 +- mm/mempolicy.c | 3 --- mm/mmap.c | 5 ++--- mm/shmem.c | 2 +- mm/zsmalloc.c | 2 +- 9 files changed, 11 insertions(+), 16 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index b185377c38b7..a21230569520 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1102,7 +1102,7 @@ retry: goto retry; case -EBUSY: ret = 0; - /* FALLTHRU */ + fallthrough; case -EFAULT: case -ENOMEM: case -EHWPOISON: diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c index c2d7ae6cabd1..aabf65d4d91b 100644 --- a/mm/hugetlb_cgroup.c +++ b/mm/hugetlb_cgroup.c @@ -467,14 +467,14 @@ static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v) switch (MEMFILE_ATTR(cft->private)) { case RES_RSVD_USAGE: counter = &h_cg->rsvd_hugepage[idx]; - /* Fall through. */ + fallthrough; case RES_USAGE: val = (u64)page_counter_read(counter); seq_printf(seq, "%llu\n", val * PAGE_SIZE); break; case RES_RSVD_LIMIT: counter = &h_cg->rsvd_hugepage[idx]; - /* Fall through. 
*/ + fallthrough; case RES_LIMIT: val = (u64)counter->max; if (val == limit) @@ -514,7 +514,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of, switch (MEMFILE_ATTR(of_cft(of)->private)) { case RES_RSVD_LIMIT: rsvd = true; - /* Fall through. */ + fallthrough; case RES_LIMIT: mutex_lock(&hugetlb_limit_mutex); ret = page_counter_set_max( diff --git a/mm/ksm.c b/mm/ksm.c index 363ec8189561..a558da9e7177 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2813,8 +2813,7 @@ static int ksm_memory_callback(struct notifier_block *self, */ ksm_check_stable_tree(mn->start_pfn, mn->start_pfn + mn->nr_pages); - /* fallthrough */ - + fallthrough; case MEM_CANCEL_OFFLINE: mutex_lock(&ksm_thread_mutex); ksm_run &= ~KSM_RUN_OFFLINE; diff --git a/mm/list_lru.c b/mm/list_lru.c index 8de5e3784ee4..4d5294c39bba 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -223,7 +223,7 @@ restart: switch (ret) { case LRU_REMOVED_RETRY: assert_spin_locked(&nlru->lock); - /* fall through */ + fallthrough; case LRU_REMOVED: isolated++; nlru->nr_items--; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9965a77d53e5..05b4ec2c6499 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5813,7 +5813,7 @@ retry: switch (get_mctgt_type(vma, addr, ptent, &target)) { case MC_TARGET_DEVICE: device = true; - /* fall through */ + fallthrough; case MC_TARGET_PAGE: page = target.page; /* diff --git a/mm/mempolicy.c b/mm/mempolicy.c index c4118bb508ec..b4a039d779b0 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -881,7 +881,6 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) switch (p->mode) { case MPOL_BIND: - /* Fall through */ case MPOL_INTERLEAVE: *nodes = p->v.nodes; break; @@ -2066,7 +2065,6 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) break; case MPOL_BIND: - /* Fall through */ case MPOL_INTERLEAVE: *mask = mempolicy->v.nodes; break; @@ -2333,7 +2331,6 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) switch (a->mode) { case MPOL_BIND: - /* Fall through */ case MPOL_INTERLEAVE: return !!nodes_equal(a->v.nodes, b->v.nodes); case MPOL_PREFERRED: diff --git a/mm/mmap.c b/mm/mmap.c index aa09429d5888..8d77dbbb80fe 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1460,7 +1460,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, * with MAP_SHARED to preserve backward compatibility. 
*/ flags &= LEGACY_MAP_MASK; - /* fall through */ + fallthrough; case MAP_SHARED_VALIDATE: if (flags & ~flags_mask) return -EOPNOTSUPP; @@ -1487,8 +1487,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); - - /* fall through */ + fallthrough; case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; diff --git a/mm/shmem.c b/mm/shmem.c index 8d0aba1cc155..d722eb830317 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3996,7 +3996,7 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) if (i_size >= HPAGE_PMD_SIZE && i_size >> PAGE_SHIFT >= off) return true; - /* fall through */ + fallthrough; case SHMEM_HUGE_ADVISE: /* TODO: implement fadvise() hints */ return (vma->vm_flags & VM_HUGEPAGE); diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 2aa2d524a343..2f836a2b993f 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -424,7 +424,7 @@ static void *zs_zpool_map(void *pool, unsigned long handle, case ZPOOL_MM_WO: zs_mm = ZS_MM_WO; break; - case ZPOOL_MM_RW: /* fall through */ + case ZPOOL_MM_RW: default: zs_mm = ZS_MM_RW; break; From 3f3673d7d324d872d9d8ddb73b3e5e47fbf12e0d Mon Sep 17 00:00:00 2001 From: Steven Price Date: Mon, 6 Apr 2020 20:08:43 -0700 Subject: [PATCH 089/158] include/linux/swapops.h: correct guards for non_swap_entry() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor CONFIG_MIGRATION is, then non_swap_entry() will return 0, meaning that the condition (non_swap_entry(entry) && is_device_private_entry(entry)) in zap_pte_range() will never be true even if the entry is a device private one. Equally, any other code depending on non_swap_entry() will not function as expected. I originally spotted this just by looking at the code; I haven't actually observed any problems.
Looking a bit more closely, it appears that this situation (currently at least) cannot occur: DEVICE_PRIVATE depends on ZONE_DEVICE, ZONE_DEVICE depends on MEMORY_HOTREMOVE, and MEMORY_HOTREMOVE depends on MIGRATION. Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory") Signed-off-by: Steven Price Signed-off-by: Andrew Morton Cc: Jérôme Glisse Cc: Arnd Bergmann Cc: Dan Williams Cc: John Hubbard Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com Signed-off-by: Linus Torvalds --- include/linux/swapops.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 9a6f06de183b..d9b7c9132c2f 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -350,7 +350,8 @@ static inline void num_poisoned_pages_inc(void) } #endif -#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) +#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) || \ + defined(CONFIG_DEVICE_PRIVATE) static inline int non_swap_entry(swp_entry_t entry) { return swp_type(entry) >= MAX_SWAPFILES; From 1d90b6491014ead775146726b81a78ed993c3188 Mon Sep 17 00:00:00 2001 From: Ira Weiny Date: Mon, 6 Apr 2020 20:08:46 -0700 Subject: [PATCH 090/158] include/linux/memremap.h: remove stale comments Fixes: 80a72d0af05a ("memremap: remove the data field in struct dev_pagemap") Fixes: fdc029b19dfd ("memremap: remove the dev field in struct dev_pagemap") Signed-off-by: Ira Weiny Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Cc: Jason Gunthorpe Cc: Dan Williams Link: http://lkml.kernel.org/r/20200316213205.145333-1-ira.weiny@intel.com Signed-off-by: Linus Torvalds --- include/linux/memremap.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 60d97e8fd3c0..8b37c4c9222c 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -98,8 +98,6 @@ struct dev_pagemap_ops { * @ref: reference count that pins the devm_memremap_pages() mapping * @internal_ref: internal reference if @ref is not provided by the caller * @done: completion for @internal_ref - * @dev: host device of the mapping for debug - * @data: private data pointer for page_free() * @type: memory type: see MEMORY_* in memory_hotplug.h * @flags: PGMAP_* flags to specify defailed behavior * @ops: method table From 1386f7a3bfa6245a69f1fa39786be54a73889ff7 Mon Sep 17 00:00:00 2001 From: Mateusz Nosek Date: Mon, 6 Apr 2020 20:08:49 -0700 Subject: [PATCH 091/158] mm/dmapool.c: micro-optimisation remove unnecessary branch Previously there was a check whether 'size' is aligned to 'align', and if not, it was aligned. This check was expensive, as both branching and division are expensive instructions on most architectures. The ALIGN() macro leaves an already-aligned value unchanged, and since it is cheaper than a branch plus a division, it can be executed unconditionally and the branch removed.
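As an illustrative sketch (not part of the patch): for the power-of-two alignments dma_pool_create() accepts, ALIGN() boils down to a round-up mask, so an already-aligned size passes through unchanged:

	/* simplified model of the kernel's ALIGN() for a power-of-two 'a' */
	#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

	ALIGN(256, 64);	/* -> 256: already aligned, value unchanged */
	ALIGN(250, 64);	/* -> 256: rounded up, no branch needed */

This is why the unconditional ALIGN() is equivalent to the old branch-plus-modulo sequence.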
Signed-off-by: Mateusz Nosek Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: http://lkml.kernel.org/r/20200320173317.26408-1-mateusznosek0@gmail.com Signed-off-by: Linus Torvalds --- mm/dmapool.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index fe5d33060415..f9fb9bbd733e 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -144,9 +144,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, else if (size < 4) size = 4; - if ((size % align) != 0) - size = ALIGN(size, align); - + size = ALIGN(size, align); allocation = max_t(size_t, size, PAGE_SIZE); if (!boundary) From 6218d740ac1bc723d57900b865d3e52b83550c2b Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 6 Apr 2020 20:08:52 -0700 Subject: [PATCH 092/158] mm: remove dummy struct bootmem_data/bootmem_data_t Both bootmem_data and bootmem_data_t structures are no longer defined. Remove the dummy forward declarations. Signed-off-by: Waiman Long Signed-off-by: Andrew Morton Reviewed-by: Baoquan He Acked-by: Mike Rapoport Link: http://lkml.kernel.org/r/20200326022617.26208-1-longman@redhat.com Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/mmzone.h | 2 -- include/linux/mmzone.h | 1 - 2 files changed, 3 deletions(-) diff --git a/arch/alpha/include/asm/mmzone.h b/arch/alpha/include/asm/mmzone.h index 7ee144f484f1..9b521c857436 100644 --- a/arch/alpha/include/asm/mmzone.h +++ b/arch/alpha/include/asm/mmzone.h @@ -8,8 +8,6 @@ #include -struct bootmem_data_t; /* stupid forward decl. */ - /* * Following are macros that are specific to this numa platform. */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index f3f264826423..e9892bf9eba9 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -664,7 +664,6 @@ struct deferred_split { * Memory statistics and page replacement data structures are maintained on a * per-zone basis. */ -struct bootmem_data; typedef struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; struct zonelist node_zonelists[MAX_ZONELISTS]; From 904f394e2e9fd362dc78d58646b85c4cf6c87cac Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 6 Apr 2020 20:08:55 -0700 Subject: [PATCH 093/158] fs/proc/inode.c: annotate close_pdeo() for sparse Fix sparse locking imbalance warning: warning: context imbalance in close_pdeo() - unexpected unlock Signed-off-by: Jules Irenge Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2 Signed-off-by: Linus Torvalds --- fs/proc/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 1e730ea1dcd6..05d31c464bee 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -202,6 +202,7 @@ static void unuse_pde(struct proc_dir_entry *pde) /* pde is locked on entry, unlocked on exit */ static void close_pdeo(struct proc_dir_entry *pde, struct pde_opener *pdeo) + __releases(&pde->pde_unload_lock) { /* * close() (proc_reg_release()) can't delete an entry and proceed: From d919b33dafb3e222d23671b2bb06d119aede625f Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:09:01 -0700 Subject: [PATCH 094/158] proc: faster open/read/close with "permanent" files Now that "struct proc_ops" exists, we can start putting there stuff which could not fly with the VFS "struct file_operations"... Most of the fs/proc/inode.c file is dedicated to making open/read/.../close reliable in the event of disappearing /proc entries, which usually happens when a module is removed.
Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable the "permanent" flag for: /proc/cpuinfo, /proc/kmsg, /proc/modules, /proc/slabinfo, /proc/stat, /proc/sysvipc/*, /proc/swaps. More will come once I figure out a foolproof way to prevent module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: the benchmark is "read /proc/cpuinfo R times by N threads scattered over the system".

	N	R	t, s (before)	t, s (after)
	-----------------------------------------------------
	64	4096	1.582458	1.530502	-3.2%
	256	4096	6.371926	6.125168	-3.9%
	1024	4096	25.64888	24.47528	-4.6%

Benchmark source:

	#include #include #include #include #include #include #include #include

	const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
	int N;
	const char *filename;
	int R;
	int xxx = 0;

	int glue(int n)
	{
		cpu_set_t m;
		CPU_ZERO(&m);
		CPU_SET(n, &m);
		return sched_setaffinity(0, sizeof(cpu_set_t), &m);
	}

	void f(int n)
	{
		glue(n % NR_CPUS);
		while (*(volatile int *)&xxx == 0) {
		}
		for (int i = 0; i < R; i++) {
			int fd = open(filename, O_RDONLY);
			char buf[4096];
			ssize_t rv = read(fd, buf, sizeof(buf));
			asm volatile ("" :: "g" (rv));
			close(fd);
		}
	}

	int main(int argc, char *argv[])
	{
		if (argc < 4) {
			std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R ";
			return 1;
		}
		N = atoi(argv[1]);
		filename = argv[2];
		R = atoi(argv[3]);
		for (int i = 0; i < NR_CPUS; i++) {
			if (glue(i) == 0)
				break;
		}
		std::vector T;
		T.reserve(N);
		for (int i = 0; i < N; i++) {
			T.emplace_back(f, i);
		}
		auto t0 = std::chrono::system_clock::now();
		{
			*(volatile int *)&xxx = 1;
			for (auto& t: T) {
				t.join();
			}
		}
		auto t1 = std::chrono::system_clock::now();
		std::chrono::duration dt = t1 - t0;
		std::cout << dt.count() << ' ';
		return 0;
	}

P.S.: An explicit randomization marker is added because adding a non-function pointer would silently disable structure layout randomization.
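For reference, a built-in /proc file opts in by setting the new flag in its proc_ops. A minimal sketch (the "foo" names are hypothetical; the flag and helpers are from this patch):

	static const struct proc_ops foo_proc_ops = {
		.proc_flags	= PROC_ENTRY_PERMANENT,	/* entry is never removed */
		.proc_open	= foo_open,
		.proc_read	= seq_read,
		.proc_lseek	= seq_lseek,
		.proc_release	= seq_release,
	};

	proc_create("foo", 0444, NULL, &foo_proc_ops);

Note that under MODULE the flag evaluates to 0, so modular code cannot (accidentally or otherwise) mark its entries permanent.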
[akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot Reported-by: Dan Carpenter Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Cc: Al Viro Cc: Joe Perches Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds --- fs/proc/cpuinfo.c | 1 + fs/proc/generic.c | 31 ++++++- fs/proc/inode.c | 187 +++++++++++++++++++++++++++++----------- fs/proc/internal.h | 6 ++ fs/proc/kmsg.c | 1 + fs/proc/stat.c | 1 + include/linux/proc_fs.h | 17 +++- ipc/util.c | 1 + kernel/module.c | 1 + mm/slab_common.c | 1 + mm/swapfile.c | 1 + 11 files changed, 194 insertions(+), 54 deletions(-) diff --git a/fs/proc/cpuinfo.c b/fs/proc/cpuinfo.c index c1dea9b8222e..d0989a443c77 100644 --- a/fs/proc/cpuinfo.c +++ b/fs/proc/cpuinfo.c @@ -17,6 +17,7 @@ static int cpuinfo_open(struct inode *inode, struct file *file) } static const struct proc_ops cpuinfo_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = cpuinfo_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 3faed94e4b65..4ed6dabdf6ff 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -531,6 +531,12 @@ struct proc_dir_entry *proc_create_reg(const char *name, umode_t mode, return p; } +static inline void pde_set_flags(struct proc_dir_entry *pde) +{ + if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT) + pde->flags |= PROC_ENTRY_PERMANENT; +} + struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct proc_ops *proc_ops, void *data) @@ -541,6 +547,7 @@ struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, if (!p) return NULL; p->proc_ops = proc_ops; + pde_set_flags(p); return proc_register(parent, p); } EXPORT_SYMBOL(proc_create_data); @@ -572,6 +579,7 @@ static int proc_seq_release(struct inode *inode, struct file *file) } static const struct proc_ops proc_seq_ops = { + /* not permanent -- can call into arbitrary seq_operations */ .proc_open = proc_seq_open, .proc_read = seq_read, .proc_lseek = seq_lseek, @@ -602,6 +610,7 @@ static int proc_single_open(struct inode *inode, struct file *file) } static const struct proc_ops proc_single_ops = { + /* not permanent -- can call into arbitrary ->single_show */ .proc_open = proc_single_open, .proc_read = seq_read, .proc_lseek = seq_lseek, @@ -662,9 +671,13 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent) de = pde_subdir_find(parent, fn, len); if (de) { - rb_erase(&de->subdir_node, &parent->subdir); - if (S_ISDIR(de->mode)) { - parent->nlink--; + if (unlikely(pde_is_permanent(de))) { + WARN(1, "removing permanent /proc entry '%s'", de->name); + de = NULL; + } else { + rb_erase(&de->subdir_node, &parent->subdir); + if (S_ISDIR(de->mode)) + parent->nlink--; } } write_unlock(&proc_subdir_lock); @@ -700,12 +713,24 @@ int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) write_unlock(&proc_subdir_lock); return -ENOENT; } + if (unlikely(pde_is_permanent(root))) { + write_unlock(&proc_subdir_lock); + WARN(1, "removing permanent /proc entry '%s/%s'", + root->parent->name, root->name); + return -EINVAL; + } rb_erase(&root->subdir_node, &parent->subdir); de = root; while (1) { next = pde_subdir_first(de); if (next) { + if (unlikely(pde_is_permanent(root))) { + write_unlock(&proc_subdir_lock); + WARN(1, "removing permanent /proc entry '%s/%s'", + next->parent->name, next->name); + return -EINVAL; + } rb_erase(&next->subdir_node, &de->subdir); de = next; continue; diff 
--git a/fs/proc/inode.c b/fs/proc/inode.c index 05d31c464bee..fb4cace9ea41 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -259,114 +259,192 @@ void proc_entry_rundown(struct proc_dir_entry *de) spin_unlock(&de->pde_unload_lock); } +static loff_t pde_lseek(struct proc_dir_entry *pde, struct file *file, loff_t offset, int whence) +{ + typeof_member(struct proc_ops, proc_lseek) lseek; + + lseek = pde->proc_ops->proc_lseek; + if (!lseek) + lseek = default_llseek; + return lseek(file, offset, whence); +} + static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence) { struct proc_dir_entry *pde = PDE(file_inode(file)); loff_t rv = -EINVAL; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_lseek) lseek; - lseek = pde->proc_ops->proc_lseek; - if (!lseek) - lseek = default_llseek; - rv = lseek(file, offset, whence); + if (pde_is_permanent(pde)) { + return pde_lseek(pde, file, offset, whence); + } else if (use_pde(pde)) { + rv = pde_lseek(pde, file, offset, whence); unuse_pde(pde); } return rv; } +static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos) +{ + typeof_member(struct proc_ops, proc_read) read; + + read = pde->proc_ops->proc_read; + if (read) + return read(file, buf, count, ppos); + return -EIO; +} + static ssize_t proc_reg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct proc_dir_entry *pde = PDE(file_inode(file)); ssize_t rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_read) read; - read = pde->proc_ops->proc_read; - if (read) - rv = read(file, buf, count, ppos); + if (pde_is_permanent(pde)) { + return pde_read(pde, file, buf, count, ppos); + } else if (use_pde(pde)) { + rv = pde_read(pde, file, buf, count, ppos); unuse_pde(pde); } return rv; } +static ssize_t pde_write(struct proc_dir_entry *pde, struct file *file, const char __user *buf, size_t count, loff_t *ppos) +{ + typeof_member(struct proc_ops, proc_write) write; + + write = pde->proc_ops->proc_write; + if (write) + return write(file, buf, count, ppos); + return -EIO; +} + static ssize_t proc_reg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { struct proc_dir_entry *pde = PDE(file_inode(file)); ssize_t rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_write) write; - write = pde->proc_ops->proc_write; - if (write) - rv = write(file, buf, count, ppos); + if (pde_is_permanent(pde)) { + return pde_write(pde, file, buf, count, ppos); + } else if (use_pde(pde)) { + rv = pde_write(pde, file, buf, count, ppos); unuse_pde(pde); } return rv; } +static __poll_t pde_poll(struct proc_dir_entry *pde, struct file *file, struct poll_table_struct *pts) +{ + typeof_member(struct proc_ops, proc_poll) poll; + + poll = pde->proc_ops->proc_poll; + if (poll) + return poll(file, pts); + return DEFAULT_POLLMASK; +} + static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts) { struct proc_dir_entry *pde = PDE(file_inode(file)); __poll_t rv = DEFAULT_POLLMASK; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_poll) poll; - poll = pde->proc_ops->proc_poll; - if (poll) - rv = poll(file, pts); + if (pde_is_permanent(pde)) { + return pde_poll(pde, file, pts); + } else if (use_pde(pde)) { + rv = pde_poll(pde, file, pts); unuse_pde(pde); } return rv; } +static long pde_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg) +{ + typeof_member(struct proc_ops, proc_ioctl) ioctl; + + ioctl = 
pde->proc_ops->proc_ioctl; + if (ioctl) + return ioctl(file, cmd, arg); + return -ENOTTY; +} + static long proc_reg_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct proc_dir_entry *pde = PDE(file_inode(file)); long rv = -ENOTTY; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_ioctl) ioctl; - ioctl = pde->proc_ops->proc_ioctl; - if (ioctl) - rv = ioctl(file, cmd, arg); + if (pde_is_permanent(pde)) { + return pde_ioctl(pde, file, cmd, arg); + } else if (use_pde(pde)) { + rv = pde_ioctl(pde, file, cmd, arg); unuse_pde(pde); } return rv; } #ifdef CONFIG_COMPAT +static long pde_compat_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg) +{ + typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl; + + compat_ioctl = pde->proc_ops->proc_compat_ioctl; + if (compat_ioctl) + return compat_ioctl(file, cmd, arg); + return -ENOTTY; +} + static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct proc_dir_entry *pde = PDE(file_inode(file)); long rv = -ENOTTY; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl; - - compat_ioctl = pde->proc_ops->proc_compat_ioctl; - if (compat_ioctl) - rv = compat_ioctl(file, cmd, arg); + if (pde_is_permanent(pde)) { + return pde_compat_ioctl(pde, file, cmd, arg); + } else if (use_pde(pde)) { + rv = pde_compat_ioctl(pde, file, cmd, arg); unuse_pde(pde); } return rv; } #endif +static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma) +{ + typeof_member(struct proc_ops, proc_mmap) mmap; + + mmap = pde->proc_ops->proc_mmap; + if (mmap) + return mmap(file, vma); + return -EIO; +} + static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma) { struct proc_dir_entry *pde = PDE(file_inode(file)); int rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_mmap) mmap; - mmap = pde->proc_ops->proc_mmap; - if (mmap) - rv = mmap(file, vma); + if (pde_is_permanent(pde)) { + return pde_mmap(pde, file, vma); + } else if (use_pde(pde)) { + rv = pde_mmap(pde, file, vma); unuse_pde(pde); } return rv; } +static unsigned long +pde_get_unmapped_area(struct proc_dir_entry *pde, struct file *file, unsigned long orig_addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + typeof_member(struct proc_ops, proc_get_unmapped_area) get_area; + + get_area = pde->proc_ops->proc_get_unmapped_area; +#ifdef CONFIG_MMU + if (!get_area) + get_area = current->mm->get_unmapped_area; +#endif + if (get_area) + return get_area(file, orig_addr, len, pgoff, flags); + return orig_addr; +} + static unsigned long proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr, unsigned long len, unsigned long pgoff, @@ -375,19 +453,10 @@ proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr, struct proc_dir_entry *pde = PDE(file_inode(file)); unsigned long rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_get_unmapped_area) get_area; - - get_area = pde->proc_ops->proc_get_unmapped_area; -#ifdef CONFIG_MMU - if (!get_area) - get_area = current->mm->get_unmapped_area; -#endif - - if (get_area) - rv = get_area(file, orig_addr, len, pgoff, flags); - else - rv = orig_addr; + if (pde_is_permanent(pde)) { + return pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags); + } else if (use_pde(pde)) { + rv = pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags); unuse_pde(pde); } return rv; @@ -401,6 +470,13 @@ static int 
proc_reg_open(struct inode *inode, struct file *file) typeof_member(struct proc_ops, proc_release) release; struct pde_opener *pdeo; + if (pde_is_permanent(pde)) { + open = pde->proc_ops->proc_open; + if (open) + rv = open(inode, file); + return rv; + } + /* * Ensure that * 1) PDE's ->release hook will be called no matter what @@ -450,6 +526,17 @@ static int proc_reg_release(struct inode *inode, struct file *file) { struct proc_dir_entry *pde = PDE(inode); struct pde_opener *pdeo; + + if (pde_is_permanent(pde)) { + typeof_member(struct proc_ops, proc_release) release; + + release = pde->proc_ops->proc_release; + if (release) { + return release(inode, file); + } + return 0; + } + spin_lock(&pde->pde_unload_lock); list_for_each_entry(pdeo, &pde->pde_openers, lh) { if (pdeo->file == file) { diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 9e294f0290e5..917cc85e3466 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -61,6 +61,7 @@ struct proc_dir_entry { struct rb_node subdir_node; char *name; umode_t mode; + u8 flags; u8 namelen; char inline_name[]; } __randomize_layout; @@ -73,6 +74,11 @@ struct proc_dir_entry { 0) #define SIZEOF_PDE_INLINE_NAME (SIZEOF_PDE - sizeof(struct proc_dir_entry)) +static inline bool pde_is_permanent(const struct proc_dir_entry *pde) +{ + return pde->flags & PROC_ENTRY_PERMANENT; +} + extern struct kmem_cache *proc_dir_entry_cache; void pde_free(struct proc_dir_entry *pde); diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c index ec1b7d2fb773..b38ad552887f 100644 --- a/fs/proc/kmsg.c +++ b/fs/proc/kmsg.c @@ -50,6 +50,7 @@ static __poll_t kmsg_poll(struct file *file, poll_table *wait) static const struct proc_ops kmsg_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_read = kmsg_read, .proc_poll = kmsg_poll, .proc_open = kmsg_open, diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 0449edf460f5..46b3293015fe 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -224,6 +224,7 @@ static int stat_open(struct inode *inode, struct file *file) } static const struct proc_ops stat_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = stat_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 40a7982b7285..45c05fd9c99d 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -5,6 +5,7 @@ #ifndef _LINUX_PROC_FS_H #define _LINUX_PROC_FS_H +#include #include #include @@ -12,7 +13,21 @@ struct proc_dir_entry; struct seq_file; struct seq_operations; +enum { + /* + * All /proc entries using this ->proc_ops instance are never removed. + * + * If in doubt, ignore this flag. 
+ */ +#ifdef MODULE + PROC_ENTRY_PERMANENT = 0U, +#else + PROC_ENTRY_PERMANENT = 1U << 0, +#endif +}; struct proc_ops { + unsigned int proc_flags; int (*proc_open)(struct inode *, struct file *); ssize_t (*proc_read)(struct file *, char __user *, size_t, loff_t *); ssize_t (*proc_write)(struct file *, const char __user *, size_t, loff_t *); @@ -25,7 +40,7 @@ struct proc_ops { #endif int (*proc_mmap)(struct file *, struct vm_area_struct *); unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); -}; +} __randomize_layout; #ifdef CONFIG_PROC_FS diff --git a/ipc/util.c b/ipc/util.c index fe61df53775a..97638eb2d7cb 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -885,6 +885,7 @@ static int sysvipc_proc_release(struct inode *inode, struct file *file) } static const struct proc_ops sysvipc_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = sysvipc_proc_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/kernel/module.c b/kernel/module.c index 33569a01d6e1..3447f3b74870 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -4355,6 +4355,7 @@ static int modules_open(struct inode *inode, struct file *file) } static const struct proc_ops modules_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = modules_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/mm/slab_common.c b/mm/slab_common.c index 5282f881d2f5..93ec4a574d8d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1581,6 +1581,7 @@ static int slabinfo_open(struct inode *inode, struct file *file) } static const struct proc_ops slabinfo_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = slabinfo_open, .proc_read = seq_read, .proc_write = slabinfo_write, diff --git a/mm/swapfile.c b/mm/swapfile.c index 273a923c275c..5871a2aa86a5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2797,6 +2797,7 @@ static int swaps_open(struct inode *inode, struct file *file) } static const struct proc_ops swaps_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = swaps_open, .proc_read = seq_read, .proc_lseek = seq_lseek, From 5c5ab9714c2225d50119e397c537ad5e568f268b Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:09:05 -0700 Subject: [PATCH 095/158] proc: speed up /proc/*/statm top(1) reads all /proc/*/statm files but kernel threads will always have zeros. Print those zeroes directly without going through seq_put_decimal_ull(). This speeds up reading /proc/2/statm (which is kthreadd) by about 3%. My system has more kernel threads than normal processes after booting KDE.
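The effect is easy to see on a kernel thread, where get_task_mm() returns NULL and the whole line is constant (illustrative output; PID 2 is kthreadd, as noted above):

	$ cat /proc/2/statm
	0 0 0 0 0 0 0

so the patch emits the whole line with a single seq_write() of a 14-byte literal instead of seven seq_put_decimal_ull() calls.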
Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2 Signed-off-by: Linus Torvalds --- fs/proc/array.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/fs/proc/array.c b/fs/proc/array.c index 5efaf3708ec6..8e16f14bb05a 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -635,28 +635,35 @@ int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns, int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0; struct mm_struct *mm = get_task_mm(task); if (mm) { + unsigned long size; + unsigned long resident = 0; + unsigned long shared = 0; + unsigned long text = 0; + unsigned long data = 0; + size = task_statm(mm, &shared, &text, &data, &resident); mmput(mm); - } - /* - * For quick read, open code by putting numbers directly - * expected format is - * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0\n", - * size, resident, shared, text, data); - */ - seq_put_decimal_ull(m, "", size); - seq_put_decimal_ull(m, " ", resident); - seq_put_decimal_ull(m, " ", shared); - seq_put_decimal_ull(m, " ", text); - seq_put_decimal_ull(m, " ", 0); - seq_put_decimal_ull(m, " ", data); - seq_put_decimal_ull(m, " ", 0); - seq_putc(m, '\n'); + /* + * For quick read, open code by putting numbers directly + * expected format is + * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0\n", + * size, resident, shared, text, data); + */ + seq_put_decimal_ull(m, "", size); + seq_put_decimal_ull(m, " ", resident); + seq_put_decimal_ull(m, " ", shared); + seq_put_decimal_ull(m, " ", text); + seq_put_decimal_ull(m, " ", 0); + seq_put_decimal_ull(m, " ", data); + seq_put_decimal_ull(m, " ", 0); + seq_putc(m, '\n'); + } else { + seq_write(m, "0 0 0 0 0 0 0\n", 14); + } return 0; } From d07ded611e46d276d41dafcbd6abc9dd7ab51a7f Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:09:08 -0700 Subject: [PATCH 096/158] proc: inline vma_stop into m_stop Instead of calling vma_stop() from m_start() and m_next(), do its work in m_stop(). 
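This is safe because the seq_file core always pairs ->stop() with a preceding ->start(), even when iteration ends early; the read loop is roughly (simplified sketch of fs/seq_file.c, not part of this patch):

	p = m->op->start(m, &m->index);
	while (p) {
		/* ->show() and buffer handling elided */
		p = m->op->next(m, p, &m->index);
	}
	m->op->stop(m, p);

so releasing the mmap_sem, mm and task references in m_stop() covers every path on which m_start() took them.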
Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 34 +++++++++++++++------------------- 1 file changed, 15 insertions(+), 19 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3ba9ae83bff5..4b19a46d351f 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -123,15 +123,6 @@ static void release_task_mempolicy(struct proc_maps_private *priv) } #endif -static void vma_stop(struct proc_maps_private *priv) -{ - struct mm_struct *mm = priv->mm; - - release_task_mempolicy(priv); - up_read(&mm->mmap_sem); - mmput(mm); -} - static struct vm_area_struct * m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) { @@ -163,11 +154,16 @@ static void *m_start(struct seq_file *m, loff_t *ppos) return ERR_PTR(-ESRCH); mm = priv->mm; - if (!mm || !mmget_not_zero(mm)) + if (!mm || !mmget_not_zero(mm)) { + put_task_struct(priv->task); + priv->task = NULL; return NULL; + } if (down_read_killable(&mm->mmap_sem)) { mmput(mm); + put_task_struct(priv->task); + priv->task = NULL; return ERR_PTR(-EINTR); } @@ -195,7 +191,6 @@ static void *m_start(struct seq_file *m, loff_t *ppos) if (pos == mm->map_count && priv->tail_vma) return priv->tail_vma; - vma_stop(priv); return NULL; } @@ -206,21 +201,22 @@ static void *m_next(struct seq_file *m, void *v, loff_t *pos) (*pos)++; next = m_next_vma(priv, v); - if (!next) - vma_stop(priv); return next; } static void m_stop(struct seq_file *m, void *v) { struct proc_maps_private *priv = m->private; + struct mm_struct *mm = priv->mm; - if (!IS_ERR_OR_NULL(v)) - vma_stop(priv); - if (priv->task) { - put_task_struct(priv->task); - priv->task = NULL; - } + if (!priv->task) + return; + + release_task_mempolicy(priv); + up_read(&mm->mmap_sem); + mmput(mm); + put_task_struct(priv->task); + priv->task = NULL; } static int proc_maps_open(struct inode *inode, struct file *file, From c2e88d22e8ea698c183c6c36485c134742b99838 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:09:11 -0700 Subject: [PATCH 097/158] proc: remove m_cache_vma Instead of setting m->version in the show method, set it in m_next(), where it should be. Also remove the fallback code for failing to find a vma, or version being zero. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 38 ++++++-------------------------------- 1 file changed, 6 insertions(+), 32 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 4b19a46d351f..981085b32708 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -131,21 +131,14 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) return vma->vm_next ?: priv->tail_vma; } -static void m_cache_vma(struct seq_file *m, struct vm_area_struct *vma) -{ - if (m->count < m->size) /* vma is copied successfully */ - m->version = m_next_vma(m->private, vma) ? vma->vm_end : -1UL; -} - static void *m_start(struct seq_file *m, loff_t *ppos) { struct proc_maps_private *priv = m->private; unsigned long last_addr = m->version; struct mm_struct *mm; struct vm_area_struct *vma; - unsigned int pos = *ppos; - /* See m_cache_vma(). Zero at the start or after lseek. */ + /* See m_next(). Zero at the start or after lseek. 
*/ if (last_addr == -1UL) return NULL; @@ -170,28 +163,11 @@ static void *m_start(struct seq_file *m, loff_t *ppos) hold_task_mempolicy(priv); priv->tail_vma = get_gate_vma(mm); - if (last_addr) { - vma = find_vma(mm, last_addr - 1); - if (vma && vma->vm_start <= last_addr) - vma = m_next_vma(priv, vma); - if (vma) - return vma; - } - - m->version = 0; - if (pos < mm->map_count) { - for (vma = mm->mmap; pos; pos--) { - m->version = vma->vm_start; - vma = vma->vm_next; - } + vma = find_vma(mm, last_addr); + if (vma) return vma; - } - /* we do not bother to update m->version in this case */ - if (pos == mm->map_count && priv->tail_vma) - return priv->tail_vma; - - return NULL; + return priv->tail_vma; } static void *m_next(struct seq_file *m, void *v, loff_t *pos) @@ -201,6 +177,8 @@ static void *m_next(struct seq_file *m, void *v, loff_t *pos) (*pos)++; next = m_next_vma(priv, v); + m->version = next ? next->vm_start : -1UL; + return next; } @@ -359,7 +337,6 @@ done: static int show_map(struct seq_file *m, void *v) { show_map_vma(m, v); - m_cache_vma(m, v); return 0; } @@ -843,8 +820,6 @@ static int show_smap(struct seq_file *m, void *v) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); show_smap_vma_flags(m, vma); - m_cache_vma(m, vma); - return 0; } @@ -1883,7 +1858,6 @@ static int show_numa_map(struct seq_file *m, void *v) seq_printf(m, " kernelpagesize_kB=%lu", vma_kernel_pagesize(vma) >> 10); out: seq_putc(m, '\n'); - m_cache_vma(m, vma); return 0; } From 4781f2c3abddc400cfdb6ed18c2bf655d402efb5 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:09:14 -0700 Subject: [PATCH 098/158] proc: use ppos instead of m->version The ppos is a private cursor, just like m->version. Use the canonical cursor, not a special one. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 981085b32708..9419c6972fcd 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -134,7 +134,7 @@ m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) static void *m_start(struct seq_file *m, loff_t *ppos) { struct proc_maps_private *priv = m->private; - unsigned long last_addr = m->version; + unsigned long last_addr = *ppos; struct mm_struct *mm; struct vm_area_struct *vma; @@ -170,14 +170,13 @@ static void *m_start(struct seq_file *m, loff_t *ppos) return priv->tail_vma; } -static void *m_next(struct seq_file *m, void *v, loff_t *pos) +static void *m_next(struct seq_file *m, void *v, loff_t *ppos) { struct proc_maps_private *priv = m->private; struct vm_area_struct *next; - (*pos)++; next = m_next_vma(priv, v); - m->version = next ? next->vm_start : -1UL; + *ppos = next ? next->vm_start : -1UL; return next; } From b829a0f0f2f2094c1e40637259c44b854e6ebe96 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:09:17 -0700 Subject: [PATCH 099/158] seq_file: remove m->version The process maps file was the only user of version (introduced back in 2005). Now that it uses ppos instead, we can remove it. 
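The cursor now lives entirely in *ppos: m_start() resumes from it and m_next() stores the next position, condensed from the fs/proc/task_mmu.c hunks above:

	/* in m_start(): resume where the last read left off */
	unsigned long last_addr = *ppos;	/* -1UL means EOF */
	vma = find_vma(mm, last_addr);

	/* in m_next(): record where the next read should resume */
	*ppos = next ? next->vm_start : -1UL;

leaving no user of m->version.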
Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com Signed-off-by: Linus Torvalds --- fs/seq_file.c | 28 ---------------------------- include/linux/seq_file.h | 1 - 2 files changed, 29 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 1600034a929b..79781ebd2145 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -67,13 +67,6 @@ int seq_open(struct file *file, const struct seq_operations *op) // to the lifetime of the file. p->file = file; - /* - * Wrappers around seq_open(e.g. swaps_open) need to be - * aware of this. If they set f_version themselves, they - * should call seq_open first and then set f_version. - */ - file->f_version = 0; - /* * seq_files support lseek() and pread(). They do not implement * write() at all, but we clear FMODE_PWRITE here for historical @@ -94,7 +87,6 @@ static int traverse(struct seq_file *m, loff_t offset) int error = 0; void *p; - m->version = 0; m->index = 0; m->count = m->from = 0; if (!offset) @@ -160,26 +152,12 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) mutex_lock(&m->lock); - /* - * seq_file->op->..m_start/m_stop/m_next may do special actions - * or optimisations based on the file->f_version, so we want to - * pass the file->f_version to those methods. - * - * seq_file->version is just copy of f_version, and seq_file - * methods can treat it simply as file version. - * It is copied in first and copied out after all operations. - * It is convenient to have it as part of structure to avoid the - * need of passing another argument to all the seq_file methods. - */ - m->version = file->f_version; - /* * if request is to read from zero offset, reset iterator to first * record as it might have been already advanced by previous requests */ if (*ppos == 0) { m->index = 0; - m->version = 0; m->count = 0; } @@ -190,7 +168,6 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) if (err) { /* With prejudice... */ m->read_pos = 0; - m->version = 0; m->index = 0; m->count = 0; goto Done; @@ -243,7 +220,6 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) m->buf = seq_buf_alloc(m->size <<= 1); if (!m->buf) goto Enomem; - m->version = 0; p = m->op->start(m, &m->index); } m->op->stop(m, p); @@ -287,7 +263,6 @@ Done: *ppos += copied; m->read_pos += copied; } - file->f_version = m->version; mutex_unlock(&m->lock); return copied; Enomem: @@ -313,7 +288,6 @@ loff_t seq_lseek(struct file *file, loff_t offset, int whence) loff_t retval = -EINVAL; mutex_lock(&m->lock); - m->version = file->f_version; switch (whence) { case SEEK_CUR: offset += file->f_pos; @@ -329,7 +303,6 @@ loff_t seq_lseek(struct file *file, loff_t offset, int whence) /* with extreme prejudice... 
*/ file->f_pos = 0; m->read_pos = 0; - m->version = 0; m->index = 0; m->count = 0; } else { @@ -340,7 +313,6 @@ loff_t seq_lseek(struct file *file, loff_t offset, int whence) file->f_pos = offset; } } - file->f_version = m->version; mutex_unlock(&m->lock); return retval; } diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 770c2bf3aa43..1672cf6f7614 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -21,7 +21,6 @@ struct seq_file { size_t pad_until; loff_t index; loff_t read_pos; - u64 version; struct mutex lock; const struct seq_operations *op; int poll_event; From fad955009c2ba49e2eac98926ad5cce48b777e7e Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Mon, 6 Apr 2020 20:09:20 -0700 Subject: [PATCH 100/158] proc: inline m_next_vma into m_next It's clearer to just put this inline. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.com Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 9419c6972fcd..8d382d4ec067 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -123,14 +123,6 @@ static void release_task_mempolicy(struct proc_maps_private *priv) } #endif -static struct vm_area_struct * -m_next_vma(struct proc_maps_private *priv, struct vm_area_struct *vma) -{ - if (vma == priv->tail_vma) - return NULL; - return vma->vm_next ?: priv->tail_vma; -} - static void *m_start(struct seq_file *m, loff_t *ppos) { struct proc_maps_private *priv = m->private; @@ -173,9 +165,15 @@ static void *m_start(struct seq_file *m, loff_t *ppos) static void *m_next(struct seq_file *m, void *v, loff_t *ppos) { struct proc_maps_private *priv = m->private; - struct vm_area_struct *next; + struct vm_area_struct *next, *vma = v; + + if (vma == priv->tail_vma) + next = NULL; + else if (vma->vm_next) + next = vma->vm_next; + else + next = priv->tail_vma; - next = m_next_vma(priv, v); *ppos = next ? next->vm_start : -1UL; return next; From 06e85c7e9a1c1356038936566fc23f7c0d363b96 Mon Sep 17 00:00:00 2001 From: Michal Simek Date: Mon, 6 Apr 2020 20:09:23 -0700 Subject: [PATCH 101/158] asm-generic: fix unistd_32.h generation format Generated files are also checked by sparse that's why add newline to remove sparse (C=1) warning. The issue was found on Microblaze and reported like this: ./arch/microblaze/include/generated/uapi/asm/unistd_32.h:438:45: warning: no newline at end of file Mips and PowerPC have it already but let's align with style used by m68k. Signed-off-by: Michal Simek Signed-off-by: Andrew Morton Reviewed-by: Stefan Asserhall Acked-by: Max Filippov (xtensa) Cc: Arnd Bergmann Cc: Max Filippov Cc: Benjamin Herrenschmidt Cc: Chris Zankel Cc: David S. 
Miller Cc: Fenghua Yu Cc: Helge Deller Cc: Ivan Kokshaysky Cc: James Bottomley Cc: Matt Turner Cc: Michael Ellerman Cc: Paul Burton Cc: Paul Mackerras Cc: Ralf Baechle Cc: Rich Felker Cc: Richard Henderson Cc: Tony Luck Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/4d32ab4e1fb2edb691d2e1687e8fb303c09fd023.1581504803.git.michal.simek@xilinx.com Signed-off-by: Linus Torvalds --- arch/alpha/kernel/syscalls/syscallhdr.sh | 2 +- arch/ia64/kernel/syscalls/syscallhdr.sh | 2 +- arch/microblaze/kernel/syscalls/syscallhdr.sh | 2 +- arch/mips/kernel/syscalls/syscallhdr.sh | 3 +-- arch/parisc/kernel/syscalls/syscallhdr.sh | 2 +- arch/powerpc/kernel/syscalls/syscallhdr.sh | 3 +-- arch/sh/kernel/syscalls/syscallhdr.sh | 2 +- arch/sparc/kernel/syscalls/syscallhdr.sh | 2 +- arch/xtensa/kernel/syscalls/syscallhdr.sh | 2 +- 9 files changed, 9 insertions(+), 11 deletions(-) diff --git a/arch/alpha/kernel/syscalls/syscallhdr.sh b/arch/alpha/kernel/syscalls/syscallhdr.sh index e5b99bd2e5e7..1780e861492a 100644 --- a/arch/alpha/kernel/syscalls/syscallhdr.sh +++ b/arch/alpha/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/ia64/kernel/syscalls/syscallhdr.sh b/arch/ia64/kernel/syscalls/syscallhdr.sh index 0c2d2c748565..f407b6e53283 100644 --- a/arch/ia64/kernel/syscalls/syscallhdr.sh +++ b/arch/ia64/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/microblaze/kernel/syscalls/syscallhdr.sh b/arch/microblaze/kernel/syscalls/syscallhdr.sh index 2e9062a926a3..a914854f8d9f 100644 --- a/arch/microblaze/kernel/syscalls/syscallhdr.sh +++ b/arch/microblaze/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/mips/kernel/syscalls/syscallhdr.sh b/arch/mips/kernel/syscalls/syscallhdr.sh index d2bcfa8f4d1a..2e241e713a7d 100644 --- a/arch/mips/kernel/syscalls/syscallhdr.sh +++ b/arch/mips/kernel/syscalls/syscallhdr.sh @@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" - printf "\n" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/parisc/kernel/syscalls/syscallhdr.sh b/arch/parisc/kernel/syscalls/syscallhdr.sh index 50242b747d7c..730db288fe54 100644 --- a/arch/parisc/kernel/syscalls/syscallhdr.sh +++ b/arch/parisc/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/powerpc/kernel/syscalls/syscallhdr.sh b/arch/powerpc/kernel/syscalls/syscallhdr.sh index c0a9a32937f1..02d6751f3be3 100644 --- a/arch/powerpc/kernel/syscalls/syscallhdr.sh +++ 
b/arch/powerpc/kernel/syscalls/syscallhdr.sh @@ -32,6 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" - printf "\n" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/sh/kernel/syscalls/syscallhdr.sh b/arch/sh/kernel/syscalls/syscallhdr.sh index 1de0334e577f..4c0519861e97 100644 --- a/arch/sh/kernel/syscalls/syscallhdr.sh +++ b/arch/sh/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/sparc/kernel/syscalls/syscallhdr.sh b/arch/sparc/kernel/syscalls/syscallhdr.sh index 626b5740a9f1..cf50a75cc0bb 100644 --- a/arch/sparc/kernel/syscalls/syscallhdr.sh +++ b/arch/sparc/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" diff --git a/arch/xtensa/kernel/syscalls/syscallhdr.sh b/arch/xtensa/kernel/syscalls/syscallhdr.sh index d37db641ca31..eebfb8a8ace6 100644 --- a/arch/xtensa/kernel/syscalls/syscallhdr.sh +++ b/arch/xtensa/kernel/syscalls/syscallhdr.sh @@ -32,5 +32,5 @@ grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | ( printf "#define __NR_syscalls\t%s\n" "${nxt}" printf "#endif\n" printf "\n" - printf "#endif /* %s */" "${fileguard}" + printf "#endif /* %s */\n" "${fileguard}" ) > "$out" From 63174f61dfaef58dc0e813eaf6602636794f8942 Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Mon, 6 Apr 2020 20:09:27 -0700 Subject: [PATCH 102/158] kernel/extable.c: use address-of operator on section symbols Clang warns: ../kernel/extable.c:37:52: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) { ^ 1 warning generated. These are not true arrays; they are linker-defined symbols, which are just addresses. Using the address-of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr).
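For context, these section markers are declared roughly like this (sketch of the usual declaration style, not part of the patch):

	extern struct exception_table_entry __start___ex_table[];
	extern struct exception_table_entry __stop___ex_table[];

Written as a bare array comparison, clang treats the two distinct array objects as having a constant relation and warns; comparing their addresses, &__start___ex_table and &__stop___ex_table, expresses the same link-time comparison without the array-to-pointer decay that triggers the diagnostic.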
Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://github.com/ClangBuiltLinux/linux/issues/892 Link: http://lkml.kernel.org/r/20200219202036.45702-1-natechancellor@gmail.com Signed-off-by: Linus Torvalds --- kernel/extable.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/extable.c b/kernel/extable.c index 7681f87e89dd..b0ea5eb0c3b4 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -34,7 +34,8 @@ u32 __initdata __visible main_extable_sort_needed = 1; /* Sort the kernel's built-in exception table */ void __init sort_main_extable(void) { - if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) { + if (main_extable_sort_needed && + &__stop___ex_table > &__start___ex_table) { pr_notice("Sorting __ex_table...\n"); sort_extable(__start___ex_table, __stop___ex_table); } From 12a5b00a5366467e95a25f55da751794e5fe6788 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Apr 2020 20:09:30 -0700 Subject: [PATCH 103/158] sparc,x86: vdso: remove meaningless undefining CONFIG_OPTIMIZE_INLINING The code, #undef CONFIG_OPTIMIZE_INLINING, is not working as expected because <linux/compiler_types.h> is parsed before vclock_gettime.c since 28128c61e08e ("kconfig.h: Include compiler types to avoid missed struct attributes"). Since then, <linux/compiler_types.h> is included really early by using the '-include' option. So, you cannot negate the decision of <linux/compiler_types.h> in this way. You can confirm it by checking the pre-processed code, like this: $ make arch/x86/entry/vdso/vdso32/vclock_gettime.i There is no difference with/without CONFIG_CC_OPTIMIZE_FOR_SIZE. It is about two years since 28128c61e08e. Nobody has reported a problem (or, nobody has even noticed the fact that this code is not working). It is ugly and unreliable to attempt to undefine a CONFIG option from C files, and anyway the inlining heuristic is up to the compiler. Just remove the broken code. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Reviewed-by: Nathan Chancellor Acked-by: Miguel Ojeda Cc: Arnd Bergmann Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Masahiro Yamada Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: David Miller Link: http://lkml.kernel.org/r/20200220110807.32534-1-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/sparc/vdso/vdso32/vclock_gettime.c | 4 ---- arch/x86/entry/vdso/vdso32/vclock_gettime.c | 4 ---- 2 files changed, 8 deletions(-) diff --git a/arch/sparc/vdso/vdso32/vclock_gettime.c b/arch/sparc/vdso/vdso32/vclock_gettime.c index 026abb3b826c..d7f99e6745ea 100644 --- a/arch/sparc/vdso/vdso32/vclock_gettime.c +++ b/arch/sparc/vdso/vdso32/vclock_gettime.c @@ -4,10 +4,6 @@ #define BUILD_VDSO32 -#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE -#undef CONFIG_OPTIMIZE_INLINING -#endif - #ifdef CONFIG_SPARC64 /* diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c index 1e82bd43286c..84a4a73f77f7 100644 --- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c +++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c @@ -1,10 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 #define BUILD_VDSO32 -#ifndef CONFIG_CC_OPTIMIZE_FOR_SIZE -#undef CONFIG_OPTIMIZE_INLINING -#endif - #ifdef CONFIG_X86_64 /* From 889b3c1245de48ed0cacf7aebb25c489d3e4a3e9 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Apr 2020 20:09:33 -0700 Subject: [PATCH 104/158] compiler: remove CONFIG_OPTIMIZE_INLINING entirely Commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this an always-on option.
We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Reviewed-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Cc: Arnd Bergmann Cc: Borislav Petkov Cc: David Miller Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/configs/i386_defconfig | 1 - arch/x86/configs/x86_64_defconfig | 1 - include/linux/compiler_types.h | 11 +---------- kernel/configs/tiny.config | 1 - lib/Kconfig.debug | 12 ------------ 5 files changed, 1 insertion(+), 25 deletions(-) diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index ab8b30cb978e..550904591e94 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -285,7 +285,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_SELINUX=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 2d196cb49084..614961009075 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -282,7 +282,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_UNWINDER_ORC=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index 72393a8c1a6c..e970f97a7fcb 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -129,22 +129,13 @@ struct ftrace_likely_data { #define __compiler_offsetof(a, b) __builtin_offsetof(a, b) /* - * Force always-inline if the user requests it so via the .config. * Prefer gnu_inline, so that extern inline functions do not emit an * externally visible function. This makes extern inline behave as per gnu89 * semantics rather than c99. This prevents multiple symbol definition errors * of extern inline functions at link time. * A lot of inline functions can cause havoc with function tracing. - * Do not use __always_inline here, since currently it expands to inline again - * (which would break users of __always_inline). */ -#if !defined(CONFIG_OPTIMIZE_INLINING) -#define inline inline __attribute__((__always_inline__)) __gnu_inline \ - __inline_maybe_unused notrace -#else -#define inline inline __gnu_inline \ - __inline_maybe_unused notrace -#endif +#define inline inline __gnu_inline __inline_maybe_unused notrace /* * gcc provides both __inline__ and __inline as alternate spellings of diff --git a/kernel/configs/tiny.config b/kernel/configs/tiny.config index 7fa0c4ae6394..8a44b93da0f3 100644 --- a/kernel/configs/tiny.config +++ b/kernel/configs/tiny.config @@ -6,7 +6,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set -CONFIG_OPTIMIZE_INLINING=y # CONFIG_SLAB is not set # CONFIG_SLUB is not set CONFIG_SLOB=y diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d1398cef3b18..7f9a89847b65 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -305,18 +305,6 @@ config HEADERS_INSTALL user-space program samples. It is also needed by some features such as uapi header sanity checks. 
-config OPTIMIZE_INLINING - def_bool y - help - This option determines if the kernel forces gcc to inline the functions - developers have marked 'inline'. Doing so takes away freedom from gcc to - do what it thinks is best, which is desirable for the gcc 3.x series of - compilers. The gcc 4.x series have a rewritten inlining algorithm and - enabling this option will generate a smaller kernel there. Hopefully - this algorithm is so good that allowing gcc 4.x and above to make the - decision will become the default in the future. Until then this option - is there to test gcc for this. - config DEBUG_SECTION_MISMATCH bool "Enable full Section mismatch analysis" help From af9c5d2e3b355854ff0e4acfbfbfadcd5198a349 Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Mon, 6 Apr 2020 20:09:37 -0700 Subject: [PATCH 105/158] compiler.h: fix error in BUILD_BUG_ON() reporting compiletime_assert() uses __LINE__ to create a unique function name. This means that if you have more than one BUILD_BUG_ON() in the same source line (which can happen if they appear e.g. in a macro), then the error message from the compiler might output the wrong condition. For this source file: #include <linux/build_bug.h> #define macro() \ BUILD_BUG_ON(1); \ BUILD_BUG_ON(0); void foo() { macro(); } gcc would output: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_9' declared with attribute error: BUILD_BUG_ON failed: 0 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) However, it was not the BUILD_BUG_ON(0) that failed, so it should say 1 instead of 0. With this patch, we use __COUNTER__ instead of __LINE__, so each BUILD_BUG_ON() gets a different function name and the correct condition is printed: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_0' declared with attribute error: BUILD_BUG_ON failed: 1 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) Signed-off-by: Vegard Nossum Signed-off-by: Andrew Morton Reviewed-by: Masahiro Yamada Reviewed-by: Daniel Santos Cc: Rasmus Villemoes Cc: Ian Abbott Cc: Joe Perches Link: http://lkml.kernel.org/r/20200331112637.25047-1-vegard.nossum@oracle.com Signed-off-by: Linus Torvalds --- include/linux/compiler.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 5e88e7e33abe..034b0a644efc 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -347,7 +347,7 @@ static inline void *offset_to_ptr(const int *off) * compiler has support to do so. */ #define compiletime_assert(condition, msg) \ - _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) + _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) #define compiletime_assert_atomic_type(t) \ compiletime_assert(__native_word(t), \ From 6680125ea5a2a5a69eeb8f7dcd89d0a1d23998ce Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:09:40 -0700 Subject: [PATCH 106/158] MAINTAINERS: list the section entries in the preferred order The MAINTAINERS file header has never shown a preferred order for the section entries but scripts/parse-maintainers.pl added a preferred order with commit 61f741645a35 ("parse-maintainers: Add section pattern sorting"). Commit 5cdbec108fd2 ("parse-maintainers: Do not sort section content by default") changed the preferred order to be a bit more sensible. Update the MAINTAINERS section description block to use this preferred section entry ordering. Add a slightly better description for the N: entry too.
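For illustration, a hypothetical section entry laid out in this preferred order (every name and value below is invented for the example, not taken from MAINTAINERS):

	EXAMPLE FROBNICATOR DRIVER
	M:	Jane Maintainer
	R:	Rob Reviewer
	L:	linux-example@vger.kernel.org
	S:	Maintained
	W:	https://example.org/frobnicator
	Q:	https://patchwork.kernel.org/project/example/list/
	B:	mailto:bugs@example.org
	T:	git git://git.example.org/frobnicator.git
	F:	drivers/misc/frobnicator/
	X:	drivers/misc/frobnicator/legacy/
	N:	frobnicator
	K:	\bfrob_api_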
Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Reviewed-by: Andy Shevchenko Link: http://lkml.kernel.org/r/5aa5aad6fb1678230c260337dc066cd449a2bf32.camel@perches.com Signed-off-by: Linus Torvalds --- MAINTAINERS | 37 +++++++++++++++++++------------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index 534a8dc4f84a..9271068bde63 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -77,21 +77,13 @@ Tips for patch submitters 8. Happy hacking. -Descriptions of section entries -------------------------------- +Descriptions of section entries and preferred order +--------------------------------------------------- M: *Mail* patches to: FullName R: Designated *Reviewer*: FullName These reviewers should be CCed on patches. L: *Mailing list* that is relevant to this area - W: *Web-page* with status/info - B: URI for where to file *bugs*. A web-page with detailed bug - filing info, a direct bug tracker link, or a mailto: URI. - C: URI for *chat* protocol, server and channel where developers - usually hang out, for example irc://server/channel. - Q: *Patchwork* web based patch tracking system site - T: *SCM* tree type and location. - Type is one of: git, hg, quilt, stgit, topgit S: *Status*, one of the following: Supported: Someone is actually paid to look after this. Maintained: Someone actually looks after it. @@ -102,30 +94,39 @@ Descriptions of section entries Obsolete: Old code. Something tagged obsolete generally means it has been replaced by a better system and you should be using that. + W: *Web-page* with status/info + Q: *Patchwork* web based patch tracking system site + B: URI for where to file *bugs*. A web-page with detailed bug + filing info, a direct bug tracker link, or a mailto: URI. + C: URI for *chat* protocol, server and channel where developers + usually hang out, for example irc://server/channel. P: Subsystem Profile document for more details submitting patches to the given subsystem. This is either an in-tree file, or a URI. See Documentation/maintainer/maintainer-entry-profile.rst for details. + T: *SCM* tree type and location. + Type is one of: git, hg, quilt, stgit, topgit F: *Files* and directories wildcard patterns. A trailing slash includes all files and subdirectory files. F: drivers/net/ all files in and below drivers/net F: drivers/net/* all files in drivers/net, but not below F: */net/* all files in "any top level directory"/net One pattern per line. Multiple F: lines acceptable. - N: Files and directories *Regex* patterns. - N: [^a-z]tegra all files whose path contains the word tegra - One pattern per line. Multiple N: lines acceptable. - scripts/get_maintainer.pl has different behavior for files that - match F: pattern and matches of N: patterns. By default, - get_maintainer will not look at git log history when an F: pattern - match occurs. When an N: match occurs, git log history is used - to also notify the people that have git commit signatures. X: *Excluded* files and directories that are NOT maintained, same rules as F:. Files exclusions are tested before file matches. Can be useful for excluding a specific subdirectory, for instance: F: net/ X: net/ipv6/ matches all files in and below net excluding net/ipv6/ + N: Files and directories *Regex* patterns. + N: [^a-z]tegra all files whose path contains tegra + (not including files like integrator) + One pattern per line. Multiple N: lines acceptable. + scripts/get_maintainer.pl has different behavior for files that + match F: pattern and matches of N: patterns. 
By default, + get_maintainer will not look at git log history when an F: pattern + match occurs. When an N: match occurs, git log history is used + to also notify the people that have git commit signatures. K: *Content regex* (perl extended) pattern match in a patch or file. For instance: K: of_get_profile From f80ac98a641a03097cbc9fdfd4b6a41a8dd3b7ae Mon Sep 17 00:00:00 2001 From: Josh Poimboeuf Date: Mon, 6 Apr 2020 20:09:43 -0700 Subject: [PATCH 107/158] bitops: always inline sign extension helpers With CONFIG_CC_OPTIMIZE_FOR_SIZE, objtool reports: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: i915_gem_execbuffer2_ioctl()+0x5b7: call to gen8_canonical_addr() with UACCESS enabled This means i915_gem_execbuffer2_ioctl() is calling gen8_canonical_addr() from the user_access_begin/end critical region (i.e., with SMAP disabled). While it's probably harmless in this case, in general we like to avoid extra function calls in SMAP-disabled regions because they can open up inadvertent security holes. Fix the warning by changing the sign extension helpers to __always_inline. This convinces GCC to inline gen8_canonical_addr(). The sign extension functions are trivial anyway, so it makes sense to always inline them. With my test optimize-for-size-based config, this actually shrinks the text size of i915_gem_execbuffer.o by 45 bytes -- and there is no change for vmlinux. Reported-by: Randy Dunlap Signed-off-by: Josh Poimboeuf Signed-off-by: Andrew Morton Cc: Peter Zijlstra Cc: Al Viro Cc: Chris Wilson Link: http://lkml.kernel.org/r/740179324b2b18b750b16295c48357f00b5fa9ed.1582982020.git.jpoimboe@redhat.com Signed-off-by: Linus Torvalds --- include/linux/bitops.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/bitops.h b/include/linux/bitops.h index 47f54b459c26..9acf654f0b19 100644 --- a/include/linux/bitops.h +++ b/include/linux/bitops.h @@ -162,7 +162,7 @@ static inline __u8 ror8(__u8 word, unsigned int shift) * * This is safe to use for 16- and 8-bit types as well. */ -static inline __s32 sign_extend32(__u32 value, int index) +static __always_inline __s32 sign_extend32(__u32 value, int index) { __u8 shift = 31 - index; return (__s32)(value << shift) >> shift; @@ -173,7 +173,7 @@ static inline __s32 sign_extend32(__u32 value, int index) * @value: value to sign extend * @index: 0 based bit index (0<=index<64) to sign bit */ -static inline __s64 sign_extend64(__u64 value, int index) +static __always_inline __s64 sign_extend64(__u64 value, int index) { __u8 shift = 63 - index; return (__s64)(value << shift) >> shift; From 30428ef5d1e8caf78639cc70a802f1cb7b1cec04 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Mon, 6 Apr 2020 20:09:47 -0700 Subject: [PATCH 108/158] lib/test_lockup: test module to generate lockups CONFIG_TEST_LOCKUP=m adds module "test_lockup" that helps to make sure that watchdogs and lockup detectors are working properly. Depending on module parameters, test_lockup can emulate a soft or hard lockup, a "hung task", hold an arbitrary lock, or allocate a bunch of pages. It can also generate a series of lockups with cooling-down periods; in this way it can be used as a "ping" for locks or the page allocator. The loop checks for signals between iterations, so it can be stopped with ^C. # modinfo test_lockup ...
parm: time_secs:lockup time in seconds, default 0 (uint) parm: time_nsecs:nanoseconds part of lockup time, default 0 (uint) parm: cooldown_secs:cooldown time between iterations in seconds, default 0 (uint) parm: cooldown_nsecs:nanoseconds part of cooldown, default 0 (uint) parm: iterations:lockup iterations, default 1 (uint) parm: all_cpus:trigger lockup at all cpus at once (bool) parm: state:wait in 'R' running (default), 'D' uninterruptible, 'K' killable, 'S' interruptible state (charp) parm: use_hrtimer:use high-resolution timer for sleeping (bool) parm: iowait:account sleep time as iowait (bool) parm: lock_read:lock read-write locks for read (bool) parm: lock_single:acquire locks only at one cpu (bool) parm: reacquire_locks:release and reacquire locks/irq/preempt between iterations (bool) parm: touch_softlockup:touch soft-lockup watchdog between iterations (bool) parm: touch_hardlockup:touch hard-lockup watchdog between iterations (bool) parm: call_cond_resched:call cond_resched() between iterations (bool) parm: measure_lock_wait:measure lock wait time (bool) parm: lock_wait_threshold:print lock wait time longer than this in nanoseconds, default off (ulong) parm: disable_irq:disable interrupts: generate hard-lockups (bool) parm: disable_softirq:disable bottom-half irq handlers (bool) parm: disable_preempt:disable preemption: generate soft-lockups (bool) parm: lock_rcu:grab rcu_read_lock: generate rcu stalls (bool) parm: lock_mmap_sem:lock mm->mmap_sem: block procfs interfaces (bool) parm: lock_rwsem_ptr:lock rw_semaphore at address (ulong) parm: lock_mutex_ptr:lock mutex at address (ulong) parm: lock_spinlock_ptr:lock spinlock at address (ulong) parm: lock_rwlock_ptr:lock rwlock at address (ulong) parm: alloc_pages_nr:allocate and free pages under locks (uint) parm: alloc_pages_order:page order to allocate (uint) parm: alloc_pages_gfp:allocate pages with this gfp_mask, default GFP_KERNEL (uint) parm: alloc_pages_atomic:allocate pages with GFP_ATOMIC (bool) parm: reallocate_pages:free and allocate pages between iterations (bool) Parameters for locking by address are unsafe and taint the kernel. With CONFIG_DEBUG_SPINLOCK=y the module at least checks magics for embedded spinlocks.
Examples: task hang in D-state: modprobe test_lockup time_secs=1 iterations=60 state=D task hang in io-wait D-state: modprobe test_lockup time_secs=1 iterations=60 state=D iowait softlockup: modprobe test_lockup time_secs=1 iterations=60 state=R hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R disable_irq system-wide hardlockup: modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq all_cpus rcu stall: modprobe test_lockup time_secs=1 iterations=60 state=R \ lock_rcu touch_softlockup lock mmap_sem / block procfs interfaces: modprobe test_lockup time_secs=1 iterations=60 state=S lock_mmap_sem lock tasklist_lock for read / block forks: TASKLIST_LOCK=$(awk '$3 == "tasklist_lock" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=R \ disable_irq lock_read lock_rwlock_ptr=$TASKLIST_LOCK lock namespace_sem / block vfs mount operations: NAMESPACE_SEM=$(awk '$3 == "namespace_sem" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_rwsem_ptr=$NAMESPACE_SEM lock cgroup mutex / block cgroup operations: CGROUP_MUTEX=$(awk '$3 == "cgroup_mutex" {print "0x"$1}' /proc/kallsyms) modprobe test_lockup time_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX ping cgroup_mutex every second and measure maximum lock wait time: modprobe test_lockup cooldown_secs=1 iterations=60 state=S \ lock_mutex_ptr=$CGROUP_MUTEX reacquire_locks measure_lock_wait [linux@roeck-us.net: rename disable_irq to fix build error] Link: http://lkml.kernel.org/r/20200317133614.23152-1-linux@roeck-us.net Signed-off-by: Konstantin Khlebnikov Signed-off-by: Guenter Roeck Signed-off-by: Andrew Morton Cc: Sasha Levin Cc: Petr Mladek Cc: Kees Cook Cc: Peter Zijlstra Cc: Greg Kroah-Hartman Cc: Steven Rostedt Cc: Sergey Senozhatsky Cc: Dmitry Monakhov Cc: Guenter Roeck Link: http://lkml.kernel.org/r/158132859146.2797.525923171323227836.stgit@buzz Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 12 + lib/Makefile | 1 + lib/test_lockup.c | 554 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 567 insertions(+) create mode 100644 lib/test_lockup.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 7f9a89847b65..ddcf000022ae 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -976,6 +976,18 @@ config WQ_WATCHDOG state. This can be configured through kernel parameter "workqueue.watchdog_thresh" and its sysfs counterpart. +config TEST_LOCKUP + tristate "Test module to generate lockups" + help + This builds the "test_lockup" module that helps to make sure + that watchdogs and lockup detectors are working properly. + + Depending on module parameters it could emulate soft or hard + lockup, "hung task", or locking arbitrary lock for a long time. + Also it could generate series of lockups with cooling-down periods. + + If unsure, say N. 
+ endmenu # "Debug lockups and hangs" menu "Scheduler Debugging" diff --git a/lib/Makefile b/lib/Makefile index 09a8acb0cf92..0fd125c4ad07 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -90,6 +90,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o +obj-$(CONFIG_TEST_LOCKUP) += test_lockup.o obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/ diff --git a/lib/test_lockup.c b/lib/test_lockup.c new file mode 100644 index 000000000000..9e8b8a0be9af --- /dev/null +++ b/lib/test_lockup.c @@ -0,0 +1,554 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Test module to generate lockups + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static unsigned int time_secs; +module_param(time_secs, uint, 0600); +MODULE_PARM_DESC(time_secs, "lockup time in seconds, default 0"); + +static unsigned int time_nsecs; +module_param(time_nsecs, uint, 0600); +MODULE_PARM_DESC(time_nsecs, "nanoseconds part of lockup time, default 0"); + +static unsigned int cooldown_secs; +module_param(cooldown_secs, uint, 0600); +MODULE_PARM_DESC(cooldown_secs, "cooldown time between iterations in seconds, default 0"); + +static unsigned int cooldown_nsecs; +module_param(cooldown_nsecs, uint, 0600); +MODULE_PARM_DESC(cooldown_nsecs, "nanoseconds part of cooldown, default 0"); + +static unsigned int iterations = 1; +module_param(iterations, uint, 0600); +MODULE_PARM_DESC(iterations, "lockup iterations, default 1"); + +static bool all_cpus; +module_param(all_cpus, bool, 0400); +MODULE_PARM_DESC(all_cpus, "trigger lockup at all cpus at once"); + +static int wait_state; +static char *state = "R"; +module_param(state, charp, 0400); +MODULE_PARM_DESC(state, "wait in 'R' running (default), 'D' uninterruptible, 'K' killable, 'S' interruptible state"); + +static bool use_hrtimer; +module_param(use_hrtimer, bool, 0400); +MODULE_PARM_DESC(use_hrtimer, "use high-resolution timer for sleeping"); + +static bool iowait; +module_param(iowait, bool, 0400); +MODULE_PARM_DESC(iowait, "account sleep time as iowait"); + +static bool lock_read; +module_param(lock_read, bool, 0400); +MODULE_PARM_DESC(lock_read, "lock read-write locks for read"); + +static bool lock_single; +module_param(lock_single, bool, 0400); +MODULE_PARM_DESC(lock_single, "acquire locks only at one cpu"); + +static bool reacquire_locks; +module_param(reacquire_locks, bool, 0400); +MODULE_PARM_DESC(reacquire_locks, "release and reacquire locks/irq/preempt between iterations"); + +static bool touch_softlockup; +module_param(touch_softlockup, bool, 0600); +MODULE_PARM_DESC(touch_softlockup, "touch soft-lockup watchdog between iterations"); + +static bool touch_hardlockup; +module_param(touch_hardlockup, bool, 0600); +MODULE_PARM_DESC(touch_hardlockup, "touch hard-lockup watchdog between iterations"); + +static bool call_cond_resched; +module_param(call_cond_resched, bool, 0600); +MODULE_PARM_DESC(call_cond_resched, "call cond_resched() between iterations"); + +static bool measure_lock_wait; +module_param(measure_lock_wait, bool, 0400); +MODULE_PARM_DESC(measure_lock_wait, "measure lock wait time"); + +static unsigned long lock_wait_threshold = ULONG_MAX; +module_param(lock_wait_threshold, ulong, 0400); +MODULE_PARM_DESC(lock_wait_threshold, "print lock wait time longer than this in nanoseconds, default off"); + +static bool test_disable_irq; 
+module_param_named(disable_irq, test_disable_irq, bool, 0400); +MODULE_PARM_DESC(disable_irq, "disable interrupts: generate hard-lockups"); + +static bool disable_softirq; +module_param(disable_softirq, bool, 0400); +MODULE_PARM_DESC(disable_softirq, "disable bottom-half irq handlers"); + +static bool disable_preempt; +module_param(disable_preempt, bool, 0400); +MODULE_PARM_DESC(disable_preempt, "disable preemption: generate soft-lockups"); + +static bool lock_rcu; +module_param(lock_rcu, bool, 0400); +MODULE_PARM_DESC(lock_rcu, "grab rcu_read_lock: generate rcu stalls"); + +static bool lock_mmap_sem; +module_param(lock_mmap_sem, bool, 0400); +MODULE_PARM_DESC(lock_mmap_sem, "lock mm->mmap_sem: block procfs interfaces"); + +static unsigned long lock_rwsem_ptr; +module_param_unsafe(lock_rwsem_ptr, ulong, 0400); +MODULE_PARM_DESC(lock_rwsem_ptr, "lock rw_semaphore at address"); + +static unsigned long lock_mutex_ptr; +module_param_unsafe(lock_mutex_ptr, ulong, 0400); +MODULE_PARM_DESC(lock_mutex_ptr, "lock mutex at address"); + +static unsigned long lock_spinlock_ptr; +module_param_unsafe(lock_spinlock_ptr, ulong, 0400); +MODULE_PARM_DESC(lock_spinlock_ptr, "lock spinlock at address"); + +static unsigned long lock_rwlock_ptr; +module_param_unsafe(lock_rwlock_ptr, ulong, 0400); +MODULE_PARM_DESC(lock_rwlock_ptr, "lock rwlock at address"); + +static unsigned int alloc_pages_nr; +module_param_unsafe(alloc_pages_nr, uint, 0600); +MODULE_PARM_DESC(alloc_pages_nr, "allocate and free pages under locks"); + +static unsigned int alloc_pages_order; +module_param(alloc_pages_order, uint, 0400); +MODULE_PARM_DESC(alloc_pages_order, "page order to allocate"); + +static gfp_t alloc_pages_gfp = GFP_KERNEL; +module_param_unsafe(alloc_pages_gfp, uint, 0400); +MODULE_PARM_DESC(alloc_pages_gfp, "allocate pages with this gfp_mask, default GFP_KERNEL"); + +static bool alloc_pages_atomic; +module_param(alloc_pages_atomic, bool, 0400); +MODULE_PARM_DESC(alloc_pages_atomic, "allocate pages with GFP_ATOMIC"); + +static bool reallocate_pages; +module_param(reallocate_pages, bool, 0400); +MODULE_PARM_DESC(reallocate_pages, "free and allocate pages between iterations"); + +static atomic_t alloc_pages_failed = ATOMIC_INIT(0); + +static atomic64_t max_lock_wait = ATOMIC64_INIT(0); + +static struct task_struct *main_task; +static int master_cpu; + +static void test_lock(bool master, bool verbose) +{ + u64 uninitialized_var(wait_start); + + if (measure_lock_wait) + wait_start = local_clock(); + + if (lock_mutex_ptr && master) { + if (verbose) + pr_notice("lock mutex %ps\n", (void *)lock_mutex_ptr); + mutex_lock((struct mutex *)lock_mutex_ptr); + } + + if (lock_rwsem_ptr && master) { + if (verbose) + pr_notice("lock rw_semaphore %ps\n", + (void *)lock_rwsem_ptr); + if (lock_read) + down_read((struct rw_semaphore *)lock_rwsem_ptr); + else + down_write((struct rw_semaphore *)lock_rwsem_ptr); + } + + if (lock_mmap_sem && master) { + if (verbose) + pr_notice("lock mmap_sem pid=%d\n", main_task->pid); + if (lock_read) + down_read(&main_task->mm->mmap_sem); + else + down_write(&main_task->mm->mmap_sem); + } + + if (test_disable_irq) + local_irq_disable(); + + if (disable_softirq) + local_bh_disable(); + + if (disable_preempt) + preempt_disable(); + + if (lock_rcu) + rcu_read_lock(); + + if (lock_spinlock_ptr && master) { + if (verbose) + pr_notice("lock spinlock %ps\n", + (void *)lock_spinlock_ptr); + spin_lock((spinlock_t *)lock_spinlock_ptr); + } + + if (lock_rwlock_ptr && master) { + if (verbose) + pr_notice("lock rwlock 
%ps\n", + (void *)lock_rwlock_ptr); + if (lock_read) + read_lock((rwlock_t *)lock_rwlock_ptr); + else + write_lock((rwlock_t *)lock_rwlock_ptr); + } + + if (measure_lock_wait) { + s64 cur_wait = local_clock() - wait_start; + s64 max_wait = atomic64_read(&max_lock_wait); + + do { + if (cur_wait < max_wait) + break; + max_wait = atomic64_cmpxchg(&max_lock_wait, + max_wait, cur_wait); + } while (max_wait != cur_wait); + + if (cur_wait > lock_wait_threshold) + pr_notice_ratelimited("lock wait %lld ns\n", cur_wait); + } +} + +static void test_unlock(bool master, bool verbose) +{ + if (lock_rwlock_ptr && master) { + if (lock_read) + read_unlock((rwlock_t *)lock_rwlock_ptr); + else + write_unlock((rwlock_t *)lock_rwlock_ptr); + if (verbose) + pr_notice("unlock rwlock %ps\n", + (void *)lock_rwlock_ptr); + } + + if (lock_spinlock_ptr && master) { + spin_unlock((spinlock_t *)lock_spinlock_ptr); + if (verbose) + pr_notice("unlock spinlock %ps\n", + (void *)lock_spinlock_ptr); + } + + if (lock_rcu) + rcu_read_unlock(); + + if (disable_preempt) + preempt_enable(); + + if (disable_softirq) + local_bh_enable(); + + if (test_disable_irq) + local_irq_enable(); + + if (lock_mmap_sem && master) { + if (lock_read) + up_read(&main_task->mm->mmap_sem); + else + up_write(&main_task->mm->mmap_sem); + if (verbose) + pr_notice("unlock mmap_sem pid=%d\n", main_task->pid); + } + + if (lock_rwsem_ptr && master) { + if (lock_read) + up_read((struct rw_semaphore *)lock_rwsem_ptr); + else + up_write((struct rw_semaphore *)lock_rwsem_ptr); + if (verbose) + pr_notice("unlock rw_semaphore %ps\n", + (void *)lock_rwsem_ptr); + } + + if (lock_mutex_ptr && master) { + mutex_unlock((struct mutex *)lock_mutex_ptr); + if (verbose) + pr_notice("unlock mutex %ps\n", + (void *)lock_mutex_ptr); + } +} + +static void test_alloc_pages(struct list_head *pages) +{ + struct page *page; + unsigned int i; + + for (i = 0; i < alloc_pages_nr; i++) { + page = alloc_pages(alloc_pages_gfp, alloc_pages_order); + if (!page) { + atomic_inc(&alloc_pages_failed); + break; + } + list_add(&page->lru, pages); + } +} + +static void test_free_pages(struct list_head *pages) +{ + struct page *page, *next; + + list_for_each_entry_safe(page, next, pages, lru) + __free_pages(page, alloc_pages_order); + INIT_LIST_HEAD(pages); +} + +static void test_wait(unsigned int secs, unsigned int nsecs) +{ + if (wait_state == TASK_RUNNING) { + if (secs) + mdelay(secs * MSEC_PER_SEC); + if (nsecs) + ndelay(nsecs); + return; + } + + __set_current_state(wait_state); + if (use_hrtimer) { + ktime_t time; + + time = ns_to_ktime((u64)secs * NSEC_PER_SEC + nsecs); + schedule_hrtimeout(&time, HRTIMER_MODE_REL); + } else { + schedule_timeout(secs * HZ + nsecs_to_jiffies(nsecs)); + } +} + +static void test_lockup(bool master) +{ + u64 lockup_start = local_clock(); + unsigned int iter = 0; + LIST_HEAD(pages); + + pr_notice("Start on CPU%d\n", raw_smp_processor_id()); + + test_lock(master, true); + + test_alloc_pages(&pages); + + while (iter++ < iterations && !signal_pending(main_task)) { + + if (iowait) + current->in_iowait = 1; + + test_wait(time_secs, time_nsecs); + + if (iowait) + current->in_iowait = 0; + + if (reallocate_pages) + test_free_pages(&pages); + + if (reacquire_locks) + test_unlock(master, false); + + if (touch_softlockup) + touch_softlockup_watchdog(); + + if (touch_hardlockup) + touch_nmi_watchdog(); + + if (call_cond_resched) + cond_resched(); + + test_wait(cooldown_secs, cooldown_nsecs); + + if (reacquire_locks) + test_lock(master, false); + + if (reallocate_pages) 
+ test_alloc_pages(&pages); + } + + pr_notice("Finish on CPU%d in %lld ns\n", raw_smp_processor_id(), + local_clock() - lockup_start); + + test_free_pages(&pages); + + test_unlock(master, true); +} + +DEFINE_PER_CPU(struct work_struct, test_works); + +static void test_work_fn(struct work_struct *work) +{ + test_lockup(!lock_single || + work == per_cpu_ptr(&test_works, master_cpu)); +} + +static bool test_kernel_ptr(unsigned long addr, int size) +{ + void *ptr = (void *)addr; + char buf; + + if (!addr) + return false; + + /* should be at least readable kernel address */ + if (access_ok(ptr, 1) || + access_ok(ptr + size - 1, 1) || + probe_kernel_address(ptr, buf) || + probe_kernel_address(ptr + size - 1, buf)) { + pr_err("invalid kernel ptr: %#lx\n", addr); + return true; + } + + return false; +} + +static bool __maybe_unused test_magic(unsigned long addr, int offset, + unsigned int expected) +{ + void *ptr = (void *)addr + offset; + unsigned int magic = 0; + + if (!addr) + return false; + + if (probe_kernel_address(ptr, magic) || magic != expected) { + pr_err("invalid magic at %#lx + %#x = %#x, expected %#x\n", + addr, offset, magic, expected); + return true; + } + + return false; +} + +static int __init test_lockup_init(void) +{ + u64 test_start = local_clock(); + + main_task = current; + + switch (state[0]) { + case 'S': + wait_state = TASK_INTERRUPTIBLE; + break; + case 'D': + wait_state = TASK_UNINTERRUPTIBLE; + break; + case 'K': + wait_state = TASK_KILLABLE; + break; + case 'R': + wait_state = TASK_RUNNING; + break; + default: + pr_err("unknown state=%s\n", state); + return -EINVAL; + } + + if (alloc_pages_atomic) + alloc_pages_gfp = GFP_ATOMIC; + + if (test_kernel_ptr(lock_spinlock_ptr, sizeof(spinlock_t)) || + test_kernel_ptr(lock_rwlock_ptr, sizeof(rwlock_t)) || + test_kernel_ptr(lock_mutex_ptr, sizeof(struct mutex)) || + test_kernel_ptr(lock_rwsem_ptr, sizeof(struct rw_semaphore))) + return -EINVAL; + +#ifdef CONFIG_DEBUG_SPINLOCK + if (test_magic(lock_spinlock_ptr, + offsetof(spinlock_t, rlock.magic), + SPINLOCK_MAGIC) || + test_magic(lock_rwlock_ptr, + offsetof(rwlock_t, magic), + RWLOCK_MAGIC) || + test_magic(lock_mutex_ptr, + offsetof(struct mutex, wait_lock.rlock.magic), + SPINLOCK_MAGIC) || + test_magic(lock_rwsem_ptr, + offsetof(struct rw_semaphore, wait_lock.magic), + SPINLOCK_MAGIC)) + return -EINVAL; +#endif + + if ((wait_state != TASK_RUNNING || + (call_cond_resched && !reacquire_locks) || + (alloc_pages_nr && gfpflags_allow_blocking(alloc_pages_gfp))) && + (test_disable_irq || disable_softirq || disable_preempt || + lock_rcu || lock_spinlock_ptr || lock_rwlock_ptr)) { + pr_err("refuse to sleep in atomic context\n"); + return -EINVAL; + } + + if (lock_mmap_sem && !main_task->mm) { + pr_err("no mm to lock mmap_sem\n"); + return -EINVAL; + } + + pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iteraions=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n", + main_task->pid, time_secs, time_nsecs, + cooldown_secs, cooldown_nsecs, iterations, state, + all_cpus ? "all_cpus " : "", + iowait ? "iowait " : "", + test_disable_irq ? "disable_irq " : "", + disable_softirq ? "disable_softirq " : "", + disable_preempt ? "disable_preempt " : "", + lock_rcu ? "lock_rcu " : "", + lock_read ? "lock_read " : "", + touch_softlockup ? "touch_softlockup " : "", + touch_hardlockup ? "touch_hardlockup " : "", + call_cond_resched ? "call_cond_resched " : "", + reacquire_locks ? 
"reacquire_locks " : ""); + + if (alloc_pages_nr) + pr_notice("ALLOCATE PAGES nr=%u order=%u gfp=%pGg %s\n", + alloc_pages_nr, alloc_pages_order, &alloc_pages_gfp, + reallocate_pages ? "reallocate_pages " : ""); + + if (all_cpus) { + unsigned int cpu; + + cpus_read_lock(); + + preempt_disable(); + master_cpu = smp_processor_id(); + for_each_online_cpu(cpu) { + INIT_WORK(per_cpu_ptr(&test_works, cpu), test_work_fn); + queue_work_on(cpu, system_highpri_wq, + per_cpu_ptr(&test_works, cpu)); + } + preempt_enable(); + + for_each_online_cpu(cpu) + flush_work(per_cpu_ptr(&test_works, cpu)); + + cpus_read_unlock(); + } else { + test_lockup(true); + } + + if (measure_lock_wait) + pr_notice("Maximum lock wait: %lld ns\n", + atomic64_read(&max_lock_wait)); + + if (alloc_pages_nr) + pr_notice("Page allocation failed %u times\n", + atomic_read(&alloc_pages_failed)); + + pr_notice("FINISH in %llu ns\n", local_clock() - test_start); + + if (signal_pending(main_task)) + return -EINTR; + + return -EAGAIN; +} +module_init(test_lockup_init); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Konstantin Khlebnikov "); +MODULE_DESCRIPTION("Test module to generate lockups"); From ad3f434b87e7d2be4f2f5d602d33f070823f4d48 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 6 Apr 2020 20:09:50 -0700 Subject: [PATCH 109/158] lib/test_lockup.c: fix spelling mistake "iteraions" -> "iterations" There is a spelling mistake in a pr_notice message. Fix it. Signed-off-by: Colin Ian King Signed-off-by: Andrew Morton Cc: Greg Kroah-Hartman Cc: Guenter Roeck Cc: Kees Cook Cc: Konstantin Khlebnikov Cc: Peter Zijlstra Cc: Petr Mladek Cc: Sasha Levin Cc: Sergey Senozhatsky Cc: Steven Rostedt Link: http://lkml.kernel.org/r/20200221155145.79522-1-colin.king@canonical.com Signed-off-by: Linus Torvalds --- lib/test_lockup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/test_lockup.c b/lib/test_lockup.c index 9e8b8a0be9af..83683ec1f429 100644 --- a/lib/test_lockup.c +++ b/lib/test_lockup.c @@ -490,7 +490,7 @@ static int __init test_lockup_init(void) return -EINVAL; } - pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iteraions=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n", + pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iterations=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n", main_task->pid, time_secs, time_nsecs, cooldown_secs, cooldown_nsecs, iterations, state, all_cpus ? "all_cpus " : "", From aecd42df6d399346fd8e5551dd8878338f214ec1 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Mon, 6 Apr 2020 20:09:54 -0700 Subject: [PATCH 110/158] lib/test_lockup.c: add parameters for locking generic vfs locks file_path= defines file or directory to open lock_inode=Y set lock_rwsem_ptr to inode->i_rwsem lock_mapping=Y set lock_rwsem_ptr to mapping->i_mmap_rwsem lock_sb_umount=Y set lock_rwsem_ptr to sb->s_umount This gives safe and simple way to see how system reacts to contention of common vfs locks and how syscalls depend on them directly or indirectly. For example to block s_umount for 60 seconds: # modprobe test_lockup file_path=. 
lock_sb_umount time_secs=60 state=S This is useful for checking/testing scalability issues like this: https://lore.kernel.org/lkml/158497590858.7371.9311902565121473436.stgit@buzz/ Signed-off-by: Konstantin Khlebnikov Signed-off-by: Andrew Morton Cc: Colin Ian King Cc: Greg Kroah-Hartman Cc: Guenter Roeck Cc: Kees Cook Cc: Peter Zijlstra Cc: Petr Mladek Cc: Sasha Levin Cc: Sergey Senozhatsky Cc: Steven Rostedt Link: http://lkml.kernel.org/r/158498153964.5621.83061779039255681.stgit@buzz Signed-off-by: Linus Torvalds --- lib/test_lockup.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/lib/test_lockup.c b/lib/test_lockup.c index 83683ec1f429..ea09ca335b21 100644 --- a/lib/test_lockup.c +++ b/lib/test_lockup.c @@ -14,6 +14,7 @@ #include #include #include +#include static unsigned int time_secs; module_param(time_secs, uint, 0600); @@ -140,6 +141,24 @@ static bool reallocate_pages; module_param(reallocate_pages, bool, 0400); MODULE_PARM_DESC(reallocate_pages, "free and allocate pages between iterations"); +struct file *test_file; +struct inode *test_inode; +static char test_file_path[256]; +module_param_string(file_path, test_file_path, sizeof(test_file_path), 0400); +MODULE_PARM_DESC(file_path, "file path to test"); + +static bool test_lock_inode; +module_param_named(lock_inode, test_lock_inode, bool, 0400); +MODULE_PARM_DESC(lock_inode, "lock file -> inode -> i_rwsem"); + +static bool test_lock_mapping; +module_param_named(lock_mapping, test_lock_mapping, bool, 0400); +MODULE_PARM_DESC(lock_mapping, "lock file -> mapping -> i_mmap_rwsem"); + +static bool test_lock_sb_umount; +module_param_named(lock_sb_umount, test_lock_sb_umount, bool, 0400); +MODULE_PARM_DESC(lock_sb_umount, "lock file -> sb -> s_umount"); + static atomic_t alloc_pages_failed = ATOMIC_INIT(0); static atomic64_t max_lock_wait = ATOMIC64_INIT(0); @@ -490,6 +509,29 @@ static int __init test_lockup_init(void) return -EINVAL; } + if (test_file_path[0]) { + test_file = filp_open(test_file_path, O_RDONLY, 0); + if (IS_ERR(test_file)) { + pr_err("cannot find file_path\n"); + return -EINVAL; + } + test_inode = file_inode(test_file); + } else if (test_lock_inode || + test_lock_mapping || + test_lock_sb_umount) { + pr_err("no file to lock\n"); + return -EINVAL; + } + + if (test_lock_inode && test_inode) + lock_rwsem_ptr = (unsigned long)&test_inode->i_rwsem; + + if (test_lock_mapping && test_file && test_file->f_mapping) + lock_rwsem_ptr = (unsigned long)&test_file->f_mapping->i_mmap_rwsem; + + if (test_lock_sb_umount && test_inode) + lock_rwsem_ptr = (unsigned long)&test_inode->i_sb->s_umount; + pr_notice("START pid=%d time=%u +%u ns cooldown=%u +%u ns iterations=%u state=%s %s%s%s%s%s%s%s%s%s%s%s\n", main_task->pid, time_secs, time_nsecs, cooldown_secs, cooldown_nsecs, iterations, state, @@ -542,6 +584,9 @@ static int __init test_lockup_init(void) pr_notice("FINISH in %llu ns\n", local_clock() - test_start); + if (test_file) + fput(test_file); + if (signal_pending(main_task)) return -EINTR; From a44ce5137026493b3bec3768cff92861dbeede5c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. 
Silva" Date: Mon, 6 Apr 2020 20:09:57 -0700 Subject: [PATCH 111/158] lib/bch.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200211205119.GA21234@embeddedor Signed-off-by: Linus Torvalds --- lib/bch.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/bch.c b/lib/bch.c index 5db6d3a4c8a6..052d3fb753a0 100644 --- a/lib/bch.c +++ b/lib/bch.c @@ -102,7 +102,7 @@ */ struct gf_poly { unsigned int deg; /* polynomial degree */ - unsigned int c[0]; /* polynomial terms */ + unsigned int c[]; /* polynomial terms */ }; /* given its degree, compute a polynomial size in bytes */ From c6e2ac3b476b458f9bd2069d85b95ca4634ce680 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:10:00 -0700 Subject: [PATCH 112/158] lib/ts_bm.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200211205620.GA24694@embeddedor Signed-off-by: Linus Torvalds --- lib/ts_bm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/ts_bm.c b/lib/ts_bm.c index b352903c50e3..277cb4417ac2 100644 --- a/lib/ts_bm.c +++ b/lib/ts_bm.c @@ -52,7 +52,7 @@ struct ts_bm u8 * pattern; unsigned int patlen; unsigned int bad_shift[ASIZE]; - unsigned int good_shift[0]; + unsigned int good_shift[]; }; static unsigned int bm_find(struct ts_config *conf, struct ts_state *state) From 842ae1f52b44a4fa3e4e7e5dd61e8b53e8a09ece Mon Sep 17 00:00:00 2001 From: "Gustavo A. R.
Silva" Date: Mon, 6 Apr 2020 20:10:03 -0700 Subject: [PATCH 113/158] lib/ts_fsm.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200211205813.GA25602@embeddedor Signed-off-by: Linus Torvalds --- lib/ts_fsm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/ts_fsm.c b/lib/ts_fsm.c index 9c873cadab7c..ab749ec10ab5 100644 --- a/lib/ts_fsm.c +++ b/lib/ts_fsm.c @@ -32,7 +32,7 @@ struct ts_fsm { unsigned int ntokens; - struct ts_fsm_token tokens[0]; + struct ts_fsm_token tokens[]; }; /* other values derived from ctype.h */ From 51022f8715bbb365de8521a5b6a2fcf967ee76ca Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:10:06 -0700 Subject: [PATCH 114/158] lib/ts_kmp.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200211205948.GA26459@embeddedor Signed-off-by: Linus Torvalds --- lib/ts_kmp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/ts_kmp.c b/lib/ts_kmp.c index 94617e014b3a..c77a3d537f24 100644 --- a/lib/ts_kmp.c +++ b/lib/ts_kmp.c @@ -36,7 +36,7 @@ struct ts_kmp { u8 * pattern; unsigned int pattern_len; - unsigned int prefix_tbl[0]; + unsigned int prefix_tbl[]; }; static unsigned int kmp_find(struct ts_config *conf, struct ts_state *state) From 6e85318521c001b6af82830774a8a7d05eec6adf Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Mon, 6 Apr 2020 20:10:09 -0700 Subject: [PATCH 115/158] lib/scatterlist: fix sg_copy_buffer() kerneldoc Add the missing closing parenthesis to the description for the to_buffer parameter of sg_copy_buffer().
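As a quick orientation for that flag, a sketch of the two directions at a call site (the variable names here are illustrative, only the sg_copy_buffer() signature is from the kernel):

	size_t copied;

	/* to_buffer == true: copy out of the sg list into buf */
	copied = sg_copy_buffer(sgl, nents, buf, buflen, skip, true);

	/* to_buffer == false: copy from buf into the sg list */
	copied = sg_copy_buffer(sgl, nents, buf, buflen, skip, false);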
Signed-off-by: Geert Uytterhoeven Signed-off-by: Andrew Morton Cc: Akinobu Mita --- lib/scatterlist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/scatterlist.c b/lib/scatterlist.c index 5813072bc589..5d63a8857f36 100644 --- a/lib/scatterlist.c +++ b/lib/scatterlist.c @@ -832,7 +832,7 @@ EXPORT_SYMBOL(sg_miter_stop); * @buflen: The number of bytes to copy * @skip: Number of bytes to skip before copying * @to_buffer: transfer direction (true == from an sg list to a - * buffer, false == from a buffer to an sg list + * buffer, false == from a buffer to an sg list) * * Returns the number of copied bytes. * From 9cf016e6b49b53d6c15d4137c034178148149ef4 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:10:12 -0700 Subject: [PATCH 116/158] lib: test_stackinit.c: XFAIL switch variable init tests The tests cover initializing a variable defined between a switch statement's test and its first "case" statement; such variables are currently not initialized by Clang[1] nor by the proposed auto-initialization feature in GCC. We should retain the test (so that we can evaluate compiler fixes), but mark it as an "expected fail". The rest of the kernel source will be adjusted to avoid this corner case. Also disable -Wswitch-unreachable for the test so that the intentionally broken code won't trigger warnings for GCC (nor future Clang) when initialization happens in this unhandled place. [1] https://bugs.llvm.org/show_bug.cgi?id=44916 Suggested-by: Alexander Potapenko Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Jann Horn Cc: Ard Biesheuvel Link: http://lkml.kernel.org/r/202002191358.2897A07C6@keescook Signed-off-by: Linus Torvalds --- lib/Makefile | 1 + lib/test_stackinit.c | 28 ++++++++++++++++++---------- 2 files changed, 19 insertions(+), 10 deletions(-) diff --git a/lib/Makefile b/lib/Makefile index 0fd125c4ad07..93d05ff4f501 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -87,6 +87,7 @@ obj-$(CONFIG_TEST_KMOD) += test_kmod.o obj-$(CONFIG_TEST_DEBUG_VIRTUAL) += test_debug_virtual.o obj-$(CONFIG_TEST_MEMCAT_P) += test_memcat_p.o obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o +CFLAGS_test_stackinit.o += $(call cc-disable-warning, switch-unreachable) obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o diff --git a/lib/test_stackinit.c b/lib/test_stackinit.c index 2d7d257a430e..f93b1e145ada 100644 --- a/lib/test_stackinit.c +++ b/lib/test_stackinit.c @@ -92,8 +92,9 @@ static bool range_contains(char *haystack_start, size_t haystack_size, * @var_type: type to be tested for zeroing initialization * @which: is this a SCALAR, STRING, or STRUCT type? * @init_level: what kind of initialization is performed + * @xfail: is this test expected to fail? */ -#define DEFINE_TEST_DRIVER(name, var_type, which) \ /* Returns 0 on success, 1 on failure. */ \ static noinline __init int test_ ## name (void) \ { \ +#define DEFINE_TEST_DRIVER(name, var_type, which, xfail) \ /* Returns 0 on success, 1 on failure. */ \ static noinline __init int test_ ## name (void) \ { \ @@ -139,13 +140,14 @@ static noinline __init int test_ ## name (void) \ for (sum = 0, i = 0; i < target_size; i++) \ sum += (check_buf[i] == 0xFF); \ \ - if (sum == 0) \ pr_info(#name " ok\n"); \ - else \ - pr_warn(#name " FAIL (uninit bytes: %d)\n", \ - sum); \ - \ - return (sum != 0); \ + if (sum == 0) { \ + pr_info(#name " ok\n"); \ + return 0; \ + } else { \ + pr_warn(#name " %sFAIL (uninit bytes: %d)\n", \ + (xfail) ? "X" : "", sum); \ + return (xfail) ?
0 : 1; \ } \ } #define DEFINE_TEST(name, var_type, which, init_level) \ /* no-op to force compiler into ignoring "uninitialized" vars */\ @@ -189,7 +191,7 @@ static noinline __init int leaf_ ## name(unsigned long sp, \ \ return (int)buf[0] | (int)buf[sizeof(buf) - 1]; \ } \ -DEFINE_TEST_DRIVER(name, var_type, which) +DEFINE_TEST_DRIVER(name, var_type, which, 0) /* Structure with no padding. */ struct test_packed { @@ -326,8 +328,14 @@ static noinline __init int leaf_switch_2_none(unsigned long sp, bool fill, return __leaf_switch_none(2, fill); } -DEFINE_TEST_DRIVER(switch_1_none, uint64_t, SCALAR); -DEFINE_TEST_DRIVER(switch_2_none, uint64_t, SCALAR); +/* + * These are expected to fail for most configurations because neither + * GCC nor Clang have a way to perform initialization of variables in + * non-code areas (i.e. in a switch statement before the first "case"). + * https://bugs.llvm.org/show_bug.cgi?id=44916 + */ +DEFINE_TEST_DRIVER(switch_1_none, uint64_t, SCALAR, 1); +DEFINE_TEST_DRIVER(switch_2_none, uint64_t, SCALAR, 1); static int __init test_stackinit_init(void) { From 69866e156ce28c47dc0085b5ba5e734f5511812e Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Mon, 6 Apr 2020 20:10:15 -0700 Subject: [PATCH 117/158] lib/stackdepot.c: check depot_index before accessing the stack slab Avoid crashes on corrupted stack ids. Although stack ID corruption may indicate other bugs in the program, we'd better fail gracefully on such IDs instead of crashing the kernel. This patch has been previously mailed as part of KMSAN RFC patch series. Link: http://lkml.kernel.org/r/20200220141916.55455-1-glider@google.com Signed-off-by: Alexander Potapenko Cc: Vegard Nossum Cc: Dmitry Vyukov Cc: Marco Elver Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Arnd Bergmann Cc: Sergey Senozhatsky From: Dan Carpenter Subject: lib/stackdepot.c: fix a condition in stack_depot_fetch() We should check for a NULL pointer first before adding the offset. Otherwise, if the pointer is NULL and the offset is non-zero, it will lead to an Oops.
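For orientation, stack_depot_fetch() splits the handle with a union along these lines (a sketch of the layout in lib/stackdepot.c at this time; the exact field names and bit widths are defined in that file):

	union handle_parts {
		depot_stack_handle_t handle;
		struct {
			u32 slabindex : STACK_ALLOC_INDEX_BITS;
			u32 offset : STACK_ALLOC_OFFSET_BITS;
			u32 valid : STACK_ALLOC_NULL_PROTECTION_BITS;
		};
	};

A corrupted handle can therefore decode to a slabindex beyond depot_index, or to a slab slot that was never allocated, which is exactly what the checks added below guard against.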
Fixes: d45048e65a59 ("lib/stackdepot.c: check depot_index before accessing the stack slab") Signed-off-by: Dan Carpenter Signed-off-by: Andrew Morton Acked-by: Alexander Potapenko Link: http://lkml.kernel.org/r/20200312113006.GA20562@mwanda Signed-off-by: Linus Torvalds --- lib/stackdepot.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/lib/stackdepot.c b/lib/stackdepot.c index 81c69c08d1d1..1ec36ee344e0 100644 --- a/lib/stackdepot.c +++ b/lib/stackdepot.c @@ -202,9 +202,20 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle, unsigned long **entries) { union handle_parts parts = { .handle = handle }; - void *slab = stack_slabs[parts.slabindex]; + void *slab; size_t offset = parts.offset << STACK_ALLOC_ALIGN; - struct stack_record *stack = slab + offset; + struct stack_record *stack; + + *entries = NULL; + if (parts.slabindex > depot_index) { + WARN(1, "slab index %d out of bounds (%d) for stack id %08x\n", + parts.slabindex, depot_index, handle); + return 0; + } + slab = stack_slabs[parts.slabindex]; + if (!slab) + return 0; + stack = slab + offset; *entries = stack->entries; return stack->size; From 7b65942fb2f0ac939be9c659bb889e78b399f84e Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Mon, 6 Apr 2020 20:10:19 -0700 Subject: [PATCH 118/158] lib/stackdepot.c: build with -fno-builtin Clang may replace stackdepot_memcmp() with a call to instrumented bcmp(), which is exactly what we wanted to avoid by creating stackdepot_memcmp(). Building the file with -fno-builtin prevents such optimizations. This patch has been previously mailed as part of KMSAN RFC patch series. Signed-off-by: Alexander Potapenko Signed-off-by: Andrew Morton Cc: Vegard Nossum Cc: Dmitry Vyukov Cc: Marco Elver Cc: Andrey Konovalov Cc: Sergey Senozhatsky Cc: Arnd Bergmann Cc: Andrey Ryabinin Link: http://lkml.kernel.org/r/20200220141916.55455-2-glider@google.com Signed-off-by: Linus Torvalds --- lib/Makefile | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/lib/Makefile b/lib/Makefile index 93d05ff4f501..3fc06399295d 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -223,6 +223,10 @@ obj-$(CONFIG_MEMREGION) += memregion.o obj-$(CONFIG_STMP_DEVICE) += stmp_device.o obj-$(CONFIG_IRQ_POLL) += irq_poll.o +# stackdepot.c should not be instrumented or call instrumented functions. +# Prevent the compiler from calling builtins like memcmp() or bcmp() from this +# file. +CFLAGS_stackdepot.o += -fno-builtin obj-$(CONFIG_STACKDEPOT) += stackdepot.o KASAN_SANITIZE_stackdepot.o := n KCOV_INSTRUMENT_stackdepot.o := n From 505a0ef15f96c6c43ec719c9fc1833d98957bb39 Mon Sep 17 00:00:00 2001 From: Alexander Potapenko Date: Mon, 6 Apr 2020 20:10:22 -0700 Subject: [PATCH 119/158] kasan: stackdepot: move filter_irq_stacks() to stackdepot.c filter_irq_stacks() can be used by other tools (e.g. KMSAN), so it needs to be moved to a common location. lib/stackdepot.c seems a good place, as filter_irq_stacks() is usually applied to the output of stack_trace_save(). This patch has been previously mailed as part of KMSAN RFC patch series.
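As a usage sketch (the local variable names and buffer size are illustrative, not from the patch), a caller typically trims a fresh trace before depositing it:

	unsigned long entries[64];
	depot_stack_handle_t handle;
	unsigned int nr;

	nr = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
	nr = filter_irq_stacks(entries, nr);
	handle = stack_depot_save(entries, nr, GFP_ATOMIC);

This mirrors how the KASAN code touched below uses the helper.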
[glider@google.com: nds32: linker script: add SOFTIRQENTRY_TEXT] Link: http://lkml.kernel.org/r/20200311121002.241430-1-glider@google.com [glider@google.com: add IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to linker script] Link: http://lkml.kernel.org/r/20200311121124.243352-1-glider@google.com Signed-off-by: Alexander Potapenko Signed-off-by: Andrew Morton Cc: Vegard Nossum Cc: Dmitry Vyukov Cc: Marco Elver Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Arnd Bergmann Cc: Sergey Senozhatsky Link: http://lkml.kernel.org/r/20200220141916.55455-3-glider@google.com Signed-off-by: Linus Torvalds --- arch/ia64/kernel/vmlinux.lds.S | 2 ++ arch/nds32/kernel/vmlinux.lds.S | 1 + include/linux/stackdepot.h | 2 ++ lib/stackdepot.c | 24 ++++++++++++++++++++++++ mm/kasan/common.c | 23 ----------------------- 5 files changed, 29 insertions(+), 23 deletions(-) diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S index 1ec6b703c5b4..6b5652ee76f9 100644 --- a/arch/ia64/kernel/vmlinux.lds.S +++ b/arch/ia64/kernel/vmlinux.lds.S @@ -54,6 +54,8 @@ SECTIONS { CPUIDLE_TEXT LOCK_TEXT KPROBES_TEXT + IRQENTRY_TEXT + SOFTIRQENTRY_TEXT *(.gnu.linkonce.t*) } diff --git a/arch/nds32/kernel/vmlinux.lds.S b/arch/nds32/kernel/vmlinux.lds.S index f679d3397436..7a6c1cefe3fe 100644 --- a/arch/nds32/kernel/vmlinux.lds.S +++ b/arch/nds32/kernel/vmlinux.lds.S @@ -47,6 +47,7 @@ SECTIONS LOCK_TEXT KPROBES_TEXT IRQENTRY_TEXT + SOFTIRQENTRY_TEXT *(.fixup) } diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h index 3efa97d482cb..24d49c732341 100644 --- a/include/linux/stackdepot.h +++ b/include/linux/stackdepot.h @@ -19,4 +19,6 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries, unsigned int stack_depot_fetch(depot_stack_handle_t handle, unsigned long **entries); +unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries); + #endif diff --git a/lib/stackdepot.c b/lib/stackdepot.c index 1ec36ee344e0..2caffc64e4c8 100644 --- a/lib/stackdepot.c +++ b/lib/stackdepot.c @@ -20,6 +20,7 @@ */ #include +#include <linux/interrupt.h> #include #include #include @@ -316,3 +317,26 @@ fast_exit: return retval; } EXPORT_SYMBOL_GPL(stack_depot_save); + +static inline int in_irqentry_text(unsigned long ptr) +{ + return (ptr >= (unsigned long)&__irqentry_text_start && + ptr < (unsigned long)&__irqentry_text_end) || + (ptr >= (unsigned long)&__softirqentry_text_start && + ptr < (unsigned long)&__softirqentry_text_end); +} + +unsigned int filter_irq_stacks(unsigned long *entries, + unsigned int nr_entries) +{ + unsigned int i; + + for (i = 0; i < nr_entries; i++) { + if (in_irqentry_text(entries[i])) { + /* Include the irqentry function into the stack. */ + return i + 1; + } + } + return nr_entries; +} +EXPORT_SYMBOL_GPL(filter_irq_stacks); diff --git a/mm/kasan/common.c b/mm/kasan/common.c index e61b4a492218..2906358e42f0 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -15,7 +15,6 @@ */ #include -#include #include #include #include @@ -42,28 +41,6 @@ #include "kasan.h" #include "../slab.h" -static inline int in_irqentry_text(unsigned long ptr) -{ - return (ptr >= (unsigned long)&__irqentry_text_start && - ptr < (unsigned long)&__irqentry_text_end) || - (ptr >= (unsigned long)&__softirqentry_text_start && - ptr < (unsigned long)&__softirqentry_text_end); -} - -static inline unsigned int filter_irq_stacks(unsigned long *entries, - unsigned int nr_entries) -{ - unsigned int i; - - for (i = 0; i < nr_entries; i++) { - if (in_irqentry_text(entries[i])) { - /* Include the irqentry function into the stack.
*/ - return i + 1; - } - return nr_entries; -} - static inline depot_stack_handle_t save_stack(gfp_t flags) { unsigned long entries[KASAN_STACK_DEPTH]; From 7e2345200262e4a6056580f0231cccdaffc825f3 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Mon, 6 Apr 2020 20:10:25 -0700 Subject: [PATCH 120/158] percpu_counter: fix a data race at vm_committed_as "vm_committed_as.count" could be accessed concurrently as reported by KCSAN, BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35: percpu_counter_add_batch+0x83/0xd0 percpu_counter_add_batch at lib/percpu_counter.c:91 __vm_enough_memory+0xb9/0x260 dup_mm+0x3a4/0x8f0 copy_process+0x2458/0x3240 _do_fork+0xaa/0x9f0 __do_sys_clone+0x125/0x160 __x64_sys_clone+0x70/0x90 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19: __vm_enough_memory+0x199/0x260 percpu_counter_read_positive at include/linux/percpu_counter.h:81 (inlined by) __vm_enough_memory at mm/util.c:839 mmap_region+0x1b2/0xa10 do_mmap+0x45c/0x700 vm_mmap_pgoff+0xc0/0x130 ksys_mmap_pgoff+0x6e/0x300 __x64_sys_mmap+0x33/0x40 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe The read is outside the percpu_counter::lock critical section, which results in a data race. Fix it by adding a READ_ONCE() in percpu_counter_read_positive() which could also serve as the existing compiler memory barrier. Signed-off-by: Qian Cai Signed-off-by: Andrew Morton Acked-by: Marco Elver Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds --- include/linux/percpu_counter.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h index 4f052496cdfd..0a4f54dd4737 100644 --- a/include/linux/percpu_counter.h +++ b/include/linux/percpu_counter.h @@ -78,9 +78,9 @@ static inline s64 percpu_counter_read(struct percpu_counter *fbc) */ static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc) { - s64 ret = fbc->count; + /* Prevent reloads of fbc->count */ + s64 ret = READ_ONCE(fbc->count); - barrier(); /* Prevent reloads of fbc->count */ if (ret >= 0) return ret; return 0; From caa7f776ccd3067ddea1f36d92662cee7608f141 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 6 Apr 2020 20:10:28 -0700 Subject: [PATCH 121/158] lib/test_bitmap.c: make use of EXP2_IN_BITS Commit 30544ed5de43 ("lib/bitmap: introduce bitmap_replace() helper") introduced some new test cases to the test_bitmap.c module. Among these it also introduced an (unused) definition. Let's make use of EXP2_IN_BITS.
Reported-by: Alex Shi Signed-off-by: Andy Shevchenko Signed-off-by: Andrew Morton Reviewed-by: Alex Shi Link: http://lkml.kernel.org/r/20200121151847.75223-1-andriy.shevchenko@linux.intel.com Signed-off-by: Linus Torvalds --- lib/test_bitmap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 61ed71c1daba..6b13150667f5 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -278,6 +278,8 @@ static void __init test_replace(void) unsigned int nlongs = DIV_ROUND_UP(nbits, BITS_PER_LONG); DECLARE_BITMAP(bmap, 1024); + BUILD_BUG_ON(EXP2_IN_BITS < nbits * 2); + bitmap_zero(bmap, 1024); bitmap_replace(bmap, &exp2[0 * nlongs], &exp2[1 * nlongs], exp2_to_exp3_mask, nbits); expect_eq_bitmap(bmap, exp3_0_1, nbits); From 8d994cada27c565adfaa5ac524cc1bc099668dfd Mon Sep 17 00:00:00 2001 From: chenqiwu Date: Mon, 6 Apr 2020 20:10:31 -0700 Subject: [PATCH 122/158] lib/rbtree: fix coding style of assignments Leave a blank space between the left-hand and right-hand sides of the assignment to better meet the kernel coding style. Signed-off-by: chenqiwu Signed-off-by: Andrew Morton Reviewed-by: Michel Lespinasse Link: http://lkml.kernel.org/r/1582621140-25850-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds --- lib/rbtree.c | 4 ++-- tools/lib/rbtree.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/rbtree.c b/lib/rbtree.c index abc86c6a3177..8545872e61db 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -503,7 +503,7 @@ struct rb_node *rb_next(const struct rb_node *node) if (node->rb_right) { node = node->rb_right; while (node->rb_left) - node=node->rb_left; + node = node->rb_left; return (struct rb_node *)node; } @@ -535,7 +535,7 @@ struct rb_node *rb_prev(const struct rb_node *node) if (node->rb_left) { node = node->rb_left; while (node->rb_right) - node=node->rb_right; + node = node->rb_right; return (struct rb_node *)node; } diff --git a/tools/lib/rbtree.c b/tools/lib/rbtree.c index 2548ff8c4d9c..06ac7bd2144b 100644 --- a/tools/lib/rbtree.c +++ b/tools/lib/rbtree.c @@ -497,7 +497,7 @@ struct rb_node *rb_next(const struct rb_node *node) if (node->rb_right) { node = node->rb_right; while (node->rb_left) - node=node->rb_left; + node = node->rb_left; return (struct rb_node *)node; } @@ -528,7 +528,7 @@ struct rb_node *rb_prev(const struct rb_node *node) if (node->rb_left) { node = node->rb_left; while (node->rb_right) - node=node->rb_right; + node = node->rb_right; return (struct rb_node *)node; } From 8f0259c27c850403c6a2076ef07180dd802da559 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Mon, 6 Apr 2020 20:10:35 -0700 Subject: [PATCH 123/158] lib/test_kmod.c: remove a NULL test The "info" pointer has already been dereferenced so checking here is too late. Fortunately, we never pass NULL pointers to the test_kmod_put_module() function so the test can simply be removed.
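The anti-pattern being removed is worth a sketch of its own, since it recurs all over the tree (hypothetical struct and names, not the test_kmod code):

	#include <linux/module.h>

	struct foo_info {
		int type;
		struct module *owner;
	};

	static void foo_put(struct foo_info *info)
	{
		switch (info->type) {		/* info is dereferenced here... */
		case 1:
			if (info && info->owner)  /* ...so this NULL test can never fail */
				module_put(info->owner);
			break;
		}
	}

If info really could be NULL, the check would have to come before the switch; placed after the dereference it is dead code, which is exactly the kind of inconsistency static checkers flag.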
Signed-off-by: Dan Carpenter Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Link: http://lkml.kernel.org/r/20200228092452.vwkhthsn77nrxdy6@kili.mountain Signed-off-by: Linus Torvalds --- lib/test_kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/test_kmod.c b/lib/test_kmod.c index 9cf77628fc91..e651c37d56db 100644 --- a/lib/test_kmod.c +++ b/lib/test_kmod.c @@ -204,7 +204,7 @@ static void test_kmod_put_module(struct kmod_test_device_info *info) case TEST_KMOD_DRIVER: break; case TEST_KMOD_FS_TYPE: - if (info && info->fs_sync && info->fs_sync->owner) + if (info->fs_sync && info->fs_sync->owner) module_put(info->fs_sync->owner); break; default: From 295bcca84916cb5079140a89fccb472bb8d1f6e2 Mon Sep 17 00:00:00 2001 From: Rikard Falkeborn Date: Mon, 6 Apr 2020 20:10:38 -0700 Subject: [PATCH 124/158] linux/bits.h: add compile time sanity check of GENMASK inputs GENMASK() and GENMASK_ULL() are supposed to be called with the high bit as the first argument and the low bit as the second argument. Mixing them will return a mask with zero bits set. Recent commits show getting this wrong is not uncommon, see e.g. commit aa4c0c9091b0 ("net: stmmac: Fix misuses of GENMASK macro") and commit 9bdd7bb3a844 ("clocksource/drivers/npcm: Fix misuse of GENMASK macro"). To prevent such mistakes from appearing again, add compile time sanity checking to the arguments of GENMASK() and GENMASK_ULL(). If both arguments are known at compile time, and the low bit is higher than the high bit, break the build to detect the mistake immediately. Since GENMASK() is used in declarations, BUILD_BUG_ON_ZERO() must be used instead of BUILD_BUG_ON(). __builtin_constant_p does not evaluate its argument, it only checks if it is a constant or not at compile time, and __builtin_choose_expr does not evaluate the expression that is not chosen. Therefore, GENMASK(x++, 0) only evaluates x++ once. Commit 95b980d62d52 ("linux/bits.h: make BIT(), GENMASK(), and friends available in assembly") made the macros in linux/bits.h available in assembly. Since BUILD_BUG_ON_ZERO() is not asm compatible, disable the checks if the file is included in an asm file. Due to bugs in GCC versions before 4.9 [0], disable the check if building with a too old GCC compiler. [0]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19449 Signed-off-by: Rikard Falkeborn Signed-off-by: Andrew Morton Reviewed-by: Masahiro Yamada Reviewed-by: Kees Cook Cc: Borislav Petkov Cc: Geert Uytterhoeven Cc: Haren Myneni Cc: Joe Perches Cc: Johannes Berg Cc: lkml Cc: Ingo Molnar Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200308193954.2372399-1-rikard.falkeborn@gmail.com Signed-off-by: Linus Torvalds --- include/linux/bits.h | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/bits.h b/include/linux/bits.h index a740bbcf3cd2..4671fbf28842 100644 --- a/include/linux/bits.h +++ b/include/linux/bits.h @@ -18,12 +18,30 @@ * position @h. For example * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000. */ -#define GENMASK(h, l) \ +#if !defined(__ASSEMBLY__) && \ + (!defined(CONFIG_CC_IS_GCC) || CONFIG_GCC_VERSION >= 49000) +#include +#define GENMASK_INPUT_CHECK(h, l) \ + (BUILD_BUG_ON_ZERO(__builtin_choose_expr( \ + __builtin_constant_p((l) > (h)), (l) > (h), 0))) +#else +/* + * BUILD_BUG_ON_ZERO is not available in h files included from asm files, + * disable the input check if that is the case.
+ */ +#define GENMASK_INPUT_CHECK(h, l) 0 +#endif + +#define __GENMASK(h, l) \ (((~UL(0)) - (UL(1) << (l)) + 1) & \ (~UL(0) >> (BITS_PER_LONG - 1 - (h)))) +#define GENMASK(h, l) \ + (GENMASK_INPUT_CHECK(h, l) + __GENMASK(h, l)) -#define GENMASK_ULL(h, l) \ +#define __GENMASK_ULL(h, l) \ (((~ULL(0)) - (ULL(1) << (l)) + 1) & \ (~ULL(0) >> (BITS_PER_LONG_LONG - 1 - (h)))) +#define GENMASK_ULL(h, l) \ + (GENMASK_INPUT_CHECK(h, l) + __GENMASK_ULL(h, l)) #endif /* __LINUX_BITS_H */ From 8306b057a85ec07482da5d4b99d5c0b47af69be1 Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Mon, 6 Apr 2020 20:10:45 -0700 Subject: [PATCH 125/158] lib/dynamic_debug.c: use address-of operator on section symbols Clang warns: ../lib/dynamic_debug.c:1034:24: warning: array comparison always evaluates to false [-Wtautological-compare] if (__start___verbose == __stop___verbose) { ^ 1 warning generated. These are not true arrays; they are linker-defined symbols, which are just addresses. Using the address-of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Andrew Morton Acked-by: Jason Baron Link: https://github.com/ClangBuiltLinux/linux/issues/894 Link: http://lkml.kernel.org/r/20200220051320.10739-1-natechancellor@gmail.com Signed-off-by: Linus Torvalds --- lib/dynamic_debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c index aae17d9522e5..8f199f403ab5 100644 --- a/lib/dynamic_debug.c +++ b/lib/dynamic_debug.c @@ -1031,7 +1031,7 @@ static int __init dynamic_debug_init(void) int n = 0, entries = 0, modct = 0; int verbose_bytes = 0; - if (__start___verbose == __stop___verbose) { + if (&__start___verbose == &__stop___verbose) { pr_warn("_ddebug table is empty in a CONFIG_DYNAMIC_DEBUG build\n"); return 1; } From dfa05c28ca7ffc0a94e7a35b91014d088c1a1ff5 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:10:48 -0700 Subject: [PATCH 126/158] checkpatch: remove email address comment from email address comparisons About 2% of the last 100K commits have email addresses that include an RFC2822-compliant comment like: Peter Zijlstra (Intel) checkpatch currently does a comparison of the complete name and address to the submitted author to determine if the author has signed off and emits a warning if the exact email names and addresses do not match. Unfortunately, the author email address can be written without the comment like: Peter Zijlstra Add logic to compare the comment-stripped email addresses to avoid this warning.
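checkpatch does the stripping in Perl; as a rough C sketch of the same normalization idea (a hypothetical helper, not the patch's code):

	#include <string.h>

	/*
	 * Drop one trailing "(comment)" from a display name, e.g.
	 * "Peter Zijlstra (Intel)" becomes "Peter Zijlstra".
	 */
	static void strip_name_comment(char *name)
	{
		size_t len = strlen(name);
		char *open;

		if (!len || name[len - 1] != ')')
			return;
		open = strrchr(name, '(');
		if (!open)
			return;
		while (open > name && open[-1] == ' ')
			open--;
		*open = '\0';
	}

Once both the author line and the Signed-off-by: line are normalized this way, the two spellings of the same identity compare equal and the spurious warning disappears.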
Reported-by: Peter Zijlstra Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Acked-by: Peter Zijlstra (Intel) Link: http://lkml.kernel.org/r/ebaa2f7c8f94e25520981945cddcc1982e70e072.camel@perches.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 39 +++++++++++++++++++++++++++++---------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index a63380c6b0d2..3f62971c7c98 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -1118,6 +1118,7 @@ sub parse_email { my ($formatted_email) = @_; my $name = ""; + my $name_comment = ""; my $address = ""; my $comment = ""; @@ -1150,6 +1151,10 @@ sub parse_email { $name = trim($name); $name =~ s/^\"|\"$//g; + $name =~ s/(\s*\([^\)]+\))\s*//; + if (defined($1)) { + $name_comment = trim($1); + } $address = trim($address); $address =~ s/^\<|\>$//g; @@ -1158,7 +1163,7 @@ sub parse_email { $name = "\"$name\""; } - return ($name, $address, $comment); + return ($name, $name_comment, $address, $comment); } sub format_email { @@ -1184,6 +1189,23 @@ sub format_email { return $formatted_email; } +sub reformat_email { + my ($email) = @_; + + my ($email_name, $name_comment, $email_address, $comment) = parse_email($email); + return format_email($email_name, $email_address); +} + +sub same_email_addresses { + my ($email1, $email2) = @_; + + my ($email1_name, $name1_comment, $email1_address, $comment1) = parse_email($email1); + my ($email2_name, $name2_comment, $email2_address, $comment2) = parse_email($email2); + + return $email1_name eq $email2_name && + $email1_address eq $email2_address; +} + sub which { my ($bin) = @_; @@ -2604,17 +2626,16 @@ sub process { $author = $1; $author = encode("utf8", $author) if ($line =~ /=\?utf-8\?/i); $author =~ s/"//g; + $author = reformat_email($author); } # Check the patch for a signoff: - if ($line =~ /^\s*signed-off-by:/i) { + if ($line =~ /^\s*signed-off-by:\s*(.*)/i) { $signoff++; $in_commit_log = 0; if ($author ne '') { - my $l = $line; - $l =~ s/"//g; - if ($l =~ /^\s*signed-off-by:\s*\Q$author\E/i) { - $authorsignoff = 1; + if (same_email_addresses($1, $author)) { + $authorsignoff = 1; } } } @@ -2664,7 +2685,7 @@ sub process { } } - my ($email_name, $email_address, $comment) = parse_email($email); + my ($email_name, $name_comment, $email_address, $comment) = parse_email($email); my $suggested_email = format_email(($email_name, $email_address)); if ($suggested_email eq "") { ERROR("BAD_SIGN_OFF", @@ -2675,9 +2696,7 @@ sub process { $dequoted =~ s/" $comment" ne $email && - "$suggested_email$comment" ne $email) { + if (!same_email_addresses($email, $suggested_email)) { WARN("BAD_SIGN_OFF", "email address '$email' might be better as '$suggested_email$comment'\n" . $herecurr); } From c8df0ab61454aa1c84e97af3673ee8db44e39f05 Mon Sep 17 00:00:00 2001 From: Lubomir Rintel Date: Mon, 6 Apr 2020 20:10:51 -0700 Subject: [PATCH 127/158] checkpatch: check SPDX tags in YAML files This adds a warning when a YAML file is lacking a SPDX header on first line, or it uses incorrect commenting style. Currently the only YAML files in the tree are Devicetree binding documents. 
Signed-off-by: Lubomir Rintel Signed-off-by: Andrew Morton Acked-by: Joe Perches Cc: Rob Herring Link: http://lkml.kernel.org/r/20200129123356.388669-1-lkundrak@v3.sk Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 3f62971c7c98..977d4c0c2dc9 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -3106,7 +3106,7 @@ sub process { $comment = '/*'; } elsif ($realfile =~ /\.(c|dts|dtsi)$/) { $comment = '//'; - } elsif (($checklicenseline == 2) || $realfile =~ /\.(sh|pl|py|awk|tc)$/) { + } elsif (($checklicenseline == 2) || $realfile =~ /\.(sh|pl|py|awk|tc|yaml)$/) { $comment = '#'; } elsif ($realfile =~ /\.rst$/) { $comment = '..'; From a8972573eb5c784a9c199fedbfdfb694e0ca4233 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Mon, 6 Apr 2020 20:10:55 -0700 Subject: [PATCH 128/158] checkpatch: support "base-commit:" format In order to support the get-lore-mbox.py tool described in [1], I ran: git format-patch --base= --cover-letter ... which generated a "base-commit: " tag at the end of the cover letter. However, checkpatch.pl generated an error upon encountering "base-commit:" in the cover letter: "ERROR: Please use git commit description style..." ... because it found the "commit" keyword, and failed to recognize that it was part of the "base-commit" phrase, and as such, should not be subjected to the same commit description style rules. Update checkpatch.pl to include a special case for "base-commit:" (at the start of the line, possibly with some leading whitespace) so that that tag no longer generates a checkpatch error. [1] https://lwn.net/Articles/811528/ "Better tools for kernel developers" Suggested-by: Joe Perches Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Acked-by: Joe Perches Cc: Andy Whitcroft Cc: Konstantin Ryabitsev Cc: Jonathan Corbet Link: http://lkml.kernel.org/r/20200213055004.69235-2-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 977d4c0c2dc9..ede55afaadf0 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2780,7 +2780,7 @@ sub process { # Check for git id commit length and improperly formed commit descriptions if ($in_commit_log && !$commit_log_possible_stack_dump && - $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink):/i && + $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink|base-commit):/i && $line !~ /^This reverts commit [0-9a-f]{7,40}/ && ($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i || ($line =~ /(?:\s|^)[0-9a-f]{12,40}(?:[\s"'\(\[]|$)/i && From f36d3eb89a43047d3eba222a8132585da25cebfd Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:10:58 -0700 Subject: [PATCH 129/158] checkpatch: prefer fallthrough; over fallthrough comments commit 294f69e662d1 ("compiler_attributes.h: Add 'fallthrough' pseudo keyword for switch/case use") added the pseudo keyword, so add a test for it to checkpatch. Warn on a patch or use --strict for files.
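For reference, the preferred form looks like this in C (a sketch; the pseudo keyword itself comes from include/linux/compiler_attributes.h):

	#include <linux/compiler_attributes.h>

	static int set_flags(int state)
	{
		int flags = 0;

		switch (state) {
		case 1:
			flags |= 0x1;
			fallthrough;	/* explicit, and visible to the compiler */
		case 2:
			flags |= 0x2;
			break;
		default:
			break;
		}
		return flags;
	}

Unlike the comment variants matched below, the keyword lets -Wimplicit-fallthrough verify that every unannotated fall-through is a bug.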
Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/8b6c1b9031ab9f3cdebada06b8d46467f1492d68.camel@perches.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index ede55afaadf0..2b32816f5c2e 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2294,6 +2294,19 @@ sub pos_last_openparen { return length(expand_tabs(substr($line, 0, $last_openparen))) + 1; } +sub get_raw_comment { + my ($line, $rawline) = @_; + my $comment = ''; + + for my $i (0 .. (length($line) - 1)) { + if (substr($line, $i, 1) eq "$;") { + $comment .= substr($rawline, $i, 1); + } + } + + return $comment; +} + sub process { my $filename = shift; @@ -2455,6 +2468,7 @@ sub process { $sline =~ s/$;/ /g; #with comments as spaces my $rawline = $rawlines[$linenr - 1]; + my $raw_comment = get_raw_comment($line, $rawline); # check if it's a mode change, rename or start of a patch if (!$in_commit_log && @@ -6408,6 +6422,28 @@ sub process { } } +# check for /* fallthrough */ like comment, prefer fallthrough; + my @fallthroughs = ( + 'fallthrough', + '@fallthrough@', + 'lint -fallthrough[ \t]*', + 'intentional(?:ly)?[ \t]*fall(?:(?:s | |-)[Tt]|t)hr(?:ough|u|ew)', + '(?:else,?\s*)?FALL(?:S | |-)?THR(?:OUGH|U|EW)[ \t.!]*(?:-[^\n\r]*)?', + 'Fall(?:(?:s | |-)[Tt]|t)hr(?:ough|u|ew)[ \t.!]*(?:-[^\n\r]*)?', + 'fall(?:s | |-)?thr(?:ough|u|ew)[ \t.!]*(?:-[^\n\r]*)?', + ); + if ($raw_comment ne '') { + foreach my $ft (@fallthroughs) { + if ($raw_comment =~ /$ft/) { + my $msg_level = \&WARN; + $msg_level = \&CHK if ($file); + &{$msg_level}("PREFER_FALLTHROUGH", + "Prefer 'fallthrough;' over fallthrough comment\n" . $herecurr); + last; + } + } + } + # check for switch/default statements without a break; if ($perl_version_ok && defined $stat && From 342d3d2f136837cbe9aaee368dbea39865b9c0f4 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Apr 2020 20:11:01 -0700 Subject: [PATCH 130/158] checkpatch: fix minor typo and mixed space+tab in indentation Fix spelling of "concatenation". Don't use tab after space in indentation. Signed-off-by: Antonio Borneo Signed-off-by: Andrew Morton Cc: Joe Perches Link: http://lkml.kernel.org/r/20200122163852.124417-2-borneo.antonio@gmail.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 2b32816f5c2e..879ccf33b441 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -4615,7 +4615,7 @@ sub process { ($op eq '>' && $ca =~ /<\S+\@\S+$/)) { - $ok = 1; + $ok = 1; } # for asm volatile statements @@ -4950,7 +4950,7 @@ sub process { # conditional. substr($s, 0, length($c), ''); $s =~ s/\n.*//g; - $s =~ s/$;//g; # Remove any comments + $s =~ s/$;//g; # Remove any comments if (length($c) && $s !~ /^\s*{?\s*\\*\s*$/ && $c !~ /}\s*while\s*/) { @@ -4989,7 +4989,7 @@ sub process { # if and else should not have general statements after it if ($line =~ /^.\s*(?:}\s*)?else\b(.*)/) { my $s = $1; - $s =~ s/$;//g; # Remove any comments + $s =~ s/$;//g; # Remove any comments if ($s !~ /^\s*(?:\sif|(?:{|)\s*\\?\s*$)/) { ERROR("TRAILING_STATEMENTS", "trailing statements should be on next line\n" . $herecurr); @@ -5165,7 +5165,7 @@ sub process { { } - # Flatten any obvious string concatentation. + # Flatten any obvious string concatenation. 
while ($dstat =~ s/($String)\s*$Ident/$1/ || $dstat =~ s/$Ident\s*($String)/$1/) { From 7b18496cbc9af07f5dfbb1ce56ea839344061b54 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Apr 2020 20:11:04 -0700 Subject: [PATCH 131/158] checkpatch: fix multiple const * types Commit 1574a29f8e76 ("checkpatch: allow multiple const * types") claims to support repetition of the pattern "const *", but it actually allows only one extra instance. Check the following lines int a(char const * const x[]); int b(char const * const *x); int c(char const * const * const x[]); int d(char const * const * const *x); with command ./scripts/checkpatch.pl --show-types -f filename to find that only the first line passes the test, while a warning is triggered by the other 3 lines: WARNING:FUNCTION_ARGUMENTS: function definition argument 'char const * const' should also have an identifier name The reason is that the pattern match halts at the second asterisk in the line, thus the remaining text starting with an asterisk fails to match a valid name for a variable. Fixed by replacing "?" (Match 1 or 0 times) with "{0,4}" (Match no more than 4 times) in the regular expression. Also fix the similar test for types in unusual order. Fixes: 1574a29f8e76 ("checkpatch: allow multiple const * types") Signed-off-by: Antonio Borneo Signed-off-by: Andrew Morton Cc: Joe Perches Link: http://lkml.kernel.org/r/20200122163852.124417-1-borneo.antonio@gmail.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 879ccf33b441..0d2b95036ea5 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -804,12 +804,12 @@ sub build_types { }x; $Type = qr{ $NonptrType - (?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+)? + (?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+){0,4} (?:\s+$Inline|\s+$Modifier)* }x; $TypeMisordered = qr{ $NonptrTypeMisordered - (?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+)? + (?:(?:\s|\*|\[\])+\s*const|(?:\s|\*\s*(?:const\s*)?|\[\])+|(?:\s*\[\s*\])+){0,4} (?:\s+$Inline|\s+$Modifier)* }x; $Declare = qr{(?:$Storage\s+(?:$Inline\s+)?)?$Type}; From 713a09de9ca9ac6277cb43d79967e7852238f998 Mon Sep 17 00:00:00 2001 From: Antonio Borneo Date: Mon, 6 Apr 2020 20:11:07 -0700 Subject: [PATCH 132/158] checkpatch: add command-line option for TAB size MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Linux kernel coding style requires a size of 8 characters for both TAB and indentation, and this value is embedded as a magic number all over the checkpatch script. This makes it hard to reuse the script in other projects with different coding style requirements (e.g. OpenOCD [1] requires a TAB size of 4 characters [2]). Replace the magic value 8 with a variable. Add a command-line option "--tab-size" to let the user select a TAB size value other than 8.
[1] http://openocd.org/ [2] http://openocd.org/doc/doxygen/html/stylec.html#styleformat Signed-off-by: Antonio Borneo Signed-off-by: Erik Ahlén Signed-off-by: Spencer Oliver Signed-off-by: Andrew Morton Cc: Joe Perches Link: http://lkml.kernel.org/r/20200122163852.124417-3-borneo.antonio@gmail.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 0d2b95036ea5..fd16f2239a00 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -64,6 +64,7 @@ my $color = "auto"; my $allow_c99_comments = 1; # Can be overridden by --ignore C99_COMMENT_TOLERANCE # git output parsing needs US English output, so first set backtick child process LANGUAGE my $git_command ='export LANGUAGE=en_US.UTF-8; git'; +my $tabsize = 8; sub help { my ($exitcode) = @_; @@ -98,6 +99,7 @@ Options: --show-types show the specific message type in the output --max-line-length=n set the maximum line length, if exceeded, warn --min-conf-desc-length=n set the min description length, if shorter, warn + --tab-size=n set the number of spaces for tab (default 8) --root=PATH PATH to the kernel tree root --no-summary suppress the per-file summary --mailback only produce a report in case of warnings/errors @@ -215,6 +217,7 @@ GetOptions( 'list-types!' => \$list_types, 'max-line-length=i' => \$max_line_length, 'min-conf-desc-length=i' => \$min_conf_desc_length, + 'tab-size=i' => \$tabsize, 'root=s' => \$root, 'summary!' => \$summary, 'mailback!' => \$mailback, @@ -267,6 +270,9 @@ if ($color =~ /^[01]$/) { die "Invalid color mode: $color\n"; } +# skip TAB size 1 to avoid additional checks on $tabsize - 1 +die "Invalid TAB size: $tabsize\n" if ($tabsize < 2); + sub hash_save_array_words { my ($hashRef, $arrayRef) = @_; @@ -1239,7 +1245,7 @@ sub expand_tabs { if ($c eq "\t") { $res .= ' '; $n++; - for (; ($n % 8) != 0; $n++) { + for (; ($n % $tabsize) != 0; $n++) { $res .= ' '; } next; @@ -2252,7 +2258,7 @@ sub string_find_replace { sub tabify { my ($leading) = @_; - my $source_indent = 8; + my $source_indent = $tabsize; my $max_spaces_before_tab = $source_indent - 1; my $spaces_to_tab = " " x $source_indent; @@ -3231,7 +3237,7 @@ sub process { next if ($realfile !~ /\.(h|c|pl|dtsi|dts)$/); # at the beginning of a line any tabs must come first and anything -# more than 8 must use tabs. +# more than $tabsize must use tabs. if ($rawline =~ /^\+\s* \t\s*\S/ || $rawline =~ /^\+\s* \s*/) { my $herevet = "$here\n" . cat_vet($rawline) . "\n"; @@ -3250,7 +3256,7 @@ sub process { "please, no space before tabs\n" . $herevet) && $fix) { while ($fixed[$fixlinenr] =~ - s/(^\+.*) {8,8}\t/$1\t\t/) {} + s/(^\+.*) {$tabsize,$tabsize}\t/$1\t\t/) {} while ($fixed[$fixlinenr] =~ s/(^\+.*) +\t/$1\t/) {} } @@ -3272,11 +3278,11 @@ sub process { if ($perl_version_ok && $sline =~ /^\+\t+( +)(?:$c90_Keywords\b|\{\s*$|\}\s*(?:else\b|while\b|\s*$)|$Declare\s*$Ident\s*[;=])/) { my $indent = length($1); - if ($indent % 8) { + if ($indent % $tabsize) { if (WARN("TABSTOP", "Statements should start on a tabstop\n" . $herecurr) && $fix) { - $fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/8)@e; + $fixed[$fixlinenr] =~ s@(^\+\t+) +@$1 . "\t" x ($indent/$tabsize)@e; } } } @@ -3294,8 +3300,8 @@ sub process { my $newindent = $2; my $goodtabindent = $oldindent . - "\t" x ($pos / 8) . - " " x ($pos % 8); + "\t" x ($pos / $tabsize) . + " " x ($pos % $tabsize); my $goodspaceindent = $oldindent . 
" " x $pos; if ($newindent ne $goodtabindent && @@ -3766,11 +3772,11 @@ sub process { #print "line<$line> prevline<$prevline> indent<$indent> sindent<$sindent> check<$check> continuation<$continuation> s<$s> cond_lines<$cond_lines> stat_real<$stat_real> stat<$stat>\n"; if ($check && $s ne '' && - (($sindent % 8) != 0 || + (($sindent % $tabsize) != 0 || ($sindent < $indent) || ($sindent == $indent && ($s !~ /^\s*(?:\}|\{|else\b)/)) || - ($sindent > $indent + 8))) { + ($sindent > $indent + $tabsize))) { WARN("SUSPECT_CODE_INDENT", "suspect code indent for conditional statements ($indent, $sindent)\n" . $herecurr . "$stat_real\n"); } From 44d303eb05ef1a24fec96580aa8a8d6f555e9b7e Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:11:10 -0700 Subject: [PATCH 133/158] checkpatch: improve Gerrit Change-Id: test The Gerrit Change-Id: entry is sometimes placed after a Signed-off-by: line. When this occurs, the Gerrit warning is not currently emitted as the first Signed-off-by: signature sets a flag to stop looking. Change the test to add a test for the --- patch separator and emit the warning before any before the --- and also before any diff file name. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Tested-by: John Stultz Link: http://lkml.kernel.org/r/2f6d5f8766fe7439a116c77ea8cc721a3f2d77a2.camel@perches.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index fd16f2239a00..64cfd617eefc 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -2335,6 +2335,7 @@ sub process { my $is_binding_patch = -1; my $in_header_lines = $file ? 0 : 1; my $in_commit_log = 0; #Scanning lines before patch + my $has_patch_separator = 0; #Found a --- line my $has_commit_log = 0; #Encountered lines before patch my $commit_log_lines = 0; #Number of commit log lines my $commit_log_possible_stack_dump = 0; @@ -2660,6 +2661,12 @@ sub process { } } +# Check for patch separator + if ($line =~ /^---$/) { + $has_patch_separator = 1; + $in_commit_log = 0; + } + # Check if MAINTAINERS is being updated. If so, there's probably no need to # emit the "does MAINTAINERS need updating?" message on file add/move/delete if ($line =~ /^\s*MAINTAINERS\s*\|/) { @@ -2759,10 +2766,10 @@ sub process { "A patch subject line should describe the change not the tool that found it\n" . $herecurr); } -# Check for unwanted Gerrit info - if ($in_commit_log && $line =~ /^\s*change-id:/i) { +# Check for Gerrit Change-Ids not in any patch context + if ($realfile eq '' && !$has_patch_separator && $line =~ /^\s*change-id:/i) { ERROR("GERRIT_CHANGE_ID", - "Remove Gerrit Change-Id's before submitting upstream.\n" . $herecurr); + "Remove Gerrit Change-Id's before submitting upstream\n" . $herecurr); } # Check if the commit log is in a possible stack dump From 50c92900214dd9a55bcecc3c53e90d072aff6560 Mon Sep 17 00:00:00 2001 From: Lubomir Rintel Date: Mon, 6 Apr 2020 20:11:13 -0700 Subject: [PATCH 134/158] checkpatch: check proper licensing of Devicetree bindings According to Devicetree maintainers (see Link: below), the Devicetree binding documents are preferrably licensed (GPL-2.0-only OR BSD-2-Clause). Let's check that. The actual check is a bit more relaxed, to allow more liberal but compatible licensing (e.g. GPL-2.0-or-later OR BSD-2-Clause). 
Signed-off-by: Lubomir Rintel Signed-off-by: Andrew Morton Acked-by: Joe Perches Acked-by: Laurent Pinchart Cc: Rob Herring Cc: Neil Armstrong Cc: Laurent Pinchart , Cc: Jonas Karlman , Cc: Jernej Skrabec , Cc: Mark Rutland , Cc: David Airlie Cc: Daniel Vetter , Link: https://lore.kernel.org/lkml/20200108142132.GA4830@bogus/ Link: http://lkml.kernel.org/r/20200309215153.38824-1-lkundrak@v3.sk Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 64cfd617eefc..460a6b5b9d9a 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -3157,6 +3157,17 @@ sub process { WARN("SPDX_LICENSE_TAG", "'$spdx_license' is not supported in LICENSES/...\n" . $herecurr); } + if ($realfile =~ m@^Documentation/devicetree/bindings/@ && + not $spdx_license =~ /GPL-2\.0.*BSD-2-Clause/) { + my $msg_level = \&WARN; + $msg_level = \&CHK if ($file); + if (&{$msg_level}("SPDX_LICENSE_TAG", + + "DT binding documents should be licensed (GPL-2.0-only OR BSD-2-Clause)\n" . $herecurr) && + $fix) { + $fixed[$fixlinenr] =~ s/SPDX-License-Identifier: .*/SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)/; + } + } } } } From 16b7f3c89907acf947500f0c20dcff44f651d795 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Mon, 6 Apr 2020 20:11:17 -0700 Subject: [PATCH 135/158] checkpatch: avoid warning about uninitialized_var() WARNING: function definition argument 'flags' should also have an identifier name #26: FILE: drivers/tty/serial/sh-sci.c:1348: + unsigned long uninitialized_var(flags); Special-case uninitialized_var() to prevent this. Signed-off-by: Joe Perches Signed-off-by: Andrew Morton Tested-by: Andrew Morton Link: http://lkml.kernel.org/r/7db7944761b0bd88c70eb17d4b7f40fe589e14ed.camel@perches.com Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 460a6b5b9d9a..d64c67b67e3c 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -4071,7 +4071,7 @@ sub process { } # check for function declarations without arguments like "int foo()" - if ($line =~ /(\b$Type\s+$Ident)\s*\(\s*\)/) { + if ($line =~ /(\b$Type\s*$Ident)\s*\(\s*\)/) { if (ERROR("FUNCTION_WITHOUT_ARGS", "Bad function definition - $1() should probably be $1(void)\n" . $herecurr) && $fix) { @@ -6287,13 +6287,17 @@ sub process { } # check for function declarations that have arguments without identifier names +# while avoiding uninitialized_var(x) if (defined $stat && - $stat =~ /^.\s*(?:extern\s+)?$Type\s*(?:$Ident|\(\s*\*\s*$Ident\s*\))\s*\(\s*([^{]+)\s*\)\s*;/s && - $1 ne "void") { - my $args = trim($1); + $stat =~ /^.\s*(?:extern\s+)?$Type\s*(?:($Ident)|\(\s*\*\s*$Ident\s*\))\s*\(\s*([^{]+)\s*\)\s*;/s && + (!defined($1) || + (defined($1) && $1 ne "uninitialized_var")) && + $2 ne "void") { + my $args = trim($2); while ($args =~ m/\s*($Type\s*(?:$Ident|\(\s*\*\s*$Ident?\s*\)\s*$balanced_parens)?)/g) { my $arg = trim($1); - if ($arg =~ /^$Type$/ && $arg !~ /enum\s+$Ident$/) { + if ($arg =~ /^$Type$/ && + $arg !~ /enum\s+$Ident$/) { WARN("FUNCTION_ARGUMENTS", "function definition argument '$arg' should also have an identifier name\n" . 
$herecurr); } From 282144e04b9a88b53dc16dc46e47f7d810e0b63b Mon Sep 17 00:00:00 2001 From: Roman Penyaev Date: Mon, 6 Apr 2020 20:11:20 -0700 Subject: [PATCH 136/158] kselftest: introduce new epoll test case This testcase repeats epollbug.c from the bug: https://bugzilla.kernel.org/show_bug.cgi?id=205933 What does it test? It tests the race between epoll_ctl() and epoll_wait(). A new event mask passed to epoll_ctl() triggers a wakeup, which can be missed because of the bug described in the link. Reproduction is 100%, so it is easy to fix. Kudos, Max, for the wonderful test case. Signed-off-by: Roman Penyaev Signed-off-by: Andrew Morton Cc: Max Neunhoeffer Cc: Jakub Kicinski Cc: Christopher Kohlhoff Cc: Davidlohr Bueso Cc: Jason Baron Cc: Jes Sorensen Link: http://lkml.kernel.org/r/20200214170211.561524-2-rpenyaev@suse.de Signed-off-by: Linus Torvalds --- .../filesystems/epoll/epoll_wakeup_test.c | 67 ++++++++++++++++++- 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c index 37a04dab56f0..11eee0b60040 100644 --- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c +++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c @@ -7,13 +7,14 @@ #include #include #include +#include #include "../../kselftest_harness.h" struct epoll_mtcontext { int efd[3]; int sfd[4]; - int count; + volatile int count; pthread_t main; pthread_t waiter; @@ -3071,4 +3072,68 @@ TEST(epoll58) close(ctx.sfd[3]); } +static void *epoll59_thread(void *ctx_) +{ + struct epoll_mtcontext *ctx = ctx_; + struct epoll_event e; + int i; + + for (i = 0; i < 100000; i++) { + while (ctx->count == 0) + ; + + e.events = EPOLLIN | EPOLLERR | EPOLLET; + epoll_ctl(ctx->efd[0], EPOLL_CTL_MOD, ctx->sfd[0], &e); + ctx->count = 0; + } + + return NULL; +} + +/* + * t0 + * (p) \ + * e0 + * (et) / + * e0 + * + * Based on https://bugzilla.kernel.org/show_bug.cgi?id=205933 + */ +TEST(epoll59) +{ + pthread_t emitter; + struct pollfd pfd; + struct epoll_event e; + struct epoll_mtcontext ctx = { 0 }; + int i, ret; + + signal(SIGUSR1, signal_handler); + + ctx.efd[0] = epoll_create1(0); + ASSERT_GE(ctx.efd[0], 0); + + ctx.sfd[0] = eventfd(1, 0); + ASSERT_GE(ctx.sfd[0], 0); + + e.events = EPOLLIN | EPOLLERR | EPOLLET; + ASSERT_EQ(epoll_ctl(ctx.efd[0], EPOLL_CTL_ADD, ctx.sfd[0], &e), 0); + + ASSERT_EQ(pthread_create(&emitter, NULL, epoll59_thread, &ctx), 0); + + for (i = 0; i < 100000; i++) { + ret = epoll_wait(ctx.efd[0], &e, 1, 1000); + ASSERT_GT(ret, 0); + + while (ctx.count != 0) + ; + ctx.count = 1; + } + if (pthread_tryjoin_np(emitter, NULL) < 0) { + pthread_kill(emitter, SIGUSR1); + pthread_join(emitter, NULL); + } + close(ctx.efd[0]); + close(ctx.sfd[0]); +} + TEST_HARNESS_MAIN From efcdd350d1f8a98fa9fb5280e5af0a22e2059a26 Mon Sep 17 00:00:00 2001 From: Jason Baron Date: Mon, 6 Apr 2020 20:11:23 -0700 Subject: [PATCH 137/158] fs/epoll: make nesting accounting safe for -rt kernel Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set ep_poll_safewake() can take several non-raw spinlocks after disabling interrupts. Since a spinlock can block in the -rt kernel, we can't take a spinlock after disabling interrupts. So let's re-work how we determine the nesting level such that it plays nicely with the -rt kernel. Let's introduce a 'nests' field in struct eventpoll that records the current nesting level during ep_poll_callback().
Then, if we nest again we can find the previous struct eventpoll that we were called from and increase our count by 1. The 'nests' field is protected by ep->poll_wait.lock. I've also moved the visited field to reduce the size of struct eventpoll from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which is typical for a production config. Reported-by: Davidlohr Bueso Signed-off-by: Jason Baron Signed-off-by: Andrew Morton Reviewed-by: Davidlohr Bueso Cc: Roman Penyaev Cc: Eric Wong Cc: Al Viro Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.com Signed-off-by: Linus Torvalds --- fs/eventpoll.c | 64 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 43 insertions(+), 21 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index eee3c92a9ebf..8c596641a72b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -218,13 +218,18 @@ struct eventpoll { struct file *file; /* used to optimize loop detection check */ - int visited; struct list_head visited_list_link; + int visited; #ifdef CONFIG_NET_RX_BUSY_POLL /* used to track busy poll napi_id */ unsigned int napi_id; #endif + +#ifdef CONFIG_DEBUG_LOCK_ALLOC + /* tracks wakeup nests for lockdep validation */ + u8 nests; +#endif }; /* Wait structure used by the poll hooks */ @@ -545,30 +550,47 @@ out_unlock: */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -static DEFINE_PER_CPU(int, wakeup_nest); - -static void ep_poll_safewake(wait_queue_head_t *wq) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) { + struct eventpoll *ep_src; unsigned long flags; - int subclass; + u8 nests = 0; - local_irq_save(flags); - preempt_disable(); - subclass = __this_cpu_read(wakeup_nest); - spin_lock_nested(&wq->lock, subclass + 1); - __this_cpu_inc(wakeup_nest); - wake_up_locked_poll(wq, POLLIN); - __this_cpu_dec(wakeup_nest); - spin_unlock(&wq->lock); - local_irq_restore(flags); - preempt_enable(); + /* + * To set the subclass or nesting level for spin_lock_irqsave_nested() + * it might be natural to create a per-cpu nest count. However, since + * we can recurse on ep->poll_wait.lock, and a non-raw spinlock can + * schedule() in the -rt kernel, the per-cpu variable are no longer + * protected. Thus, we are introducing a per eventpoll nest field. + * If we are not being call from ep_poll_callback(), epi is NULL and + * we are at the first level of nesting, 0. Otherwise, we are being + * called from ep_poll_callback() and if a previous wakeup source is + * not an epoll file itself, we are at depth 1 since the wakeup source + * is depth 0. If the wakeup source is a previous epoll file in the + * wakeup chain then we use its nests value and record ours as + * nests + 1. The previous epoll file nests value is stable since its + * already holding its own poll_wait.lock. 
+ */ + if (epi) { + if ((is_file_epoll(epi->ffd.file))) { + ep_src = epi->ffd.file->private_data; + nests = ep_src->nests; + } else { + nests = 1; + } + } + spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests); + ep->nests = nests + 1; + wake_up_locked_poll(&ep->poll_wait, EPOLLIN); + ep->nests = 0; + spin_unlock_irqrestore(&ep->poll_wait.lock, flags); } #else -static void ep_poll_safewake(wait_queue_head_t *wq) +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi) { - wake_up_poll(wq, EPOLLIN); + wake_up_poll(&ep->poll_wait, EPOLLIN); } #endif @@ -789,7 +811,7 @@ static void ep_free(struct eventpoll *ep) /* We need to release all tasks waiting for these file */ if (waitqueue_active(&ep->poll_wait)) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); /* * We need to lock this because we could be hit by @@ -1258,7 +1280,7 @@ out_unlock: /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, epi); if (!(epi->event.events & EPOLLEXCLUSIVE)) ewake = 1; @@ -1562,7 +1584,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); return 0; @@ -1666,7 +1688,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi, /* We have to call this outside the lock */ if (pwake) - ep_poll_safewake(&ep->poll_wait); + ep_poll_safewake(ep, NULL); return 0; } From c69bcc932ef3568a13cf6d67398cf5e9da88e812 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:11:26 -0700 Subject: [PATCH 138/158] fs/binfmt_elf.c: delete "loc" variable "loc" variable became just a wrapper for PT_INTERP ELF header after main ELF header was moved to "bprm->buf". Delete it. 
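Reduced to its essentials, the cleanup has this shape (a sketch with error handling omitted; see the diff for the real thing):

	/* Before: a heap-allocated struct wrapping a single member. */
	struct {
		struct elfhdr interp_elf_ex;
	} *loc = kmalloc(sizeof(*loc), GFP_KERNEL);

	/* After: allocate the member directly, dropping the loc-> hop. */
	struct elfhdr *interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL);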
Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200219184847.GA4871@avx2 Signed-off-by: Linus Torvalds --- fs/binfmt_elf.c | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 1eb63867e266..3cb888cb2b2b 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -699,15 +699,13 @@ static int load_elf_binary(struct linux_binprm *bprm) unsigned long reloc_func_desc __maybe_unused = 0; int executable_stack = EXSTACK_DEFAULT; struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf; - struct { - struct elfhdr interp_elf_ex; - } *loc; + struct elfhdr *interp_elf_ex; struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE; struct mm_struct *mm; struct pt_regs *regs; - loc = kmalloc(sizeof(*loc), GFP_KERNEL); - if (!loc) { + interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL); + if (!interp_elf_ex) { retval = -ENOMEM; goto out_ret; } @@ -772,8 +770,8 @@ static int load_elf_binary(struct linux_binprm *bprm) would_dump(bprm, interpreter); /* Get the exec headers */ - retval = elf_read(interpreter, &loc->interp_elf_ex, - sizeof(loc->interp_elf_ex), 0); + retval = elf_read(interpreter, interp_elf_ex, + sizeof(*interp_elf_ex), 0); if (retval < 0) goto out_free_dentry; @@ -807,25 +805,25 @@ out_free_interp: if (interpreter) { retval = -ELIBBAD; /* Not an ELF interpreter */ - if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0) + if (memcmp(interp_elf_ex->e_ident, ELFMAG, SELFMAG) != 0) goto out_free_dentry; /* Verify the interpreter has a valid arch */ - if (!elf_check_arch(&loc->interp_elf_ex) || - elf_check_fdpic(&loc->interp_elf_ex)) + if (!elf_check_arch(interp_elf_ex) || + elf_check_fdpic(interp_elf_ex)) goto out_free_dentry; /* Load the interpreter program headers */ - interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex, + interp_elf_phdata = load_elf_phdrs(interp_elf_ex, interpreter); if (!interp_elf_phdata) goto out_free_dentry; /* Pass PT_LOPROC..PT_HIPROC headers to arch code */ elf_ppnt = interp_elf_phdata; - for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++) + for (i = 0; i < interp_elf_ex->e_phnum; i++, elf_ppnt++) switch (elf_ppnt->p_type) { case PT_LOPROC ... PT_HIPROC: - retval = arch_elf_pt_proc(&loc->interp_elf_ex, + retval = arch_elf_pt_proc(interp_elf_ex, elf_ppnt, interpreter, true, &arch_state); if (retval) @@ -840,7 +838,7 @@ out_free_interp: * the exec syscall. */ retval = arch_check_elf(elf_ex, - !!interpreter, &loc->interp_elf_ex, + !!interpreter, interp_elf_ex, &arch_state); if (retval) goto out_free_dentry; @@ -1056,7 +1054,7 @@ out_free_interp: } if (interpreter) { - elf_entry = load_elf_interp(&loc->interp_elf_ex, + elf_entry = load_elf_interp(interp_elf_ex, interpreter, load_bias, interp_elf_phdata); if (!IS_ERR((void *)elf_entry)) { @@ -1065,7 +1063,7 @@ out_free_interp: * adjustment */ interp_load_addr = elf_entry; - elf_entry += loc->interp_elf_ex.e_entry; + elf_entry += interp_elf_ex->e_entry; } if (BAD_ADDR(elf_entry)) { retval = IS_ERR((void *)elf_entry) ? @@ -1154,7 +1152,7 @@ out_free_interp: start_thread(regs, elf_entry, bprm->p); retval = 0; out: - kfree(loc); + kfree(interp_elf_ex); out_ret: return retval; From 0693ffebcfe5ac7b31f63ad54587007f7d96fb7b Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:11:29 -0700 Subject: [PATCH 139/158] fs/binfmt_elf.c: allocate less for static executable PT_INTERP ELF header can be spared if executable is static. 
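The shape of the change, as a sketch (condensed from the diff; labels as in the real function):

	struct elfhdr *interp_elf_ex = NULL;

	/* ... validate the main ELF header first ... */

	if (interpreter) {
		/* Only dynamically linked binaries pay for the second header. */
		interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL);
		if (!interp_elf_ex) {
			retval = -ENOMEM;
			goto out_free_ph;
		}
		/* ... read and check the interpreter's header ... */
	}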
Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200219185012.GB4871@avx2 Signed-off-by: Linus Torvalds --- fs/binfmt_elf.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 3cb888cb2b2b..a3a9429ef1d2 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -699,17 +699,11 @@ static int load_elf_binary(struct linux_binprm *bprm) unsigned long reloc_func_desc __maybe_unused = 0; int executable_stack = EXSTACK_DEFAULT; struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf; - struct elfhdr *interp_elf_ex; + struct elfhdr *interp_elf_ex = NULL; struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE; struct mm_struct *mm; struct pt_regs *regs; - interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL); - if (!interp_elf_ex) { - retval = -ENOMEM; - goto out_ret; - } - retval = -ENOEXEC; /* First of all, some simple consistency checks */ if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0) @@ -769,6 +763,12 @@ static int load_elf_binary(struct linux_binprm *bprm) */ would_dump(bprm, interpreter); + interp_elf_ex = kmalloc(sizeof(*interp_elf_ex), GFP_KERNEL); + if (!interp_elf_ex) { + retval = -ENOMEM; + goto out_free_ph; + } + /* Get the exec headers */ retval = elf_read(interpreter, interp_elf_ex, sizeof(*interp_elf_ex), 0); @@ -1074,6 +1074,8 @@ out_free_interp: allow_write_access(interpreter); fput(interpreter); + + kfree(interp_elf_ex); } else { elf_entry = e_entry; if (BAD_ADDR(elf_entry)) { @@ -1152,12 +1154,11 @@ out_free_interp: start_thread(regs, elf_entry, bprm->p); retval = 0; out: - kfree(interp_elf_ex); -out_ret: return retval; /* error cleanup */ out_free_dentry: + kfree(interp_elf_ex); kfree(interp_elf_phdata); allow_write_access(interpreter); if (interpreter) From aa0d1564b10f9165f913229d4428bdeea4e0d945 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:11:32 -0700 Subject: [PATCH 140/158] fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path Static executables don't need to free a NULL pointer. It doesn't really matter, because a static executable is not a common scenario, but do it anyway out of pedantry. Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200219185330.GA4933@avx2 Signed-off-by: Linus Torvalds --- fs/binfmt_elf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index a3a9429ef1d2..13f25e241ac4 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1076,6 +1076,7 @@ out_free_interp: fput(interpreter); kfree(interp_elf_ex); + kfree(interp_elf_phdata); } else { elf_entry = e_entry; if (BAD_ADDR(elf_entry)) { @@ -1084,7 +1085,6 @@ out_free_interp: } } - kfree(interp_elf_phdata); kfree(elf_phdata); set_binfmt(&elf_format); From 4800314e19d976d88c87e99bcd37d7635eb57f78 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 6 Apr 2020 20:11:36 -0700 Subject: [PATCH 141/158] samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes Patch series "Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()". Despite having just a single modular in-tree user that I could spot, kallsyms_lookup_name() is exported to modules and provides a mechanism for out-of-tree modules to access and invoke arbitrary, non-exported kernel symbols when kallsyms is enabled. This patch series fixes up that one user and unexports the symbol along with kallsyms_on_each_symbol(), since that could also be abused in a similar manner.
I would like to avoid out-of-tree modules being easily able to call functions that are not exported. kallsyms_lookup_name() makes this trivial to the point that there is very little incentive to rework these modules to either use upstream interfaces correctly or propose functionality which may be otherwise missing upstream. Both of these latter solutions would be pre-requisites to upstreaming these modules, and the current state of things actively discourages that approach. The background here is that we are aiming for Android devices to be able to use a generic binary kernel image closely following upstream, with any vendor extensions coming in as kernel modules. In this case, we (Google) end up maintaining the binary module ABI within the scope of a single LTS kernel. Monitoring and managing the ABI surface is not feasible if it effectively includes all data and functions via kallsyms_lookup_name(). Of course, we could just carry this patch in the Android kernel tree, but we're aiming to carry as little as possible (ideally nothing) and I think it's a sensible change in its own right. I'm surprised you object to it, in all honesty. Now, you could turn around and say "that's not upstream's problem", but it still seems highly undesirable to me to have an upstream bypass for exported symbols that isn't even used by upstream modules. It's ripe for abuse and encourages people to work outside of the upstream tree. The usual rule is that we don't export symbols without a user in the tree and that seems especially relevant in this case. Joe Lawrence said: : FWIW, kallsyms was historically used by the out-of-tree kpatch support : module to resolve external symbols as well as call set_memory_r{w,o}() : API. All of that support code has been merged upstream, so modern kpatch : modules* no longer leverage kallsyms by default. : : That said, there are still some users who still use the deprecated support : module with newer kernels, but that is not officially supported by the : project. This patch (of 3): Given the name of a kernel symbol, the 'data_breakpoint' test claims to "report any write operations on the kernel symbol". However, it creates the breakpoint using both HW_BREAKPOINT_W and HW_BREAKPOINT_R, which means it also fires for read access. Drop HW_BREAKPOINT_R from the breakpoint attributes.
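Condensed from the sample module, the corrected setup looks roughly like this (a sketch; the overflow handler body is omitted):

	#include <linux/err.h>
	#include <linux/hw_breakpoint.h>
	#include <linux/perf_event.h>

	static void wp_handler(struct perf_event *bp,
			       struct perf_sample_data *data,
			       struct pt_regs *regs);

	static struct perf_event * __percpu *wp;

	static int watch_writes_to(unsigned long addr)
	{
		struct perf_event_attr attr;

		hw_breakpoint_init(&attr);
		attr.bp_addr = addr;
		attr.bp_len = HW_BREAKPOINT_LEN_4;
		attr.bp_type = HW_BREAKPOINT_W;	/* writes only, no HW_BREAKPOINT_R */

		wp = register_wide_hw_breakpoint(&attr, wp_handler, NULL);
		return IS_ERR((void __force *)wp) ? PTR_ERR((void __force *)wp) : 0;
	}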
Signed-off-by: Will Deacon Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Reviewed-by: Masami Hiramatsu Reviewed-by: Quentin Perret Cc: K.Prasad Cc: Thomas Gleixner Cc: Greg Kroah-Hartman Cc: Frederic Weisbecker Cc: Alexei Starovoitov Cc: Miroslav Benes Cc: Petr Mladek Cc: Joe Lawrence Link: http://lkml.kernel.org/r/20200221114404.14641-2-will@kernel.org Signed-off-by: Linus Torvalds --- samples/hw_breakpoint/data_breakpoint.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c index c58504774118..469b36f93696 100644 --- a/samples/hw_breakpoint/data_breakpoint.c +++ b/samples/hw_breakpoint/data_breakpoint.c @@ -45,7 +45,7 @@ static int __init hw_break_module_init(void) hw_breakpoint_init(&attr); attr.bp_addr = kallsyms_lookup_name(ksym_name); attr.bp_len = HW_BREAKPOINT_LEN_4; - attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R; + attr.bp_type = HW_BREAKPOINT_W; sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL); if (IS_ERR((void __force *)sample_hbp)) { From d8a84d33a4954b85c53faf77be48b5c2b559692e Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 6 Apr 2020 20:11:39 -0700 Subject: [PATCH 142/158] samples/hw_breakpoint: drop use of kallsyms_lookup_name() The 'data_breakpoint' test code is the only modular user of kallsyms_lookup_name(), which was exported as part of fixing the test in f60d24d2ad04 ("hw-breakpoints: Fix broken hw-breakpoint sample module"). In preparation for un-exporting this symbol, switch the test over to using __symbol_get(), which can be used to place breakpoints on exported symbols. Signed-off-by: Will Deacon Signed-off-by: Andrew Morton Reviewed-by: Christoph Hellwig Reviewed-by: Masami Hiramatsu Reviewed-by: Quentin Perret Cc: K.Prasad Cc: Thomas Gleixner Cc: Greg Kroah-Hartman Cc: Frederic Weisbecker Cc: Alexei Starovoitov Cc: Miroslav Benes Cc: Petr Mladek Cc: Joe Lawrence Link: http://lkml.kernel.org/r/20200221114404.14641-3-will@kernel.org Signed-off-by: Linus Torvalds --- samples/hw_breakpoint/data_breakpoint.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c index 469b36f93696..418c46fe5ffc 100644 --- a/samples/hw_breakpoint/data_breakpoint.c +++ b/samples/hw_breakpoint/data_breakpoint.c @@ -23,7 +23,7 @@ struct perf_event * __percpu *sample_hbp; -static char ksym_name[KSYM_NAME_LEN] = "pid_max"; +static char ksym_name[KSYM_NAME_LEN] = "jiffies"; module_param_string(ksym, ksym_name, KSYM_NAME_LEN, S_IRUGO); MODULE_PARM_DESC(ksym, "Kernel symbol to monitor; this module will report any" " write operations on the kernel symbol"); @@ -41,9 +41,13 @@ static int __init hw_break_module_init(void) { int ret; struct perf_event_attr attr; + void *addr = __symbol_get(ksym_name); + + if (!addr) + return -ENXIO; hw_breakpoint_init(&attr); - attr.bp_addr = kallsyms_lookup_name(ksym_name); + attr.bp_addr = (unsigned long)addr; attr.bp_len = HW_BREAKPOINT_LEN_4; attr.bp_type = HW_BREAKPOINT_W; @@ -66,6 +70,7 @@ fail: static void __exit hw_break_module_exit(void) { unregister_wide_hw_breakpoint(sample_hbp); + symbol_put(ksym_name); printk(KERN_INFO "HW Breakpoint for %s write uninstalled\n", ksym_name); } From 0bd476e6c67190b5eb7b6e105c8db8ff61103281 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 6 Apr 2020 20:11:43 -0700 Subject: [PATCH 143/158] kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol() 
kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to modules despite having no in-tree users and being wide open to abuse by out-of-tree modules that can use them as a method to invoke arbitrary non-exported kernel functions. Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol(). Signed-off-by: Will Deacon Signed-off-by: Andrew Morton Reviewed-by: Greg Kroah-Hartman Reviewed-by: Christoph Hellwig Reviewed-by: Masami Hiramatsu Reviewed-by: Quentin Perret Acked-by: Alexei Starovoitov Cc: Thomas Gleixner Cc: Frederic Weisbecker Cc: K.Prasad Cc: Miroslav Benes Cc: Petr Mladek Cc: Joe Lawrence Cc: Mathieu Desnoyers Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org Signed-off-by: Linus Torvalds --- kernel/kallsyms.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index a9b3f660dee7..16c8c605f4b0 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -175,7 +175,6 @@ unsigned long kallsyms_lookup_name(const char *name) } return module_kallsyms_lookup_name(name); } -EXPORT_SYMBOL_GPL(kallsyms_lookup_name); int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, unsigned long), @@ -194,7 +193,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, } return module_kallsyms_on_each_symbol(fn, data); } -EXPORT_SYMBOL_GPL(kallsyms_on_each_symbol); static unsigned long get_symbol_pos(unsigned long addr, unsigned long *symbolsize, From 5404e7e0ac0cd03b83dd2476155254ee1e50529f Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Mon, 6 Apr 2020 20:11:46 -0700 Subject: [PATCH 144/158] reiserfs: clean up several indentation issues There are several places where code is indented incorrectly. Fix these. Signed-off-by: Colin Ian King Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200325135018.113431-1-colin.king@canonical.com Signed-off-by: Linus Torvalds --- fs/reiserfs/do_balan.c | 2 +- fs/reiserfs/ioctl.c | 11 ++++++----- fs/reiserfs/namei.c | 10 +++++----- 3 files changed, 12 insertions(+), 11 deletions(-) diff --git a/fs/reiserfs/do_balan.c b/fs/reiserfs/do_balan.c index 4075e41408b4..5129efc6f2e6 100644 --- a/fs/reiserfs/do_balan.c +++ b/fs/reiserfs/do_balan.c @@ -842,7 +842,7 @@ static void balance_leaf_paste_right_whole(struct tree_balance *tb, struct item_head *pasted; struct buffer_info bi; - buffer_info_init_right(tb, &bi); + buffer_info_init_right(tb, &bi); leaf_shift_right(tb, tb->rnum[0], tb->rbytes); /* append item in R[0] */ diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c index 45e1a5d11af3..adb21bea3d60 100644 --- a/fs/reiserfs/ioctl.c +++ b/fs/reiserfs/ioctl.c @@ -184,11 +184,12 @@ int reiserfs_unpack(struct inode *inode, struct file *filp) } /* we need to make sure nobody is changing the file size beneath us */ -{ - int depth = reiserfs_write_unlock_nested(inode->i_sb); - inode_lock(inode); - reiserfs_write_lock_nested(inode->i_sb, depth); -} + { + int depth = reiserfs_write_unlock_nested(inode->i_sb); + + inode_lock(inode); + reiserfs_write_lock_nested(inode->i_sb, depth); + } reiserfs_write_lock(inode->i_sb); diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c index 959a066b7bb0..1594687582f0 100644 --- a/fs/reiserfs/namei.c +++ b/fs/reiserfs/namei.c @@ -838,10 +838,10 @@ static int reiserfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode */ INC_DIR_INODE_NLINK(dir) - retval = reiserfs_new_inode(&th, dir, mode, NULL /*symlink */ , - old_format_only(dir->i_sb) ? 
- EMPTY_DIR_SIZE_V1 : EMPTY_DIR_SIZE, - dentry, inode, &security); + retval = reiserfs_new_inode(&th, dir, mode, NULL /*symlink */, + old_format_only(dir->i_sb) ? + EMPTY_DIR_SIZE_V1 : EMPTY_DIR_SIZE, + dentry, inode, &security); if (retval) { DEC_DIR_INODE_NLINK(dir) goto out_failed; @@ -967,7 +967,7 @@ static int reiserfs_rmdir(struct inode *dir, struct dentry *dentry) reiserfs_update_sd(&th, inode); DEC_DIR_INODE_NLINK(dir) - dir->i_size -= (DEH_SIZE + de.de_entrylen); + dir->i_size -= (DEH_SIZE + de.de_entrylen); reiserfs_update_sd(&th, dir); /* prevent empty directory from getting lost */ From 06d4f8152a01034b36591885321da54eb77398c1 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Mon, 6 Apr 2020 20:11:49 -0700 Subject: [PATCH 145/158] kernel/kmod.c: fix a typo "assuems" -> "assumes" There is a typo in comment. Fix it. s/assuems/assumes/ Signed-off-by: Qiujun Huang Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Link: http://lkml.kernel.org/r/1585891029-6450-1-git-send-email-hqjagain@gmail.com Signed-off-by: Linus Torvalds --- kernel/kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/kmod.c b/kernel/kmod.c index bc6addd9152b..8b2b311afa95 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -35,7 +35,7 @@ * (u64) THREAD_SIZE * 8UL); * * If you need less than 50 threads would mean we're dealing with systems - * smaller than 3200 pages. This assuems you are capable of having ~13M memory, + * smaller than 3200 pages. This assumes you are capable of having ~13M memory, * and this would only be an be an upper limit, after which the OOM killer * would take effect. Systems like these are very unlikely if modules are * enabled. From fba4168edecdd2781bcd83cb131977ec1157f87c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:11:52 -0700 Subject: [PATCH 146/158] gcov: gcc_4_7: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. 
Silva Signed-off-by: Andrew Morton Acked-by: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/gcc_4_7.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/gcov/gcc_4_7.c b/kernel/gcov/gcc_4_7.c index ec37563674d6..908fdf5098c3 100644 --- a/kernel/gcov/gcc_4_7.c +++ b/kernel/gcov/gcc_4_7.c @@ -68,7 +68,7 @@ struct gcov_fn_info { unsigned int ident; unsigned int lineno_checksum; unsigned int cfg_checksum; - struct gcov_ctr_info ctrs[0]; + struct gcov_ctr_info ctrs[]; }; /** From 7ff87182d156e90b59455aac8503da512a1c5dba Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:11:55 -0700 Subject: [PATCH 147/158] gcov: gcc_3_4: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200302224501.GA14175@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/gcc_3_4.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/gcov/gcc_3_4.c b/kernel/gcov/gcc_3_4.c index 801ee4b0b969..acb83558e5df 100644 --- a/kernel/gcov/gcc_3_4.c +++ b/kernel/gcov/gcc_3_4.c @@ -38,7 +38,7 @@ static struct gcov_info *gcov_info_head; struct gcov_fn_info { unsigned int ident; unsigned int checksum; - unsigned int n_ctrs[0]; + unsigned int n_ctrs[]; }; /** @@ -78,7 +78,7 @@ struct gcov_info { unsigned int n_functions; const struct gcov_fn_info *functions; unsigned int ctr_mask; - struct gcov_ctr_info counts[0]; + struct gcov_ctr_info counts[]; }; /** @@ -352,7 +352,7 @@ struct gcov_iterator { unsigned int count; int num_types; - struct type_info type_info[0]; + struct type_info type_info[]; }; static struct gcov_fn_info *get_func(struct gcov_iterator *iter) From 6524d79413f4ffd28480f541ef68f9ea199b2e0c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. 
Silva" Date: Mon, 6 Apr 2020 20:11:58 -0700 Subject: [PATCH 148/158] kernel/gcov/fs.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200302224851.GA26467@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/fs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/gcov/fs.c b/kernel/gcov/fs.c index e5eb5ea7ea59..5e891c3c2d93 100644 --- a/kernel/gcov/fs.c +++ b/kernel/gcov/fs.c @@ -58,7 +58,7 @@ struct gcov_node { struct dentry *dentry; struct dentry **links; int num_loaded; - char name[0]; + char name[]; }; static const char objtree[] = OBJTREE; From 7baf219982281034f8c3a806869873bf098fb611 Mon Sep 17 00:00:00 2001 From: Krzysztof Kozlowski Date: Mon, 6 Apr 2020 20:12:02 -0700 Subject: [PATCH 149/158] init/Kconfig: clean up ANON_INODES and old IO schedulers options CONFIG_ANON_INODES is gone since commit 5dd50aaeb185 ("Make anon_inodes unconditional"). CONFIG_CFQ_GROUP_IOSCHED was replaced with CONFIG_BFQ_GROUP_IOSCHED in commit f382fb0bcef4 ("block: remove legacy IO schedulers"). Signed-off-by: Krzysztof Kozlowski Signed-off-by: Andrew Morton Cc: Masahiro Yamada Cc: Thomas Gleixner Cc: Greg Kroah-Hartman Link: http://lkml.kernel.org/r/20200130192419.3026-1-krzk@kernel.org Signed-off-by: Linus Torvalds --- init/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 2bc0e47689d8..574b7215aa6a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -872,7 +872,7 @@ config BLK_CGROUP This option only enables generic Block IO controller infrastructure. One needs to also enable actual IO controlling logic/policy. For enabling proportional weight division of disk bandwidth in CFQ, set - CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set + CONFIG_BFQ_GROUP_IOSCHED=y; for enabling throttling policy, set CONFIG_BLK_DEV_THROTTLING=y. See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information. @@ -1538,7 +1538,6 @@ config AIO config IO_URING bool "Enable IO uring support" if EXPERT - select ANON_INODES select IO_WQ default y help From 0887a7ebc97770c7870abf3075a2e8cd502a7f52 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:27 -0700 Subject: [PATCH 150/158] ubsan: add trap instrumentation option Patch series "ubsan: Split out bounds checker", v5. This splits out the bounds checker so it can be individually used. 
This is enabled in Android and hopefully for syzbot. Includes LKDTM tests for behavioral corner-cases (beyond just the bounds checker), and adjusts ubsan and kasan slightly for correct panic handling. This patch (of 6): The Undefined Behavior Sanitizer can operate in two modes: warning reporting mode via lib/ubsan.c handler calls, or trap mode, which uses __builtin_trap() as the handler. Using lib/ubsan.c means the kernel image is about 5% larger (due to all the debugging text and reporting structures to capture details about the warning conditions). Using the trap mode, the image size changes are much smaller, though at the loss of the "warning only" mode. In order to give greater flexibility to system builders that want minimal changes to image size and are prepared to deal with kernel code being aborted and potentially destabilizing the system, this introduces CONFIG_UBSAN_TRAP. The resulting image sizes comparison: text data bss dec hex filename 19533663 6183037 18554956 44271656 2a38828 vmlinux.stock 19991849 7618513 18874448 46484810 2c54d4a vmlinux.ubsan 19712181 6284181 18366540 44362902 2a4ec96 vmlinux.ubsan-trap CONFIG_UBSAN=y: image +4.8% (text +2.3%, data +18.9%) CONFIG_UBSAN_TRAP=y: image +0.2% (text +0.9%, data +1.6%) Additionally adjusts the CONFIG_UBSAN Kconfig help for clarity and removes the mention of non-existing boot param "ubsan_handle". Suggested-by: Elena Petrova Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Acked-by: Dmitry Vyukov Cc: Andrey Ryabinin Cc: Andrey Konovalov Cc: Alexander Potapenko Cc: Dan Carpenter Cc: "Gustavo A. R. Silva" Cc: Arnd Bergmann Cc: Ard Biesheuvel Link: http://lkml.kernel.org/r/20200227193516.32566-2-keescook@chromium.org Signed-off-by: Linus Torvalds --- lib/Kconfig.ubsan | 22 ++++++++++++++++++---- lib/Makefile | 2 ++ scripts/Makefile.ubsan | 9 +++++++-- 3 files changed, 27 insertions(+), 6 deletions(-) diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan index 0e04fcb3ab3d..9deb655838b0 100644 --- a/lib/Kconfig.ubsan +++ b/lib/Kconfig.ubsan @@ -5,11 +5,25 @@ config ARCH_HAS_UBSAN_SANITIZE_ALL config UBSAN bool "Undefined behaviour sanity checker" help - This option enables undefined behaviour sanity checker + This option enables the Undefined Behaviour sanity checker. Compile-time instrumentation is used to detect various undefined - behaviours in runtime. Various types of checks may be enabled - via boot parameter ubsan_handle - (see: Documentation/dev-tools/ubsan.rst). + behaviours at runtime. For more details, see: + Documentation/dev-tools/ubsan.rst + +config UBSAN_TRAP + bool "On Sanitizer warnings, abort the running kernel code" + depends on UBSAN + depends on $(cc-option, -fsanitize-undefined-trap-on-error) + help + Building kernels with Sanitizer features enabled tends to grow + the kernel size by around 5%, due to adding all the debugging + text on failure paths. To avoid this, Sanitizer instrumentation + can just issue a trap. This reduces the kernel size overhead but + turns all warnings (including potentially harmless conditions) + into full exceptions that abort the running kernel code + (regardless of context, locks held, etc), which may destabilize + the system. For some system builders this is an acceptable + trade-off. 
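The trade-off described in the help text above can be reproduced in userspace. The sketch below is illustrative only (the file name demo.c is hypothetical, and the kernel wires the same flags up through scripts/Makefile.ubsan as the diff shows); it assumes a GCC or Clang toolchain:

#include <limits.h>
#include <stdio.h>

/*
 * Build two ways and compare:
 *   cc -fsanitize=signed-integer-overflow demo.c
 *     -> handler mode: a runtime report is printed and execution continues
 *   cc -fsanitize=signed-integer-overflow -fsanitize-undefined-trap-on-error demo.c
 *     -> trap mode: no report code is linked in; the compiler plants a trap
 *        instruction and the process dies with SIGILL at the overflow
 */
int main(void)
{
	volatile int x = INT_MAX;	/* volatile defeats constant folding */

	x += 1;				/* signed overflow: undefined behaviour */
	printf("still alive: %d\n", x);	/* reached only in handler mode */
	return 0;
}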
config UBSAN_SANITIZE_ALL bool "Enable instrumentation for the entire kernel" diff --git a/lib/Makefile b/lib/Makefile index 3fc06399295d..685aee60de1d 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -286,7 +286,9 @@ quiet_cmd_build_OID_registry = GEN $@ clean-files += oid_registry_data.c obj-$(CONFIG_UCS2_STRING) += ucs2_string.o +ifneq ($(CONFIG_UBSAN_TRAP),y) obj-$(CONFIG_UBSAN) += ubsan.o +endif UBSAN_SANITIZE_ubsan.o := n KASAN_SANITIZE_ubsan.o := n diff --git a/scripts/Makefile.ubsan b/scripts/Makefile.ubsan index 019771b845c5..668a91510bfe 100644 --- a/scripts/Makefile.ubsan +++ b/scripts/Makefile.ubsan @@ -1,5 +1,10 @@ # SPDX-License-Identifier: GPL-2.0 ifdef CONFIG_UBSAN + +ifdef CONFIG_UBSAN_ALIGNMENT + CFLAGS_UBSAN += $(call cc-option, -fsanitize=alignment) +endif + CFLAGS_UBSAN += $(call cc-option, -fsanitize=shift) CFLAGS_UBSAN += $(call cc-option, -fsanitize=integer-divide-by-zero) CFLAGS_UBSAN += $(call cc-option, -fsanitize=unreachable) @@ -9,8 +14,8 @@ ifdef CONFIG_UBSAN CFLAGS_UBSAN += $(call cc-option, -fsanitize=bool) CFLAGS_UBSAN += $(call cc-option, -fsanitize=enum) -ifdef CONFIG_UBSAN_ALIGNMENT - CFLAGS_UBSAN += $(call cc-option, -fsanitize=alignment) +ifdef CONFIG_UBSAN_TRAP + CFLAGS_UBSAN += $(call cc-option, -fsanitize-undefined-trap-on-error) endif # -fsanitize=* options makes GCC less smart than usual and From 277a10850f9f4cb3429faf59293e2c89b1a320be Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:31 -0700 Subject: [PATCH 151/158] ubsan: split "bounds" checker from other options In order to do kernel builds with the bounds checker individually available, introduce CONFIG_UBSAN_BOUNDS, with the remaining options under CONFIG_UBSAN_MISC. For example, using this, we can start to expand the coverage syzkaller is providing. Right now, all of UBSan is disabled for syzbot builds because taken as a whole, it is too noisy. This will let us focus on one feature at a time. For the bounds checker specifically, this provides a mechanism to eliminate an entire class of array overflows with close to zero performance overhead (I cannot measure a difference). In my (mostly) defconfig, enabling bounds checking adds ~4200 checks to the kernel. Performance changes are in the noise, likely due to the branch predictors optimizing for the non-fail path. Some notes on the bounds checker: - it does not instrument {mem,str}*()-family functions, it only instruments direct indexed accesses (e.g. "foo[i]"). Dealing with the {mem,str}*()-family functions is a work-in-progress around CONFIG_FORTIFY_SOURCE[1]. - it ignores flexible array members, including the very old single byte (e.g. "int foo[1];") declarations. (Note that GCC's implementation appears to ignore _all_ trailing arrays, but Clang only ignores empty, 0, and 1 byte arrays[2].) [1] https://github.com/KSPP/linux/issues/6 [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92589 Suggested-by: Elena Petrova Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Reviewed-by: Andrey Ryabinin Acked-by: Dmitry Vyukov Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Dan Carpenter Cc: "Gustavo A. R. 
Silva" Link: http://lkml.kernel.org/r/20200227193516.32566-3-keescook@chromium.org Signed-off-by: Linus Torvalds --- lib/Kconfig.ubsan | 29 ++++++++++++++++++++++++----- scripts/Makefile.ubsan | 7 ++++++- 2 files changed, 30 insertions(+), 6 deletions(-) diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan index 9deb655838b0..48469c95d78e 100644 --- a/lib/Kconfig.ubsan +++ b/lib/Kconfig.ubsan @@ -2,7 +2,7 @@ config ARCH_HAS_UBSAN_SANITIZE_ALL bool -config UBSAN +menuconfig UBSAN bool "Undefined behaviour sanity checker" help This option enables the Undefined Behaviour sanity checker. @@ -10,9 +10,10 @@ config UBSAN behaviours at runtime. For more details, see: Documentation/dev-tools/ubsan.rst +if UBSAN + config UBSAN_TRAP bool "On Sanitizer warnings, abort the running kernel code" - depends on UBSAN depends on $(cc-option, -fsanitize-undefined-trap-on-error) help Building kernels with Sanitizer features enabled tends to grow @@ -25,9 +26,26 @@ config UBSAN_TRAP the system. For some system builders this is an acceptable trade-off. +config UBSAN_BOUNDS + bool "Perform array index bounds checking" + default UBSAN + help + This option enables detection of directly indexed out of bounds + array accesses, where the array size is known at compile time. + Note that this does not protect array overflows via bad calls + to the {str,mem}*cpy() family of functions (that is addressed + by CONFIG_FORTIFY_SOURCE). + +config UBSAN_MISC + bool "Enable all other Undefined Behavior sanity checks" + default UBSAN + help + This option enables all sanity checks that don't have their + own Kconfig options. Disable this if you only want to have + individually selected checks. + config UBSAN_SANITIZE_ALL bool "Enable instrumentation for the entire kernel" - depends on UBSAN depends on ARCH_HAS_UBSAN_SANITIZE_ALL # We build with -Wno-maybe-uninitilzed, but we still want to @@ -44,7 +62,6 @@ config UBSAN_SANITIZE_ALL config UBSAN_NO_ALIGNMENT bool "Disable checking of pointers alignment" - depends on UBSAN default y if HAVE_EFFICIENT_UNALIGNED_ACCESS help This option disables the check of unaligned memory accesses. @@ -57,7 +74,9 @@ config UBSAN_ALIGNMENT config TEST_UBSAN tristate "Module for testing for undefined behavior detection" - depends on m && UBSAN + depends on m help This is a test module for UBSAN. It triggers various undefined behavior, and detect it. 
+ +endif # if UBSAN diff --git a/scripts/Makefile.ubsan b/scripts/Makefile.ubsan index 668a91510bfe..5b15bc425ec9 100644 --- a/scripts/Makefile.ubsan +++ b/scripts/Makefile.ubsan @@ -5,14 +5,19 @@ ifdef CONFIG_UBSAN_ALIGNMENT CFLAGS_UBSAN += $(call cc-option, -fsanitize=alignment) endif +ifdef CONFIG_UBSAN_BOUNDS + CFLAGS_UBSAN += $(call cc-option, -fsanitize=bounds) +endif + +ifdef CONFIG_UBSAN_MISC CFLAGS_UBSAN += $(call cc-option, -fsanitize=shift) CFLAGS_UBSAN += $(call cc-option, -fsanitize=integer-divide-by-zero) CFLAGS_UBSAN += $(call cc-option, -fsanitize=unreachable) CFLAGS_UBSAN += $(call cc-option, -fsanitize=signed-integer-overflow) - CFLAGS_UBSAN += $(call cc-option, -fsanitize=bounds) CFLAGS_UBSAN += $(call cc-option, -fsanitize=object-size) CFLAGS_UBSAN += $(call cc-option, -fsanitize=bool) CFLAGS_UBSAN += $(call cc-option, -fsanitize=enum) +endif ifdef CONFIG_UBSAN_TRAP CFLAGS_UBSAN += $(call cc-option, -fsanitize-undefined-trap-on-error) From ae2e1aad3e48e495878d9f149e437a308bfdaefa Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:34 -0700 Subject: [PATCH 152/158] drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks Adds LKDTM tests for arithmetic overflow (both signed and unsigned), as well as array bounds checking. Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Acked-by: Dmitry Vyukov Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Elena Petrova Cc: "Gustavo A. R. Silva" Link: http://lkml.kernel.org/r/20200227193516.32566-4-keescook@chromium.org Signed-off-by: Linus Torvalds --- drivers/misc/lkdtm/bugs.c | 75 ++++++++++++++++++++++++++++++++++++++ drivers/misc/lkdtm/core.c | 3 ++ drivers/misc/lkdtm/lkdtm.h | 3 ++ 3 files changed, 81 insertions(+) diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c index cc92bc3ed820..886459e0ddd9 100644 --- a/drivers/misc/lkdtm/bugs.c +++ b/drivers/misc/lkdtm/bugs.c @@ -11,6 +11,7 @@ #include #include #include +#include #ifdef CONFIG_X86_32 #include @@ -175,6 +176,80 @@ void lkdtm_HUNG_TASK(void) schedule(); } +volatile unsigned int huge = INT_MAX - 2; +volatile unsigned int ignored; + +void lkdtm_OVERFLOW_SIGNED(void) +{ + int value; + + value = huge; + pr_info("Normal signed addition ...\n"); + value += 1; + ignored = value; + + pr_info("Overflowing signed addition ...\n"); + value += 4; + ignored = value; +} + + +void lkdtm_OVERFLOW_UNSIGNED(void) +{ + unsigned int value; + + value = huge; + pr_info("Normal unsigned addition ...\n"); + value += 1; + ignored = value; + + pr_info("Overflowing unsigned addition ...\n"); + value += 4; + ignored = value; +} + +/* Intentially using old-style flex array definition of 1 byte. */ +struct array_bounds_flex_array { + int one; + int two; + char data[1]; +}; + +struct array_bounds { + int one; + int two; + char data[8]; + int three; +}; + +void lkdtm_ARRAY_BOUNDS(void) +{ + struct array_bounds_flex_array *not_checked; + struct array_bounds *checked; + volatile int i; + + not_checked = kmalloc(sizeof(*not_checked) * 2, GFP_KERNEL); + checked = kmalloc(sizeof(*checked) * 2, GFP_KERNEL); + + pr_info("Array access within bounds ...\n"); + /* For both, touch all bytes in the actual member size. */ + for (i = 0; i < sizeof(checked->data); i++) + checked->data[i] = 'A'; + /* + * For the uninstrumented flex array member, also touch 1 byte + * beyond to verify it is correctly uninstrumented. 
+ */ + for (i = 0; i < sizeof(not_checked->data) + 1; i++) + not_checked->data[i] = 'A'; + + pr_info("Array access beyond bounds ...\n"); + for (i = 0; i < sizeof(checked->data) + 1; i++) + checked->data[i] = 'B'; + + kfree(not_checked); + kfree(checked); +} + void lkdtm_CORRUPT_LIST_ADD(void) { /* diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c index 5ce4ac8c06fc..a5e344df9166 100644 --- a/drivers/misc/lkdtm/core.c +++ b/drivers/misc/lkdtm/core.c @@ -130,6 +130,9 @@ static const struct crashtype crashtypes[] = { CRASHTYPE(HARDLOCKUP), CRASHTYPE(SPINLOCKUP), CRASHTYPE(HUNG_TASK), + CRASHTYPE(OVERFLOW_SIGNED), + CRASHTYPE(OVERFLOW_UNSIGNED), + CRASHTYPE(ARRAY_BOUNDS), CRASHTYPE(EXEC_DATA), CRASHTYPE(EXEC_STACK), CRASHTYPE(EXEC_KMALLOC), diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h index 8d13d0176624..601a2156a0d4 100644 --- a/drivers/misc/lkdtm/lkdtm.h +++ b/drivers/misc/lkdtm/lkdtm.h @@ -22,6 +22,9 @@ void lkdtm_SOFTLOCKUP(void); void lkdtm_HARDLOCKUP(void); void lkdtm_SPINLOCKUP(void); void lkdtm_HUNG_TASK(void); +void lkdtm_OVERFLOW_SIGNED(void); +void lkdtm_OVERFLOW_UNSIGNED(void); +void lkdtm_ARRAY_BOUNDS(void); void lkdtm_CORRUPT_LIST_ADD(void); void lkdtm_CORRUPT_LIST_DEL(void); void lkdtm_CORRUPT_USER_DS(void); From 1d28c8d6d076efdec762026f7192c7cf251c66be Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:38 -0700 Subject: [PATCH 153/158] ubsan: check panic_on_warn Syzkaller expects kernel warnings to panic when the panic_on_warn sysctl is set. More work is needed here to have UBSan reuse the WARN infrastructure, but for now, just check the flag manually. Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Dmitry Vyukov Cc: Elena Petrova Cc: "Gustavo A. R. Silva" Link: https://lore.kernel.org/lkml/CACT4Y+bsLJ-wFx_TaXqax3JByUOWB3uk787LsyMVcfW6JzzGvg@mail.gmail.com Link: http://lkml.kernel.org/r/20200227193516.32566-5-keescook@chromium.org Signed-off-by: Linus Torvalds --- lib/ubsan.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/lib/ubsan.c b/lib/ubsan.c index 7b9b58aee72c..429663eef6a7 100644 --- a/lib/ubsan.c +++ b/lib/ubsan.c @@ -156,6 +156,17 @@ static void ubsan_epilogue(void) "========================================\n"); current->in_ubsan--; + + if (panic_on_warn) { + /* + * This thread may hit another WARN() in the panic path. + * Resetting this prevents additional WARN() from panicking the + * system on this thread. Other threads are blocked by the + * panic_mutex in panic(). + */ + panic_on_warn = 0; + panic("panic_on_warn set ...\n"); + } } static void handle_overflow(struct overflow_data *data, void *lhs, From 1d2252fab9a1c9206d1c981c0c15d5e01e0f5729 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:42 -0700 Subject: [PATCH 154/158] kasan: unset panic_on_warn before calling panic() As done in the full WARN() handler, panic_on_warn needs to be cleared before calling panic() to avoid recursive panics. Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Acked-by: Dmitry Vyukov Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Elena Petrova Cc: "Gustavo A. R. 
Silva" Link: http://lkml.kernel.org/r/20200227193516.32566-6-keescook@chromium.org Signed-off-by: Linus Torvalds --- mm/kasan/report.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index cf5c17d5e361..80f23c9da6b0 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -92,8 +92,16 @@ static void end_report(unsigned long *flags) pr_err("==================================================================\n"); add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); spin_unlock_irqrestore(&report_lock, *flags); - if (panic_on_warn) + if (panic_on_warn) { + /* + * This thread may hit another WARN() in the panic path. + * Resetting this prevents additional WARN() from panicking the + * system on this thread. Other threads are blocked by the + * panic_mutex in panic(). + */ + panic_on_warn = 0; panic("panic_on_warn set ...\n"); + } kasan_enable_current(); } From ef065653e526a020d2a71549f413f14a830db799 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Mon, 6 Apr 2020 20:12:45 -0700 Subject: [PATCH 155/158] ubsan: include bug type in report header When syzbot tries to figure out how to deduplicate bug reports, it prefers seeing a hint about a specific bug type (we can do better than just "UBSAN"). This lifts the handler reason into the UBSAN report line that includes the file path that tripped a check. Unfortunately, UBSAN does not provide function names. Suggested-by: Dmitry Vyukov Signed-off-by: Kees Cook Signed-off-by: Andrew Morton Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Andrey Ryabinin Cc: Ard Biesheuvel Cc: Arnd Bergmann Cc: Dan Carpenter Cc: Elena Petrova Cc: "Gustavo A. R. Silva" Link: http://lkml.kernel.org/r/20200227193516.32566-7-keescook@chromium.org Link: https://lore.kernel.org/lkml/CACT4Y+bsLJ-wFx_TaXqax3JByUOWB3uk787LsyMVcfW6JzzGvg@mail.gmail.com Signed-off-by: Linus Torvalds --- lib/ubsan.c | 36 +++++++++++++++--------------------- 1 file changed, 15 insertions(+), 21 deletions(-) diff --git a/lib/ubsan.c b/lib/ubsan.c index 429663eef6a7..f8c0ccf35f29 100644 --- a/lib/ubsan.c +++ b/lib/ubsan.c @@ -45,13 +45,6 @@ static bool was_reported(struct source_location *location) return test_and_set_bit(REPORTED_BIT, &location->reported); } -static void print_source_location(const char *prefix, - struct source_location *loc) -{ - pr_err("%s %s:%d:%d\n", prefix, loc->file_name, - loc->line & LINE_MASK, loc->column & COLUMN_MASK); -} - static bool suppress_report(struct source_location *loc) { return current->in_ubsan || was_reported(loc); @@ -140,13 +133,14 @@ static void val_to_string(char *str, size_t size, struct type_descriptor *type, } } -static void ubsan_prologue(struct source_location *location) +static void ubsan_prologue(struct source_location *loc, const char *reason) { current->in_ubsan++; pr_err("========================================" "========================================\n"); - print_source_location("UBSAN: Undefined behaviour in", location); + pr_err("UBSAN: %s in %s:%d:%d\n", reason, loc->file_name, + loc->line & LINE_MASK, loc->column & COLUMN_MASK); } static void ubsan_epilogue(void) @@ -180,12 +174,12 @@ static void handle_overflow(struct overflow_data *data, void *lhs, if (suppress_report(&data->location)) return; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, type_is_signed(type) ? 
+ "signed-integer-overflow" : + "unsigned-integer-overflow"); val_to_string(lhs_val_str, sizeof(lhs_val_str), type, lhs); val_to_string(rhs_val_str, sizeof(rhs_val_str), type, rhs); - pr_err("%s integer overflow:\n", - type_is_signed(type) ? "signed" : "unsigned"); pr_err("%s %c %s cannot be represented in type %s\n", lhs_val_str, op, @@ -225,7 +219,7 @@ void __ubsan_handle_negate_overflow(struct overflow_data *data, if (suppress_report(&data->location)) return; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "negation-overflow"); val_to_string(old_val_str, sizeof(old_val_str), data->type, old_val); @@ -245,7 +239,7 @@ void __ubsan_handle_divrem_overflow(struct overflow_data *data, if (suppress_report(&data->location)) return; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "division-overflow"); val_to_string(rhs_val_str, sizeof(rhs_val_str), data->type, rhs); @@ -264,7 +258,7 @@ static void handle_null_ptr_deref(struct type_mismatch_data_common *data) if (suppress_report(data->location)) return; - ubsan_prologue(data->location); + ubsan_prologue(data->location, "null-ptr-deref"); pr_err("%s null pointer of type %s\n", type_check_kinds[data->type_check_kind], @@ -279,7 +273,7 @@ static void handle_misaligned_access(struct type_mismatch_data_common *data, if (suppress_report(data->location)) return; - ubsan_prologue(data->location); + ubsan_prologue(data->location, "misaligned-access"); pr_err("%s misaligned address %p for type %s\n", type_check_kinds[data->type_check_kind], @@ -295,7 +289,7 @@ static void handle_object_size_mismatch(struct type_mismatch_data_common *data, if (suppress_report(data->location)) return; - ubsan_prologue(data->location); + ubsan_prologue(data->location, "object-size-mismatch"); pr_err("%s address %p with insufficient space\n", type_check_kinds[data->type_check_kind], (void *) ptr); @@ -354,7 +348,7 @@ void __ubsan_handle_out_of_bounds(struct out_of_bounds_data *data, void *index) if (suppress_report(&data->location)) return; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "array-index-out-of-bounds"); val_to_string(index_str, sizeof(index_str), data->index_type, index); pr_err("index %s is out of range for type %s\n", index_str, @@ -375,7 +369,7 @@ void __ubsan_handle_shift_out_of_bounds(struct shift_out_of_bounds_data *data, if (suppress_report(&data->location)) goto out; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "shift-out-of-bounds"); val_to_string(rhs_str, sizeof(rhs_str), rhs_type, rhs); val_to_string(lhs_str, sizeof(lhs_str), lhs_type, lhs); @@ -407,7 +401,7 @@ EXPORT_SYMBOL(__ubsan_handle_shift_out_of_bounds); void __ubsan_handle_builtin_unreachable(struct unreachable_data *data) { - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "unreachable"); pr_err("calling __builtin_unreachable()\n"); ubsan_epilogue(); panic("can't return from __builtin_unreachable()"); @@ -422,7 +416,7 @@ void __ubsan_handle_load_invalid_value(struct invalid_value_data *data, if (suppress_report(&data->location)) return; - ubsan_prologue(&data->location); + ubsan_prologue(&data->location, "invalid-load"); val_to_string(val_str, sizeof(val_str), data->type, val); From 29b46fa3dc57def8af3cec1d4e2984b5150be9d5 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Mon, 6 Apr 2020 20:12:49 -0700 Subject: [PATCH 156/158] lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability" s/capabilitiy/capability Signed-off-by: Qiujun Huang Signed-off-by: Andrew Morton Link: 
http://lkml.kernel.org/r/1585818594-27373-1-git-send-email-hqjagain@gmail.com Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ddcf000022ae..50c1f5f08e6f 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1655,7 +1655,7 @@ config FAILSLAB Provide fault-injection capability for kmalloc. config FAIL_PAGE_ALLOC - bool "Fault-injection capabilitiy for alloc_pages()" + bool "Fault-injection capability for alloc_pages()" depends on FAULT_INJECTION help Provide fault-injection capability for alloc_pages(). From 43afe4d3661b04915aa357c5ab0f2a08718af9fd Mon Sep 17 00:00:00 2001 From: Somala Swaraj Date: Mon, 6 Apr 2020 20:12:53 -0700 Subject: [PATCH 157/158] ipc/mqueue.c: fix a brace coding style issue Signed-off-by: somala swaraj Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200301135530.18340-1-somalaswaraj@gmail.com Signed-off-by: Linus Torvalds --- ipc/mqueue.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 49a05ba3000d..dc8307bf2d74 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -239,11 +239,10 @@ static inline void msg_tree_erase(struct posix_msg_tree_node *leaf, info->msg_tree_rightmost = rb_prev(node); rb_erase(node, &info->msg_tree); - if (info->node_cache) { + if (info->node_cache) kfree(leaf); - } else { + else info->node_cache = leaf; - } } static inline struct msg_msg *msg_get(struct mqueue_inode_info *info) From 1cd377baa91844b9f87a2b72eabf7ff783946b5e Mon Sep 17 00:00:00 2001 From: Jason Yan Date: Mon, 6 Apr 2020 20:12:56 -0700 Subject: [PATCH 158/158] ipc/shm.c: make compat_ksys_shmctl() static Fix the following sparse warning: ipc/shm.c:1335:6: warning: symbol 'compat_ksys_shmctl' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Jason Yan Signed-off-by: Andrew Morton Link: http://lkml.kernel.org/r/20200403063933.24785-1-yanaijie@huawei.com Signed-off-by: Linus Torvalds --- ipc/shm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ipc/shm.c b/ipc/shm.c index ce1ca9f7c6e9..0ba6add05b35 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -1332,7 +1332,7 @@ static int copy_compat_shmid_from_user(struct shmid64_ds *out, void __user *buf, } } -long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr, int version) +static long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr, int version) { struct ipc_namespace *ns; struct shmid64_ds sem64;
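For readers unfamiliar with sparse, the warning quoted in the commit message above comes from its check that every external symbol has a visible declaration. A minimal userspace sketch of the same fix, with hypothetical file and function names:

/*
 * Without the "static" below, sparse reports:
 *   warning: symbol 'helper_impl' was not declared. Should it be static?
 * because no header declares helper_impl and no other translation unit
 * can call it. Marking it static makes the file-local intent explicit
 * and lets the compiler drop the exported symbol.
 */
#include <stdio.h>

static long helper_impl(int id, int cmd)
{
	return (long)id + (long)cmd;
}

int main(void)
{
	printf("%ld\n", helper_impl(42, 7));
	return 0;
}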