linux-xiaomi-chiron/include/asm-generic
Peter Xu 679d103319 mm: introduce PTE_MARKER swap entry
Patch series "userfaultfd-wp: Support shmem and hugetlbfs", v8.


Overview
========

Userfaultfd-wp anonymous support was merged two years ago.  Quite a few
applications have since started to leverage this capability, either to
take snapshots of user-app memory or to do fully user-controlled swapping.

This series tries to complete the feature for uffd-wp so as to cover all
the RAM-based memory types.  So far uffd-wp is the only mode missing this
coverage; the other modes (uffd-missing & uffd-minor) already have it.

One major reason to do so is that anonymous pages sometimes do not satisfy
the needs of applications, and a growing number of users rely on shmem or
hugetlbfs, either for sharing (e.g., sharing guest mem between hypervisor
process and device emulation process, shmem local live migration for
upgrades) or for performance on tlb hits.

All this means that if a uffd-wp app wants to switch to any of these
memory types, it'll stop working.  I think it's worthwhile to have the
kernel cover all these cases.

This series chose to protect pages at the pte level, not the page level.

One major reason is safety.  I have no idea how we could make it safe if
any uffd-privileged app can wr-protect a page that any other application
can use.  It would mean this app could potentially block any process for
as long as it wants.

The other reason is that it aligns very well with not only the anonymous
uffd-wp solution, but also uffd as a whole.  For example, userfaultfd is
implemented fundamentally based on VMAs.  We set flags on VMAs showing the
status of uffd tracking.  A per-page protection solution would cross that
VMA-based foundation line, and it could simply end up too far away from
what's called userfaultfd.

PTE markers
===========

The patchset is based on the idea called PTE markers.  It was discussed in
one of the mm alignment sessions, proposed starting from v6, and this is
the 2nd version of the series using the PTE marker idea.

A PTE marker is a new type of swap entry that is only applicable to file
backed memory like shmem and hugetlbfs.  It's used to persist some
pte-level information even if the original present ptes in pgtable are
zapped.

Logically pte markers can store more than uffd-wp information, but so far
only one bit is used, for uffd-wp.  When the pte marker is installed with
the uffd-wp bit set, it means this pte is wr-protected by uffd.

It solves the problem where, e.g., file-backed memory mapped ptes get
zapped for any reason (e.g.  thp split, or swapped out): we can still keep
the wr-protect information in the ptes.  Then when the page fault triggers
again, we'll know this pte is wr-protected so we can treat the pte the
same as a normal uffd wr-protected pte.

The extra information is encoded into the swap entry, or swp_offset to be
explicit, with the swp_type being PTE_MARKER.  So far uffd-wp only uses
one bit out of the swap entry; the remaining bits of swp_offset are still
reserved for other purposes.
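
As a rough userspace sketch of this encoding (not the kernel code itself):
the type number, the 58-bit offset width, the bit position of the uffd-wp
flag and every sketch_* name below are assumptions chosen for illustration
only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: the low 58 bits hold the offset, the type sits above. */
#define SKETCH_SWP_OFFSET_BITS     58
#define SKETCH_SWP_OFFSET_MASK     ((UINT64_C(1) << SKETCH_SWP_OFFSET_BITS) - 1)
#define SKETCH_SWP_TYPE_MASK       UINT64_C(0x1f)

/* Hypothetical swap type number reserved for PTE markers. */
#define SKETCH_SWP_PTE_MARKER      30u

/* The single marker bit used so far; its position here is an assumption. */
#define SKETCH_PTE_MARKER_UFFD_WP  (UINT64_C(1) << 0)

typedef struct { uint64_t val; } sketch_swp_entry_t;

/* Pack a type and an offset into one entry value. */
static inline sketch_swp_entry_t sketch_swp_entry(unsigned int type,
						  uint64_t offset)
{
	sketch_swp_entry_t e;

	e.val = ((uint64_t)type << SKETCH_SWP_OFFSET_BITS) |
		(offset & SKETCH_SWP_OFFSET_MASK);
	return e;
}

static inline unsigned int sketch_swp_type(sketch_swp_entry_t e)
{
	return (unsigned int)((e.val >> SKETCH_SWP_OFFSET_BITS) &
			      SKETCH_SWP_TYPE_MASK);
}

static inline uint64_t sketch_swp_offset(sketch_swp_entry_t e)
{
	return e.val & SKETCH_SWP_OFFSET_MASK;
}

/* A marker entry has no PFN or block offset: the offset is marker bits. */
static inline sketch_swp_entry_t sketch_make_pte_marker_entry(uint64_t marker)
{
	return sketch_swp_entry(SKETCH_SWP_PTE_MARKER, marker);
}

static inline bool sketch_is_pte_marker_entry(sketch_swp_entry_t e)
{
	return sketch_swp_type(e) == SKETCH_SWP_PTE_MARKER;
}

static inline uint64_t sketch_pte_marker_get(sketch_swp_entry_t e)
{
	return sketch_swp_offset(e);
}
```

In this model, sketch_make_pte_marker_entry(SKETCH_PTE_MARKER_UFFD_WP)
yields an entry whose type identifies it as a marker and whose offset
carries only the uffd-wp bit.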

There're two configs to enable/disable PTE markers:

  CONFIG_PTE_MARKER
  CONFIG_PTE_MARKER_UFFD_WP

We can set !PTE_MARKER to completely disable all the PTE markers, along
with uffd-wp support.  I made two configs so we can also enable PTE markers
for other purposes while disabling file-backed uffd-wp.  At the end of the
current series, I'll enable CONFIG_PTE_MARKER by default, but that patch is
standalone and if anyone worries about having it by default, we can also
consider turning it off by dropping that one-liner patch.  So far I don't
see a huge risk of doing so, so I kept that patch.

In most cases, PTE markers should be treated as none ptes.  This is
because, unlike most of the other swap entry types, there's no PFN or
block offset information encoded into PTE markers, only some extra
well-defined bits showing the status of the pte.  These bits should only
be used as extra data when servicing an upcoming page fault, and then we
behave as if it's a none pte.

I did spend a lot of time observing all the pte_none() users this time.
It is indeed a challenge because there are a lot, and I hope I didn't miss
a single one of them where we should take care of pte markers.  Luckily, I
don't think it'll need to be considered in many cases, for example: boot
code, arch code (especially non-x86), kernel-only page handling (e.g.
CPA), or device driver code when we're dealing with pure PFN mappings.

I introduced pte_none_mostly() in this series for the cases where we need
to handle pte markers the same as none ptes; "mostly" is another way to
write "either none pte or a pte marker".

I didn't replace pte_none() to cover pte markers for the reasons below:

  - Very few pte_none() callers need to handle pte markers.  E.g., all
    the kernel pages do not require knowledge of pte markers.  So we don't
    pollute the major use cases.

  - Unconditionally changing pte_none() semantics could confuse people,
    because pte_none() has existed for so long.

  - Unconditionally changing pte_none() semantics could make pte_none()
    slower even though in many cases pte markers do not exist.

  - There are cases where we'd like to handle pte markers differently from
    pte_none(), so a full replacement is also impossible.  E.g. khugepaged
    should still treat pte markers as normal swap ptes rather than none
    ptes, because pte markers will always need a fault-in to merge the
    marker with a valid pte.  Or the smap code will need to parse PTE
    markers, not none ptes.
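
The relationship between pte_none() and pte_none_mostly() described above
can be sketched with a toy model.  The enum, the struct and all sketch_*
names are hypothetical; the real pte_t is architecture specific and the
real helpers operate on it.

```c
#include <stdbool.h>

/* Toy model of what a pte slot can hold. */
enum sketch_pte_kind {
	SKETCH_PTE_NONE,	/* empty slot */
	SKETCH_PTE_PRESENT,	/* maps a page */
	SKETCH_PTE_SWAP,	/* ordinary swap entry */
	SKETCH_PTE_MARKER,	/* PTE marker swap entry */
};

struct sketch_pte {
	enum sketch_pte_kind kind;
};

/* Unchanged semantics: true only for a genuinely empty pte. */
static inline bool sketch_pte_none(struct sketch_pte pte)
{
	return pte.kind == SKETCH_PTE_NONE;
}

static inline bool sketch_is_pte_marker(struct sketch_pte pte)
{
	return pte.kind == SKETCH_PTE_MARKER;
}

/* "Mostly" none: either a none pte or a pte marker. */
static inline bool sketch_pte_none_mostly(struct sketch_pte pte)
{
	return sketch_pte_none(pte) || sketch_is_pte_marker(pte);
}
```

Callers that never see markers keep using the cheaper none check, while
callers that want markers to look like swap ptes (the khugepaged case
above) simply do not switch to the "mostly" variant.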

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling in the rest of mm for file-backed uffd-wp:

  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs


This patch (of 23):

Introduces a new swap entry type called PTE_MARKER.  It can be installed
for any pte that maps file-backed memory when the pte is temporarily
zapped, so as to maintain per-pte information.

The information kept in the pte is called a "marker".  Here we define the
marker as "unsigned long" just to match pgoff_t; however, it will only
work if it still fits in swp_offset(), which is e.g. currently 58 bits on
x86_64.
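
The size constraint can be illustrated with a small check.  This is only a
sketch: the 58-bit width is the x86_64 figure quoted above, and the macro
and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of offset bits assumed available for the marker (x86_64 figure). */
#define SKETCH_MARKER_MAX_BITS	58

/* True if no bit at or above SKETCH_MARKER_MAX_BITS is set in the marker. */
static inline bool sketch_marker_fits(uint64_t marker)
{
	return (marker >> SKETCH_MARKER_MAX_BITS) == 0;
}
```

A marker that fails this check would lose its high bits once packed into
the offset field of the swap entry.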

A new config CONFIG_PTE_MARKER is introduced too; it's off by default.  A
bunch of helpers are defined alongside it to service the rest of the pte
marker code.

[peterx@redhat.com: fixup]
  Link: https://lkml.kernel.org/r/Yk2rdB7SXZf+2BDF@xz-m1.local
Link: https://lkml.kernel.org/r/20220405014646.13522-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220405014646.13522-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:09 -07:00
bitops asm-generic/bitops: Always inline all bit manipulation helpers 2022-01-25 22:30:28 +01:00
vdso lib/vdso: Avoid highres update if clocksource is not VDSO capable 2020-02-17 20:12:17 +01:00
access_ok.h uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
asm-offsets.h
asm-prototypes.h
atomic.h locking/atomic: delete !ARCH_ATOMIC remnants 2021-05-26 13:20:52 +02:00
atomic64.h locking/atomic: delete !ARCH_ATOMIC remnants 2021-05-26 13:20:52 +02:00
audit_change_attr.h
audit_dir_write.h
audit_read.h
audit_signal.h
audit_write.h
barrier.h arm64 fixes/cleanups: 2022-01-22 09:22:10 +02:00
bitops.h include: move find.h from asm_generic to linux 2022-01-15 08:47:31 -08:00
bitsperlong.h lib: extend the scope of small_const_nbits() macro 2021-05-06 19:24:11 -07:00
bug.h Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
bugs.h
cache.h
cacheflush.h Add linux/cacheflush.h 2021-11-17 10:36:15 -05:00
checksum.h unify generic instances of csum_partial_copy_nocheck() 2020-08-20 15:45:14 -04:00
cmpxchg-local.h locking/atomic: cmpxchg: make generic a prefix 2021-05-26 13:20:50 +02:00
cmpxchg.h locking/atomic: delete !ARCH_ATOMIC remnants 2021-05-26 13:20:52 +02:00
compat.h compat: make linux/compat.h available everywhere 2021-07-23 14:20:24 +01:00
current.h
delay.h
device.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428 2019-06-05 17:37:16 +02:00
div64.h ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK() 2021-08-20 11:39:28 +01:00
dma-mapping.h
dma.h
early_ioremap.h mm/early_ioremap.c: remove redundant early_ioremap_shutdown() 2021-09-08 11:50:24 -07:00
emergency-restart.h
error-injection.h asm-generic/error-injection.h: fix a spelling mistake, and a coding style issue 2021-12-17 14:12:14 +01:00
exec.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
export.h asm-generic: export: Stub EXPORT_SYMBOL with __DISABLE_EXPORTS 2021-02-03 16:42:57 +00:00
extable.h
fb.h
fixmap.h mm: introduce common STRUCT_PAGE_MAX_SHIFT define 2018-12-14 15:05:45 -08:00
flat.h binfmt_flat: remove the persistent argument from flat_get_addr_from_rp 2019-06-24 09:16:47 +10:00
ftrace.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
futex.h futex: Fix additional regressions 2021-12-11 23:31:51 +01:00
getorder.h asm-generic: force inlining of get_order() to work around gcc10 poor decision 2020-12-15 22:46:15 -08:00
gpio.h gpio: Avoid kernel.h inclusion where it's possible 2020-02-10 12:58:36 +01:00
hardirq.h irqstat: Move declaration into asm-generic/hardirq.h 2020-11-23 10:31:06 +01:00
hugetlb.h mm: introduce PTE_MARKER swap entry 2022-05-13 07:20:09 -07:00
hw_irq.h
hyperv-tlfs.h KVM: x86: Add checks for reserved-to-zero Hyper-V hypercall fields 2022-02-10 13:50:36 -05:00
ide_iops.h
int-ll64.h
io.h asm-generic: build fixes for v5.15 2021-10-08 11:57:54 -07:00
ioctl.h
iomap.h parisc: Declare pci_iounmap() parisc version only when CONFIG_PCI enabled 2021-09-19 10:36:09 -07:00
irq.h
irq_regs.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
irq_work.h
irqflags.h
Kbuild Rework of the X86 irq stack handling: 2021-02-24 16:32:23 -08:00
kdebug.h
kmap_size.h mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL 2020-11-24 14:42:08 +01:00
kprobes.h treewide: Convert macro and uses of __section(foo) to __section("foo") 2020-10-25 14:51:49 -07:00
kvm_para.h
kvm_types.h KVM: Move x86's version of struct kvm_mmu_memory_cache to common code 2020-07-09 13:29:42 -04:00
linkage.h
local.h
local64.h
logic_io.h logic_io instance of iounmap() needs volatile on argument 2021-12-21 21:31:08 +01:00
mcs_spinlock.h
memory_model.h mm: remove CONFIG_DISCONTIGMEM 2021-06-29 10:53:55 -07:00
mm_hooks.h mm: remove arch_bprm_mm_init() hook 2020-01-23 10:41:16 -08:00
mmiowb.h asm-generic/mmiowb: Allow mmiowb_set_pending() when preemptible() 2020-07-17 10:02:03 +01:00
mmiowb_types.h asm-generic/mmiowb: Add generic implementation of mmiowb() tracking 2019-04-08 11:59:39 +01:00
mmu.h
mmu_context.h asm-generic: add generic MMU versions of mmu context functions 2020-10-26 16:45:03 +01:00
module.h
module.lds.h kbuild: preprocess module linker script 2020-09-25 00:36:41 +09:00
mshyperv.h Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device 2022-03-29 12:12:50 +00:00
msi.h Generic interrupt and irqchips subsystem: 2020-12-15 15:03:31 -08:00
nommu_context.h asm-generic: add generic MMU versions of mmu context functions 2020-10-26 16:45:03 +01:00
numa.h numa: Move numa implementation to common code 2021-01-14 15:08:55 -08:00
page.h c6x: remove architecture 2021-01-20 09:30:45 +01:00
param.h
parport.h
pci.h
pci_iomap.h parisc: Declare pci_iounmap() parisc version only when CONFIG_PCI enabled 2021-09-19 10:36:09 -07:00
percpu.h asm-generic: percpu: avoid Wshadow warning 2020-10-26 23:54:48 +00:00
pgalloc.h asm-generic: Prepare for riscv use of pud_alloc_one and pud_free 2022-01-19 17:54:08 -08:00
pgtable-nop4d.h mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t * 2021-07-08 11:48:22 -07:00
pgtable-nopmd.h mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t * 2021-07-08 11:48:22 -07:00
pgtable-nopud.h mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t * 2021-07-08 11:48:22 -07:00
pgtable_uffd.h userfaultfd: wp: add pmd_swp_*uffd_wp() helpers 2020-04-07 10:43:39 -07:00
preempt.h sched/core: Initialize the idle task with preemption disabled 2021-05-12 13:01:45 +02:00
qrwlock.h locking/arch: Move qrwlock.h include after qspinlock.h 2021-02-11 07:59:54 -05:00
qrwlock_types.h
qspinlock.h qspinlock: use signed temporaries for cmpxchg 2020-10-26 20:19:48 +01:00
qspinlock_types.h locking/qspinlock: Do not include atomic.h from qspinlock_types.h 2020-07-29 16:14:19 +02:00
resource.h
rwonce.h asm/rwonce: Don't pull <asm/barrier.h> into 'asm-generic/rwonce.h' 2020-07-21 10:50:36 +01:00
seccomp.h seccomp: Use -1 marker for end of mode 1 syscall list 2020-07-10 16:01:52 -07:00
sections.h asm-generic: Refactor dereference_[kernel]_function_descriptor() 2022-02-16 23:25:11 +11:00
serial.h
set_memory.h
shmparam.h treewide: remove SPDX "WITH Linux-syscall-note" from kernel-space headers 2019-05-14 19:52:48 -07:00
signal.h
simd.h
softirq_stack.h softirq: Move do_softirq_own_stack() to generic asm header 2021-02-10 23:34:16 +01:00
spinlock.h
statfs.h
string.h
switch_to.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 36 2019-05-24 17:27:11 +02:00
syscall.h ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h 2022-03-10 13:35:08 -06:00
syscalls.h
termios-base.h
termios.h
timex.h
tlb.h mm/mprotect: do not flush when not required architecturally 2022-05-13 07:20:05 -07:00
tlbflush.h
topology.h mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA 2021-06-29 10:53:55 -07:00
trace_clock.h
uaccess.h uaccess: remove CONFIG_SET_FS 2022-02-25 09:36:06 +01:00
unaligned.h asm-generic: fix __get_unaligned_be48() on 32 bit platforms 2022-04-12 16:31:38 -06:00
user.h
vermagic.h arch: split MODULE_ARCH_VERMAGIC definitions out to <asm/vermagic.h> 2020-04-23 10:50:26 +09:00
vga.h
vmlinux.lds.h Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a 2022-03-27 10:17:23 -07:00
vtime.h
word-at-a-time.h
xor.h lib/xor: make xor prototypes more friendly to compiler vectorization 2022-02-11 20:39:39 +11:00