Commit graph

996159 commits

Author SHA1 Message Date
Nathan Chancellor
e82d891a63 MAINTAINERS: add a couple more files to the Clang/LLVM section
The K: entry should ensure that Nick and I always get CC'd on patches that
touch these files but it is better to be explicit rather than implicit.

Link: https://lkml.kernel.org/r/20210114004059.2129921-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:53 -08:00
Xiaoming Ni
697edcb0e4 proc_sysctl: fix oops caused by incorrect command parameters
The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val).  If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.

For example:
  "hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
  triggered. The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:53 -08:00
Thomas Gleixner
785025820a powerpc/mm/highmem: use __set_pte_at() for kmap_local()
The original PowerPC highmem mapping function used __set_pte_at() to
denote that the mapping is per CPU.  This got lost with the conversion
to the generic implementation.

Override the default map function.

Link: https://lkml.kernel.org/r/20210112170411.281464308@linutronix.de
Fixes: 47da42b27a ("powerpc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:53 -08:00
Thomas Gleixner
8c0d5d78f3 mips/mm/highmem: use set_pte() for kmap_local()
set_pte_at() on MIPS invokes update_cache() which might recurse into
kmap_local().

Use set_pte() like the original MIPS highmem implementation did.

Link: https://lkml.kernel.org/r/20210112170411.187513575@linutronix.de
Fixes: a4c33e83bc ("mips/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Paul Cercueil <paul@crapouillou.net>
Reported-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Thomas Gleixner
a1dce7fd2a mm/highmem: prepare for overriding set_pte_at()
The generic kmap_local() map function uses set_pte_at(), but MIPS requires
set_pte() and PowerPC wants __set_pte_at().

Provide arch_kmap_local_set_pte() and default it to set_pte_at().

Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Thomas Gleixner
f99e02372a sparc/mm/highmem: flush cache and TLB
Patch series "mm/highmem: Fix fallout from generic kmap_local
conversions".

The kmap_local conversion wreckaged sparc, mips and powerpc as it missed
some of the details in the original implementation.

This patch (of 4):

The recent conversion to the generic kmap_local infrastructure failed to
assign the proper pre/post map/unmap flush operations for sparc.

Sparc requires cache flush before map/unmap and tlb flush afterwards.

Link: https://lkml.kernel.org/r/20210112170136.078559026@linutronix.de
Link: https://lkml.kernel.org/r/20210112170410.905976187@linutronix.de
Fixes: 3293efa978 ("sparc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Dan Williams
dad4e5b390 mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.

Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a613 ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Arnd Bergmann
251b5497c5 ubsan: disable unsigned-overflow check for i386
Building ubsan kernels even for compile-testing introduced these
warnings in my randconfig environment:

  crypto/blake2b_generic.c:98:13: error: stack frame size of 9636 bytes in function 'blake2b_compress' [-Werror,-Wframe-larger-than=]
  static void blake2b_compress(struct blake2b_state *S,

  crypto/sha512_generic.c:151:13: error: stack frame size of 1292 bytes in function 'sha512_generic_block_fn' [-Werror,-Wframe-larger-than=]
  static void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src,

  lib/crypto/curve25519-fiat32.c:312:22: error: stack frame size of 2180 bytes in function 'fe_mul_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])

  lib/crypto/curve25519-fiat32.c:444:22: error: stack frame size of 1588 bytes in function 'fe_sqr_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_sqr_impl(u32 out[10], const u32 in1[10])

Further testing showed that this is caused by
-fsanitize=unsigned-integer-overflow, but is isolated to the 32-bit x86
architecture.

The one in blake2b immediately overflows the 8KB stack area
architectures, so better ensure this never happens by disabling the
option for 32-bit x86.

Link: https://lkml.kernel.org/r/20210112202922.2454435-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20201230154749.746641-1-arnd@kernel.org/
Fixes: d0a3ac549f ("ubsan: enable for all*config builds")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Marco Elver <elver@google.com>
Cc: George Popescu <georgepope@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Andrey Konovalov
acb35b177c kasan, mm: fix resetting page_alloc tags for HW_TAGS
A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.

That commit did reset page tags before the metadata access, but didn't
restore them after.  As the result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.

Fix this by recovering the tag after the metadata access.

Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com
Fixes: aa1ef4d7b3 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Andrey Konovalov
ce5716c618 kasan, mm: fix conflicts with init_on_alloc/free
A few places where SLUB accesses object's data or metadata were missed
in a previous patch.  This leads to false positives with hardware
tag-based KASAN when bulk allocations are used with init_on_alloc/free.

Fix the false-positives by resetting pointer tags during these accesses.

(The kasan_reset_tag call is removed from slab_alloc_node, as it's added
 into maybe_wipe_obj_freeptr.)

Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901
Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com
Fixes: aa1ef4d7b3 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Andrey Konovalov
76bc99e81a kasan: fix HW_TAGS boot parameters
The initially proposed KASAN command line parameters are redundant.

This change drops the complex "kasan.mode=off/prod/full" parameter and
adds a simpler kill switch "kasan=off/on" instead.  The new parameter
together with the already existing ones provides a cleaner way to
express the same set of features.

The full set of parameters with this change:

  kasan=off/on             - whether KASAN is enabled
  kasan.fault=report/panic - whether to only print a report or also panic
  kasan.stacktrace=off/on  - whether to collect alloc/free stack traces

Default values:

  kasan=on
  kasan.fault=report
  kasan.stacktrace=on  (if CONFIG_DEBUG_KERNEL=y)
  kasan.stacktrace=off (otherwise)

Link: https://linux-review.googlesource.com/id/Ib3694ed90b1e8ccac6cf77dfd301847af4aba7b8
Link: https://lkml.kernel.org/r/4e9c4a4bdcadc168317deb2419144582a9be6e61.1610736745.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Lecopzer Chen
5dabd1712c kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan_remove_zero_shadow() shall use original virtual address, start and
size, instead of shadow address.

Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com
Fixes: 0207df4fa1 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Lecopzer Chen
a11a496ee6 kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
During testing kasan_populate_early_shadow and kasan_remove_zero_shadow,
if the shadow start and end address in kasan_remove_zero_shadow() is not
aligned to PMD_SIZE, the remain unaligned PTE won't be removed.

In the test case for kasan_remove_zero_shadow():

    shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000

    3-level page table:
      PUD_SIZE: 0x40000000 PMD_SIZE: 0x200000 PAGE_SIZE: 4K

0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in
kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next
address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE.

In the correct condition, this should fallback to the next level
kasan_remove_pmd_table() but the condition flow always continue to skip
the unaligned part.

Fix by correcting the condition when next and addr are neither aligned.

Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com
Fixes: 0207df4fa1 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:52 -08:00
Linus Torvalds
e68061375f - Fix a kernel panic in mips-cpu due to invalid irq domain hierarchy.
- Fix to not lose IPIs on bcm2836.
 
  - Fix for a bogus marking of ITS devices as shared due to unitialized
    stack variable.
 
  - Clear a phantom interrupt on qcom-pdc to unblock suspend.
 
  - Small cleanups, warning and build fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANZ4kACgkQEsHwGGHe
 VUohthAAtkwq9kXmn1H4hQuay8Zly7L5D7dn6EUrVl+347cZi+DZmjYRv0Mu/5Ad
 L7KhLHDJ6XqGW4qsPxGkYJPgaWiXyo+C0OQXLdwFcrT0kOvjbOYoFCAarRkqIdQb
 5jr6J6t1MWb5Ktb8xTebUXv5kn91rvYj8ArcdlaVB8Kdp6jB8Irq9AROjZMK3Seh
 AHC9jWnf3tGPUXRhNk0rpBeDzdkI6G82KjO2LdZMT2SC9WLX2vH710wBeY8+Um42
 5rseckupLl7KJb02WpK0IDi/6bidWc9K3NsOh9QqjQme4O9a0459q7f6SoR8/PYi
 FWwB95NwGWahnVk26qfLwiV/Sd3RKM72uQxICu6XaI0+asrbOflIuK1NbEAbpf+S
 ZgtVqV3c8i09ij3NsJIWsZ5vjwtFyRUU26JgTxzWKzRf4FftSW0pHWr/VvMflXhK
 l0E68kvB4KmV7Bd5r3gavctWVFeMrZhTGAHxiiG91oISUpp+X634CCa4NyMh+Ryr
 6V+LhIADlzNOuyzHsG8xpsUmtSwutcsHSFQtqUkcXpG/1o2OpSPmhYQRySD4lz+4
 dpo3CFtvrBUWRAHo2bsaLorqU0LlBFc+h9sTUlYMLgyRKkOQtde1UN1lTIu7z0Ll
 GFqAYmttyjSlDKo/5MeBQeWp4Oq8d+Ckz9xkNIL4qJV0LLTUvMs=
 =Z32e
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a kernel panic in mips-cpu due to invalid irq domain hierarchy.

 - Fix to not lose IPIs on bcm2836.

 - Fix for a bogus marking of ITS devices as shared due to unitialized
   stack variable.

 - Clear a phantom interrupt on qcom-pdc to unblock suspend.

 - Small cleanups, warning and build fixes.

* tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export irq_check_status_bit()
  irqchip/mips-cpu: Set IPI domain parent chip
  irqchip/pruss: Simplify the TI_PRUSS_INTC Kconfig
  irqchip/loongson-liointc: Fix build warnings
  driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  irqchip/bcm2836: Fix IPI acknowledgement after conversion to handle_percpu_devid_irq
  irqchip/irq-sl28cpld: Convert comma to semicolon
  genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()
2021-01-24 10:24:20 -08:00
Linus Torvalds
32d43270ca - Adjust objtool to handle a recent binutils change to not generate unused
symbols anymore.
 
  - Revert the fail-the-build-on-fatal-errors objtool strategy for now due to the
  ever-increasing matrix of supported toolchains/plugins and them causing too
  many such fatal errors currently.
 
  - Do not add empty symbols to objdump's rbtree to accommodate clang removing
  section symbols.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANZFcACgkQEsHwGGHe
 VUp0jg//UtL0PMCun6DLVf8jTLwtzlCb25/FFLyF6RpPQvt+Zpda89ZqG9R81pXm
 lMMc3L+2YBddVG2pnUC6jyWwZIpx+M5D0ZKa7AKY6K5o7tS/9BtCPrWwmuf+6TDi
 6meLWy0hDOxSS5YifwH7LR8aj57SfsHqNfO4LF2ml857MX31Wwr/x5yryWPqho1g
 8v4sK+cAesu8m7leVAVwbdSEiqEP9NMQxR3Te/4+aT3Xyqc/+EPttFJ30564/gaF
 Zes1CqmUB7G9l8c9igvvCNqZyYyy8OoPp/UjW6NTu7soYhutsWkz/28deiW9WGks
 sKiJ5E/lEIimuORj0M++85CZcS4SaH4MRbfXi2F4BisGFS8c7CSwH3457WAnCuOf
 FZ/kaVYN2CP9DWbBQI032hdUkWScEzF2racNQ6uXghlhLFQE3l/sBPS2WsHzKhnF
 Jtwt1fGEHAaC3LeI2AenmSLX8q/5chZMTByp5z7Lq97SSsrk65N5uFSuSaU6cK5G
 0WDA5r6v0dy6szqzLe7mzaKdx2MOK6ygLisGBJDtYr7FZ5szHXz6e2vgJUEB82WB
 H21/mGf0Rl4GhHIH76W14imEWlQscp2wIEXJ8RWgoytRkWgHxBe59dMrwTB0s4jB
 R/KPlwA1eUzEaTBbARGMRGNv+Te/Z3QMYAXLlq1BZW3EKuw4L84=
 =FWZ1
 -----END PGP SIGNATURE-----

Merge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fixes from Borislav Petkov:

 - Adjust objtool to handle a recent binutils change to not generate
   unused symbols anymore.

 - Revert the fail-the-build-on-fatal-errors objtool strategy for now
   due to the ever-increasing matrix of supported toolchains/plugins and
   them causing too many such fatal errors currently.

 - Do not add empty symbols to objdump's rbtree to accommodate clang
   removing section symbols.

* tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Don't fail on missing symbol table
  objtool: Don't fail the kernel build on fatal errors
  objtool: Don't add empty symbols to the rbtree
2021-01-24 10:17:03 -08:00
Linus Torvalds
24c56ee06c - Correct the marking of kthreads which are supposed to run on a specific,
single CPU vs such which are affine to only one CPU, mark per-cpu workqueue
  threads as such and make sure that marking "survives" CPU hotplug. Fix CPU
  hotplug issues with such kthreads.
 
  - A fix to not push away tasks on CPUs coming online.
 
  - Have workqueue CPU hotplug code use cpu_possible_mask when breaking affinity
    on CPU offlining so that pending workers can finish on newly arrived onlined
    CPUs too.
 
  - Dump tasks which haven't vacated a CPU which is currently being unplugged.
 
  - Register a special scale invariance callback which gets called on resume
  from RAM to read out APERF/MPERF after resume and thus make the schedutil
  scaling governor more precise.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANYCAACgkQEsHwGGHe
 VUo+OBAAjfqkijDlXiGX6lrT5gRx5NZICpeMgbWa7J13XHT1ysD/b0fMGFIUyF6k
 aszDLTl8U/S1/qGAYlzTSPAFcdZ+ENiFqQ48ozMk4jZC3p0quHTjs/PdiSG6kYBi
 +e4smht+bSyLKxsG8hN0kJ+mLEd+uIQ13kP4YkxPgWbJ9WNP/U6HHGBo0rBchtSe
 Kn6bdd8CfwmC6rSazp7kdQoFoWeQaoMI1ODX3VphK1GtL1wq8WSICzRhpg3caeyG
 3lCIddoNW9mCA9Nkc6R6HeV3uW9JGkPAjnmtTIEHDbg9pib7xNT978ieTQuqNDCi
 DlAHDGumzoaiVJZhD/1fj/RXMJr2YUHxtrXWNsXpiKJ9g8Tn+WC0UW/4+Mx2L/km
 0RSoXJlMs1fGopS2I/fObZ6RPhmg4D+gJsMCdaHQzX4NgxZAGhNNPxMckZ0IM8A0
 2NNXSHUZHVTHeJEW0E/glOcpWb5hG+vDwiBMNEWfTwYpTfrw2EEOZaKniZE7WlSL
 4ItM9rkLGl1KToJzAH4A0oUtSy3vtSCo8B1noGlc09Lj+oCIBlr81z9+C79a2oxG
 qE7Xd4X7y7Qs3JeCbRZWQa7/2Kf1v4XnjELrJJeCZC85r0ZqJDwRX8w7lkmW2XPU
 m4J2prr/DDZSqrRh23/xC1fsU+vcBKSfKUFKAH4Lg2VIaUfSUEk=
 =2DAF
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Correct the marking of kthreads which are supposed to run on a
   specific, single CPU vs such which are affine to only one CPU, mark
   per-cpu workqueue threads as such and make sure that marking
   "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.

 - A fix to not push away tasks on CPUs coming online.

 - Have workqueue CPU hotplug code use cpu_possible_mask when breaking
   affinity on CPU offlining so that pending workers can finish on newly
   arrived onlined CPUs too.

 - Dump tasks which haven't vacated a CPU which is currently being
   unplugged.

 - Register a special scale invariance callback which gets called on
   resume from RAM to read out APERF/MPERF after resume and thus make
   the schedutil scaling governor more precise.

* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Relax the set_cpus_allowed_ptr() semantics
  sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  sched: Prepare to use balance_push in ttwu()
  workqueue: Restrict affinity change to rescuer
  workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  kthread: Extract KTHREAD_IS_PER_CPU
  sched: Don't run cpu-online with balance_push() enabled
  workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  sched/core: Print out straggler tasks in sched_cpu_dying()
  x86: PM: Register syscore_ops for scale invariance
2021-01-24 10:09:20 -08:00
Linus Torvalds
025929f468 - Fix an integer overflow in the NTP RTC synchronization which lead to latter
happening every 2 seconds instead of the intended every 11 minutes.
 
  - Get rid of now unused get_seconds().
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANVSEACgkQEsHwGGHe
 VUr6pQ/+MLoKCld4bR4rXbzD2TSKaIaHt7BuYCOcE4Z3PyxpfYFdX0iJVn8J277X
 VAvbHOEJLFijGylpCD5vcMnK018GaW5DTvPgb7ijOEgdzHzqOoC/a30WMJWtnBVo
 dKrtEgbQPgB8eSPuQah5RSWRd6lrywJp79WNbB4k7yqsTVvQLQmbKtjc656jCvCI
 PKpc7zT/UZcChue6zZUtSXnLPBOM1aog0/7pXbViZc2F8Lv3ulIZy27o9sXSetu2
 bsdZzexKEY/Thqx1LiEOt5PhQm5gKSBsBWuS+wPIyI0V0E5eb0rnAlTLyx7hE3yL
 9R9axXPhGH5gRM05ikYaF2k4VscImgGLfT18blIb4jWSZ/C+HCtlyfJmcx5VgG4Z
 F4ZC5esiUD9vM++cdrw8Xeu/QmP+yHBrcspKshjK/NcaEnTm7DyhtyGbFzdEglti
 NpnNyhrX6EX1AY9RoVpLS3delytbdYABK2BXUBdVpLb034BG5f9p2Z6D1hsmzqE+
 rOLxTS0d34enfFglwxCjUkwrO/weEvdpCeqcnm0TyLtq6BGoGGbkKcf510MQ1CB+
 UqZxS7WIeYF1IIrm4YVD/ONiH6O1hn/QK6AmHFivgW5fKxvZXIJG850+kYX1JNca
 LUfF43xi6OfgGdllUWaPsi6fRlkvP+YVd7y37d9pSC7QStdysLQ=
 =fFOL
 -----END PGP SIGNATURE-----

Merge tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Fix an integer overflow in the NTP RTC synchronization which led to
   the latter happening every 2 seconds instead of the intended every 11
   minutes.

 - Get rid of now unused get_seconds().

* tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp: Fix RTC synchronization on 32-bit platforms
  timekeeping: Remove unused get_seconds()
2021-01-24 09:58:38 -08:00
Linus Torvalds
17b6c49da3 - Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when the FPU
    is used in-kernel and fix a boot crash on K7 due to early MXCSR access before
    CR4.OSFXSR is even set.
 
  - A couple of noinstr annotation fixes
 
  - Correct die ID setting on AMD for users of topology information which need
    the correct die ID
 
  - A SEV-ES fix to handle string port IO to/from kernel memory properly
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmANUr0ACgkQEsHwGGHe
 VUos4hAAlBik/z+y+DaZGJyxtpST2YQaEbwbW3UMqyLsdVnLTTRnKzC1T+fEfD2Q
 SxtCPYH5iuPbCgOOoQboWt6Aa53JlX9bRBZ/87Ub/ELJ9NgMxMQFXAiaDZAAY6Zy
 L2B13KpoGOifPjrGDgksnafyqYv1CYesiArfOffHgvC3/0j7ONdda2SRDQ697TBw
 FSV/WfUjCo0+JdXRRaP6YH5t9MxFerHxVH38xTDFwXikS9CVyddosLo5EP2wAQvi
 5+160i2jB25vyMEsFBr5wE0xDpWLUdClVpzHXXPG2i0P+NHATiBcreTMPzeYOUXu
 Hfc/y4ukOVDoMGlHLNKHq89alI87soMJIEjm2sAG1ZIypKyMJw7YUXQNRR3TcP0U
 c7/C3W1mCWD1+8nLtlIMM0Z20DacQOf9YWko95+uh08+S52KpTOgnx+mpoZjK1PQ
 Wv9HxPJKycrgRNhfverN5FSiOEW/DdvqNfVHTjuuzNLyKdM1NoZ/YTIyABk4RfFq
 USUnC5rk4GqvCYdaLTEKkAJvLCmRKgVYd75Rc4/pPKILS6kv82vpj3BjClBaH0h1
 yrvpafvXzOhwKP/J5q0vm57NJdqPZwuW4Ah+74tptmQL4rga84U4FOs3JpNJq0uu
 1mj6xSFD8ZyI11BSkYbZAHTy1eNERze+azftCSPq/6EifYvqnsE=
 =3rZM
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a new Intel model number for Alder Lake

 - Differentiate which aspects of the FPU state get saved/restored when
   the FPU is used in-kernel and fix a boot crash on K7 due to early
   MXCSR access before CR4.OSFXSR is even set.

 - A couple of noinstr annotation fixes

 - Correct die ID setting on AMD for users of topology information which
   need the correct die ID

 - A SEV-ES fix to handle string port IO to/from kernel memory properly

* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add another Alder Lake CPU to the Intel family
  x86/mmx: Use KFPU_387 for MMX string operations
  x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
  x86/topology: Make __max_die_per_package available unconditionally
  x86: __always_inline __{rd,wr}msr()
  x86/mce: Remove explicit/superfluous tracing
  locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
  locking/lockdep: Cure noinstr fail
  x86/sev: Fix nonistr violation
  x86/entry: Fix noinstr fail
  x86/cpu/amd: Set __max_die_per_package on AMD
  x86/sev-es: Handle string port IO to kernel memory properly
2021-01-24 09:46:05 -08:00
Linus Torvalds
14c50a6618 powerpc fixes for 5.11 #5
Fix a bad interaction between the scv handling and the fallback L1D flush, which
 could lead to user register corruption. Only affects people using scv (~no one)
 on machines with old firmware that are missing the L1D flush.
 
 Two small selftest fixes.
 
 Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das, Tulio
 Magno Quites Machado Filho.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmAMrdETHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgF+xD/97Hl6qlS5GllC1mVHUW/eyekfqt7io
 XHSPeeayZddj3FhfX+eyfICUtmeQP9MXhp9Jyz/JyD+RCpIIiEbJ40CnCpsLeOFj
 RmRxiVX1lB8tEp5gMZnVdv8z3XDTkzt9YDPlHA7Ta/rqkD0ursF8yVnDG61ZjWnF
 +2uAozJp9p+nOVrd6wqd5wzdJyWyVXHCkCfyN/HD1rPAR0b6oFm8R3Pca7UR6Qk3
 Nt5deAaTisl3jDT+1C+PNq/YErHeERwNyLcRP7dT7VvO9Cch2ijxqc3e0loLT1Nz
 bTL6bI1JhqePXZQRWRTxpk0A6yss0hWlMudSoRbQzhDbcEY+4QxK90oXevalHXcC
 C3e3qgO6JoN1hN4tw9kFQr1r+x+xsKpR+jO9MfE7ndFGfAPc2An5rSRbFEB6cUvr
 qqVh3iNHNGILeE74EyAthjGF2nZEmxI1DAFZFjoDoeWMi+VyE2MAzcIMJg22mhYx
 5n64CCrqqTjhXtJgXm31L8WUKHz2JZnwwH3nTnrPXswzTMhkYp2fTQSySWCE4Zyo
 xygoMfnxIuNkL3hZN0cxNzjr0uGrWIdQq/JFui1Wek68qNK0NEH5r5FmDTTFgk0D
 A3sJAEmKehs3Ka64LA5ejYza0zU1wnrOcyJKDpxQAWKldcKiQQw7Gup9dQe218Vc
 ZemyuT9YDlOilQ==
 =4nYe
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a bad interaction between the scv handling and the fallback L1D
   flush, which could lead to user register corruption. Only affects
   people using scv (~no one) on machines with old firmware that are
   missing the L1D flush.

 - Two small selftest fixes.

Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das,
and Tulio Magno Quites Machado Filho.

* tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: fix scv entry fallback flush vs interrupt
  selftests/powerpc: Only test lwm/stmw on big endian
  selftests/powerpc: Fix exit status of pkey tests
2021-01-24 09:40:51 -08:00
Linus Torvalds
c509ce2378 for-linus-2021-01-24
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYA1opwAKCRCRxhvAZXjc
 osnpAP4wjExvtwgh1eA7IgBPtAFzL1EPK2lrv7WM6yuMJNh23wEAxU+quoNrBT7U
 R5UQvmXi2SwxjeGXR/BTLq/HU9rSJA4=
 =6YJX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:

 - Jann reported sparse complaints because of a missing __user
   annotation in a helper we added way back when we added
   pidfd_send_signal() to avoid compat syscall handling. Fix it.

 - Yanfei replaces a reference in a comment to the _do_fork() helper I
   removed a while ago with a reference to the new kernel_clone()
   replacement

 - Alexander Guril added a simple coding style fix

* tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  kthread: remove comments about old _do_fork() helper
  Kernel: fork.c: Fix coding style: Do not use {} around single-line statements
  signal: Add missing __user annotation to copy_siginfo_from_user_any
2021-01-24 09:35:28 -08:00
Linus Torvalds
4dcd3bcc20 an important signal handling patch for stable, and two small cleanup patches
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAM+XcACgkQiiy9cAdy
 T1E44gwAnonpHv2eDi6Lo6/Ev3hP7kETaIcYc3Jql61gz0rn880cNh67C5osdxVY
 hdaKycuwQOBQbHcrt6kOfpOOcnjVXb92L/THzUJcHgUQx4bUFHNhuYt8NTrQFsnd
 ke+nLVbRcAiQRVAN46d9y2YQBA64SGUrbjjGyFD0ew13ORpsGRVBQ1UjEXzUvObV
 d0rFrlVoAIZA3RAj5uxBeHnyIy3pZaviEhg0ZlL4QEqEDhcKh82kWcX4AdZFOpwI
 qrgJ1/TtJPj4bhPPCsWXwkSQSF9JETmigRCn03fRyykvdt2f4xIgvVyaqhWqmpvv
 BZL8BKzutcFw7c5FTgNvjxAH5G2xGpefdPKE+ve0AwGnL0yrWIjk6tbpCPs9BqEe
 nD2Pxgb+r9aCtXYYYgkZ59YOF50gdtJF7ffFqMWZ+OGckFSHOpLeFP8IW9Q102b9
 VRkMTZQSd0E60ClVvuaEFIVHX52bJxM0kgdtH/2DVGdARllA4QjZuQdXZuNughYi
 zKckmEyQ
 =wHrR
 -----END PGP SIGNATURE-----

Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "An important signal handling patch for stable, and two small cleanup
  patches"

* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: do not fail __smb_send_rqst if non-fatal signals are pending
  fs/cifs: Simplify bool comparison.
  fs/cifs: Assign boolean values to a bool variable
2021-01-24 09:27:14 -08:00
Shakeel Butt
5c447d274f mm: fix numa stats for thp migration
Currently the kernel is not correctly updating the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration.  Fix that.

For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, although at the moment
there is no need to handle THP migration as kernel still does not have
write support for file THP but to be more future proof, this patch adds
the THP support for those stats as well.

Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Shakeel Butt
8a8792f600 mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.

That was not an issue until recently the commit 5f9a4f4a70 ("mm:
memcontrol: add the missing numa_stat interface for cgroup v2") exposed
numa stats for the memcg.

So fix the file_dirty per-memcg numa stat.

Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a70 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Roman Gushchin
3de7d4f25a mm: memcg/slab: optimize objcg stock draining
Imran Khan reported a 16% regression in hackbench results caused by the
commit f2fe7b09a5 ("mm: memcg/slab: charge individual slab objects
instead of pages").  The regression is noticeable in the case of a
consequent allocation of several relatively large slab objects, e.g.
skb's.  As soon as the amount of stocked bytes exceeds PAGE_SIZE,
drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
to a number of atomic operations in page_counter_uncharge().

The corresponding call graph is below (provided by Imran Khan):

  |__alloc_skb
  |    |
  |    |__kmalloc_reserve.isra.61
  |    |    |
  |    |    |__kmalloc_node_track_caller
  |    |    |    |
  |    |    |    |slab_pre_alloc_hook.constprop.88
  |    |    |     obj_cgroup_charge
  |    |    |    |    |
  |    |    |    |    |__memcg_kmem_charge
  |    |    |    |    |    |
  |    |    |    |    |    |page_counter_try_charge
  |    |    |    |    |
  |    |    |    |    |refill_obj_stock
  |    |    |    |    |    |
  |    |    |    |    |    |drain_obj_stock.isra.68
  |    |    |    |    |    |    |
  |    |    |    |    |    |    |__memcg_kmem_uncharge
  |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |page_counter_uncharge
  |    |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |    |page_counter_cancel
  |    |    |    |
  |    |    |    |
  |    |    |    |__slab_alloc
  |    |    |    |    |
  |    |    |    |    |___slab_alloc
  |    |    |    |    |
  |    |    |    |slab_post_alloc_hook

Instead of directly uncharging the accounted kernel memory, it's
possible to refill the generic page-sized per-cpu stock instead.  It's a
much faster operation, especially on a default hierarchy.  As a bonus,
__memcg_kmem_uncharge_page() will also get faster, so the freeing of
page-sized kernel allocations (e.g.  large kmallocs) will become faster.

A similar change has been done earlier for the socket memory by the
commit 475d0487a2 ("mm: memcontrol: use per-cpu stocks for socket
memory uncharging").

Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com
Fixes: f2fe7b09a5 ("mm: memcg/slab: charge individual slab objects instead of pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Michal Koutn <mkoutny@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Mike Rapoport
d3921cb8be mm: fix initialization of struct page for holes in memory layout
There could be struct pages that are not backed by actual physical
memory.  This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized using init_unavailable_mem()
function that iterates through PFNs in holes in memblock.memory and if
there is a struct page corresponding to a PFN, the fields if this page
are set to default values and the page is marked as Reserved.

init_unavailable_mem() does not take into account zone and node the page
belongs to and sets both zone and node links in struct page to zero.

On a system that has firmware reserved holes in a zone above ZONE_DMA,
for instance in a configuration below:

	# grep -A1 E820 /proc/iomem
	7a17b000-7a216fff : Unknown E820 type
	7a217000-7bffffff : System RAM

unset zone link in struct page will trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
in struct page) in the same pageblock.

Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly setup the zone link and use node ID of the
adjacent range in memblock.memory to set the node link.

Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Mike Rapoport
bde9cfa3af x86/setup: don't remove E820_TYPE_RAM for pfn 0
Patch series "mm: fix initialization of struct page for holes in  memory layout", v3.

Commit 73a6e474cb ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported
back in April [1].  It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org

This patch (of 2):

The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory.  As the result,
pfn 0 was never recognised by the generic memory management and it is not
a part of neither node 0 nor ZONE_DMA.

If set_pfnblock_flags_mask() would be ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

Along with reserving the first 4Kb in e820 tables, several first pages are
reserved with memblock in several places during setup_arch().  These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.

Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.

Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 09:20:52 -08:00
Jens Axboe
02a13674fa io_uring: account io_uring internal files as REQ_F_INFLIGHT
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 10:15:33 -07:00
Pavel Begunkov
9d5c819068 io_uring: fix sleeping under spin in __io_clean_op
[   27.629441] BUG: sleeping function called from invalid context
	at fs/file.c:402
[   27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0,
	pid: 1012, name: io_wqe_worker-0
[   27.633220] 1 lock held by io_wqe_worker-0/1012:
[   27.634286]  #0: ffff888105e26c98 (&ctx->completion_lock)
	{....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70
[   27.649249] Call Trace:
[   27.649874]  dump_stack+0xac/0xe3
[   27.650666]  ___might_sleep+0x284/0x2c0
[   27.651566]  put_files_struct+0xb8/0x120
[   27.652481]  __io_clean_op+0x10c/0x2a0
[   27.653362]  __io_cqring_fill_event+0x2c1/0x350
[   27.654399]  __io_req_complete.part.102+0x41/0x70
[   27.655464]  io_openat2+0x151/0x300
[   27.656297]  io_issue_sqe+0x6c/0x14e0
[   27.660991]  io_wq_submit_work+0x7f/0x240
[   27.662890]  io_worker_handle_work+0x501/0x8a0
[   27.664836]  io_wqe_worker+0x158/0x520
[   27.667726]  kthread+0x134/0x180
[   27.669641]  ret_from_fork+0x1f/0x30

Instead of cleaning files on overflow, return back overflow cancellation
into io_uring_cancel_files(). Previously it was racy to clean
REQ_F_OVERFLOW flag, but we got rid of it, and can do it through
repetitive attempts targeting all matching requests.

Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 10:15:33 -07:00
Christoph Hellwig
f736d93d76
xfs: support idmapped mounts
Enable idmapped mounts for xfs. This basically just means passing down
the user_namespace argument from the VFS methods down to where it is
passed to the relevant helpers.

Note that full-filesystem bulkstat is not supported from inside idmapped
mounts as it is an administrative operation that acts on the whole file
system. The limitation is not applied to the bulkstat single operation
that just operates on a single inode.

Link: https://lore.kernel.org/r/20210121131959.646623-40-christian.brauner@ubuntu.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Christian Brauner
14f3db5542
ext4: support idmapped mounts
Enable idmapped mounts for ext4. All dedicated helpers we need for this
exist. So this basically just means we're passing down the
user_namespace argument from the VFS methods to the relevant helpers.

Let's create simple example where we idmap an ext4 filesystem:

 root@f2-vm:~# truncate -s 5G ext4.img

 root@f2-vm:~# mkfs.ext4 ./ext4.img
 mke2fs 1.45.5 (07-Jan-2020)
 Discarding device blocks: done
 Creating filesystem with 1310720 4k blocks and 327680 inodes
 Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
 Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (16384 blocks): done
 Writing superblocks and filesystem accounting information: done

 root@f2-vm:~# losetup -f --show ./ext4.img
 /dev/loop0

 root@f2-vm:~# mount /dev/loop0 /mnt

 root@f2-vm:~# ls -al /mnt/
 total 24
 drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
 drwx------  2 root root 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped1 where we map uid and gid
 # 0 to uid and gid 1000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/

 root@f2-vm:/# ls -al /idmapped1/
 total 24
 drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
 drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
 drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found

 # Let's create an idmapped mount at /idmapped2 where we map uid and gid
 # 0 to uid and gid 2000
 root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/

 root@f2-vm:/# ls -al /idmapped2/
 total 24
 drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
 drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
 drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found

Let's create another example where we idmap the rootfs filesystem
without a mapping for uid 0 and gid 0:

 # Create an idmapped mount of for a full POSIX range of rootfs under
 # /mnt but without a mapping for uid 0 to reduce attack surface

 root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/

 # Since we don't have a mapping for uid and gid 0 all files owned by
 # uid and gid 0 should show up as uid and gid 65534:
 root@f2-vm:/# ls -al /mnt/
 total 664
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
 drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
 drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
 drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
 lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
 lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
 lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
 drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
 drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
 drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
 lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
 drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
 drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
 drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
 drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
 drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var

 # Since we do have a mapping for uid and gid 1000 all files owned by
 # uid and gid 1000 should simply show up as uid and gid 1000:
 root@f2-vm:/# ls -al /mnt/home/ubuntu/
 total 40
 drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
 drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
 -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
 -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
 -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
 -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
 -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
 -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo

Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Christian Brauner
4b78993681
fat: handle idmapped mounts
Let fat handle idmapped mounts. This allows to have the same fat mount
appear in multiple locations with different id mappings. This allows to
expose a vfat formatted USB stick to multiple user with different ids on
the host or in user namespaces allowing for dac permissions:

mount -o uid=1000,gid=1000 /dev/sdb /mnt

u1001@f2-vm:/lower1$ ls -ln /mnt/
total 4
-rwxr-xr-x 1 1000 1000 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 1000 1000 0 Oct 28 01:09 bbb
-rwxr-xr-x 1 1000 1000 0 Oct 28 01:10 ccc
-rwxr-xr-x 1 1000 1000 0 Oct 28 03:46 ddd
-rwxr-xr-x 1 1000 1000 0 Oct 28 04:01 eee

mount-idmapped --map-mount b:1000:1001:1

u1001@f2-vm:/lower1$ ls -ln /lower1/
total 4
-rwxr-xr-x 1 1001 1001 4 Oct 28 03:44 aaa
-rwxr-xr-x 1 1001 1001 0 Oct 28 01:09 bbb
-rwxr-xr-x 1 1001 1001 0 Oct 28 01:10 ccc
-rwxr-xr-x 1 1001 1001 0 Oct 28 03:46 ddd
-rwxr-xr-x 1 1001 1001 0 Oct 28 04:01 eee

u1001@f2-vm:/lower1$ touch /lower1/fff

u1001@f2-vm:/lower1$ ls -ln /lower1/fff
-rwxr-xr-x 1 1001 1001 0 Oct 28 04:03 /lower1/fff

u1001@f2-vm:/lower1$ ls -ln /mnt/fff
-rwxr-xr-x 1 1000 1000 0 Oct 28 04:03 /mnt/fff

Link: https://lore.kernel.org/r/20210121131959.646623-38-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:46 +01:00
Christian Brauner
01eadc8dd9
tests: add mount_setattr() selftests
Add a range of selftests for the new mount_setattr() syscall to verify
that it works as expected. This tests that:
- no invalid flags can be specified
- changing properties of a single mount works and leaves other mounts in
  the mount tree unchanged
- changing a mount tre to read-only when one of the mounts has writers
  fails and leaves the whole mount tree unchanged
- changing mount properties from multiple threads works
- changing atime settings works
- changing mount propagation works
- changing the mount options of a mount tree where the individual mounts
  in the tree have different mount options only changes the flags that
  were requested to change
- changing mount options from another mount namespace fails
- changing mount options from another user namespace fails
- idmapped mounts

Note, the main test-suite for idmapped mounts is part of xfstests and is
pretty huge. These tests here just make sure that the syscalls bits work
correctly.

 TAP version 13
 1..20
 # Starting 20 tests from 3 test cases.
 #  RUN           mount_setattr.invalid_attributes ...
 #            OK  mount_setattr.invalid_attributes
 ok 1 mount_setattr.invalid_attributes
 #  RUN           mount_setattr.extensibility ...
 #            OK  mount_setattr.extensibility
 ok 2 mount_setattr.extensibility
 #  RUN           mount_setattr.basic ...
 #            OK  mount_setattr.basic
 ok 3 mount_setattr.basic
 #  RUN           mount_setattr.basic_recursive ...
 #            OK  mount_setattr.basic_recursive
 ok 4 mount_setattr.basic_recursive
 #  RUN           mount_setattr.mount_has_writers ...
 #            OK  mount_setattr.mount_has_writers
 ok 5 mount_setattr.mount_has_writers
 #  RUN           mount_setattr.mixed_mount_options ...
 #            OK  mount_setattr.mixed_mount_options
 ok 6 mount_setattr.mixed_mount_options
 #  RUN           mount_setattr.time_changes ...
 #            OK  mount_setattr.time_changes
 ok 7 mount_setattr.time_changes
 #  RUN           mount_setattr.multi_threaded ...
 #            OK  mount_setattr.multi_threaded
 ok 8 mount_setattr.multi_threaded
 #  RUN           mount_setattr.wrong_user_namespace ...
 #            OK  mount_setattr.wrong_user_namespace
 ok 9 mount_setattr.wrong_user_namespace
 #  RUN           mount_setattr.wrong_mount_namespace ...
 #            OK  mount_setattr.wrong_mount_namespace
 ok 10 mount_setattr.wrong_mount_namespace
 #  RUN           mount_setattr_idmapped.invalid_fd_negative ...
 #            OK  mount_setattr_idmapped.invalid_fd_negative
 ok 11 mount_setattr_idmapped.invalid_fd_negative
 #  RUN           mount_setattr_idmapped.invalid_fd_large ...
 #            OK  mount_setattr_idmapped.invalid_fd_large
 ok 12 mount_setattr_idmapped.invalid_fd_large
 #  RUN           mount_setattr_idmapped.invalid_fd_closed ...
 #            OK  mount_setattr_idmapped.invalid_fd_closed
 ok 13 mount_setattr_idmapped.invalid_fd_closed
 #  RUN           mount_setattr_idmapped.invalid_fd_initial_userns ...
 #            OK  mount_setattr_idmapped.invalid_fd_initial_userns
 ok 14 mount_setattr_idmapped.invalid_fd_initial_userns
 #  RUN           mount_setattr_idmapped.attached_mount_inside_current_mount_namespace ...
 #            OK  mount_setattr_idmapped.attached_mount_inside_current_mount_namespace
 ok 15 mount_setattr_idmapped.attached_mount_inside_current_mount_namespace
 #  RUN           mount_setattr_idmapped.attached_mount_outside_current_mount_namespace ...
 #            OK  mount_setattr_idmapped.attached_mount_outside_current_mount_namespace
 ok 16 mount_setattr_idmapped.attached_mount_outside_current_mount_namespace
 #  RUN           mount_setattr_idmapped.detached_mount_inside_current_mount_namespace ...
 #            OK  mount_setattr_idmapped.detached_mount_inside_current_mount_namespace
 ok 17 mount_setattr_idmapped.detached_mount_inside_current_mount_namespace
 #  RUN           mount_setattr_idmapped.detached_mount_outside_current_mount_namespace ...
 #            OK  mount_setattr_idmapped.detached_mount_outside_current_mount_namespace
 ok 18 mount_setattr_idmapped.detached_mount_outside_current_mount_namespace
 #  RUN           mount_setattr_idmapped.change_idmapping ...
 #            OK  mount_setattr_idmapped.change_idmapping
 ok 19 mount_setattr_idmapped.change_idmapping
 #  RUN           mount_setattr_idmapped.idmap_mount_tree_invalid ...
 #            OK  mount_setattr_idmapped.idmap_mount_tree_invalid
 ok 20 mount_setattr_idmapped.idmap_mount_tree_invalid
 # PASSED: 20 / 20 tests passed.
 # Totals: pass:20 fail:0 xfail:0 xpass:0 skip:0 error:0

Link: https://lore.kernel.org/r/20210121131959.646623-37-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:45 +01:00
Christian Brauner
9caccd4154
fs: introduce MOUNT_ATTR_IDMAP
Introduce a new mount bind mount property to allow idmapping mounts. The
MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
together with a file descriptor referring to a user namespace.

The user namespace referenced by the namespace file descriptor will be
attached to the bind mount. All interactions with the filesystem going
through that mount will be mapped according to the mapping specified in
the user namespace attached to it.

Using user namespaces to mark mounts means we can reuse all the existing
infrastructure in the kernel that already exists to handle idmappings
and can also use this for permission checking to allow unprivileged user
to create idmapped mounts in the future.

Idmapping a mount is decoupled from the caller's user and mount
namespace. This means idmapped mounts can be created in the initial
user namespace which is an important use-case for systemd-homed,
portable usb-sticks between systems, sharing data between the initial
user namespace and unprivileged containers, and other use-cases that
have been brought up. For example, assume a home directory where all
files are owned by uid and gid 1000 and the home directory is brought to
a new laptop where the user has id 12345. The system administrator can
simply create a mount of this home directory with a mapping of
1000:12345:1 and other mappings to indicate the ids should be kept.
(With this it is e.g. also possible to create idmapped mounts on the
host with an identity mapping 1:1:100000 where the root user is not
mapped. A user with root access that e.g. has been pivot rooted into
such a mount on the host will be not be able to execute, read, write, or
create files as root.)

Given that mapping a mount is decoupled from the caller's user namespace
a sufficiently privileged process such as a container manager can set up
an idmapped mount for the container and the container can simply pivot
root to it. There's no need for the container to do anything. The mount
will appear correctly mapped independent of the user namespace the
container uses. This means we don't need to mark a mount as idmappable.

In order to create an idmapped mount the caller must currently be
privileged in the user namespace of the superblock the mount belongs to.
Once a mount has been idmapped we don't allow it to change its mapping.
This keeps permission checking and life-cycle management simple. Users
wanting to change the idmapped can always create a new detached mount
with a different idmapping.

Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:43:45 +01:00
Christian Brauner
2a1867219c
fs: add mount_setattr()
This implements the missing mount_setattr() syscall. While the new mount
api allows to change the properties of a superblock there is currently
no way to change the properties of a mount or a mount tree using file
descriptors which the new mount api is based on. In addition the old
mount api has the restriction that mount options cannot be applied
recursively. This hasn't changed since changing mount options on a
per-mount basis was implemented in [1] and has been a frequent request
not just for convenience but also for security reasons. The legacy
mount syscall is unable to accommodate this behavior without introducing
a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
mount. Changing MS_REC to apply to the whole mount tree would mean
introducing a significant uapi change and would likely cause significant
regressions.

The new mount_setattr() syscall allows to recursively clear and set
mount options in one shot. Multiple calls to change mount options
requesting the same changes are idempotent:

int mount_setattr(int dfd, const char *path, unsigned flags,
                  struct mount_attr *uattr, size_t usize);

Flags to modify path resolution behavior are specified in the @flags
argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
restrict path resolution as introduced with openat2() might be supported
in the future.

The mount_setattr() syscall can be expected to grow over time and is
designed with extensibility in mind. It follows the extensible syscall
pattern we have used with other syscalls such as openat2(), clone3(),
sched_{set,get}attr(), and others.
The set of mount options is passed in the uapi struct mount_attr which
currently has the following layout:

struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};

The @attr_set and @attr_clr members are used to clear and set mount
options. This way a user can e.g. request that a set of flags is to be
raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
@attr_set while at the same time requesting that another set of flags is
to be lowered such as removing noexec from a mount tree by specifying
MOUNT_ATTR_NOEXEC in @attr_clr.

Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
not a bitmap, users wanting to transition to a different atime setting
cannot simply specify the atime setting in @attr_set, but must also
specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
@attr_clr.

The @propagation field lets callers specify the propagation type of a
mount tree. Propagation is a single property that has four different
settings and as such is not really a flag argument but an enum.
Specifically, it would be unclear what setting and clearing propagation
settings in combination would amount to. The legacy mount() syscall thus
forbids the combination of multiple propagation settings too. The goal
is to keep the semantics of mount propagation somewhat simple as they
are overly complex as it is.

The @userns_fd field lets user specify a user namespace whose idmapping
becomes the idmapping of the mount. This is implemented and explained in
detail in the next patch.

[1]: commit 2e4b7fcd92 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:42:45 +01:00
Christian Brauner
5b490500f9
fs: add attr_flags_to_mnt_flags helper
Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
which we will use in follow-up patches too.

Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner
fbdc2f6c40
fs: split out functions to hold writers
When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
aren't currently any active writers. Split this logic out into simple
helpers that we can use in follow-up patches.

Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner
e58ace1a0f
namespace: only take read lock in do_reconfigure_mnt()
do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
which seems unnecessary since we're not changing the superblock. We're
only checking whether it is already read-only. Setting other mount
attributes is protected by lock_mount_hash() afaict and not by s_umount.

The history of down_write(&sb->s_umount) lock being taken when setting
mount attributes dates back to the introduction of MNT_READONLY in [2].
This introduced the concept of having read-only mounts in contrast to
just having a read-only superblock. When it got introduced it was simply
plumbed into do_remount() which already took down_write(&sb->s_umount)
because it was only used to actually change the superblock before [2].
Afaict, it would've already been possible back then to only use
down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
options were protected by the vfsmount lock already. But that would've
meant special casing the locking for MS_BIND | MS_REMOUNT in
do_remount() which people might not have considered worth it.
Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
lock was simply copied over.
Now that we have this be a separate helper only take the
down_read(&sb->s_umount) lock since we're only interested in checking
whether the super block is currently read-only and blocking any writers
from changing it. Essentially, checking that the super block is
read-only has the advantage that we can avoid having to go into the
slowpath and through MNT_WRITE_HOLD and can simply set the read-only
flag on the mount in set_mount_attributes().

[1]: commit 43f5e655ef ("vfs: Separate changing mount flags full remount")
[2]: commit 2e4b7fcd92 ("[PATCH] r/o bind mounts: honor mount writer counts at remount")

Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner
d033cb6784
mount: make {lock,unlock}_mount_hash() static
The lock_mount_hash() and unlock_mount_hash() helpers are never called
outside a single file. Remove them from the header and make them static
to reflect this fact. There's no need to have them callable from other
places right now, as Christoph observed.

Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner
68847c9417
namespace: take lock_mount_hash() directly when changing flags
Changing mount options always ends up taking lock_mount_hash() but when
MNT_READONLY is requested and neither the mount nor the superblock are
MNT_READONLY we end up taking the lock, dropping it, and retaking it to
change the other mount attributes. Instead, let's acquire the lock once
when changing the mount attributes. This simplifies the locking in these
codepath, makes them easier to reason about and avoids having to
reacquire the lock right after dropping it.

Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:34 +01:00
Christian Brauner
899bf2ceb3
nfs: do not export idmapped mounts
Prevent nfs from exporting idmapped mounts until we have ported it to
support exporting idmapped mounts.

Link: https://lore.kernel.org/linux-api/20210123130958.3t6kvgkl634njpsm@wittgenstein
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "J. Bruce Fields" <bfields@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:29:33 +01:00
Christian Brauner
029a52ada6
overlayfs: do not mount on top of idmapped mounts
Prevent overlayfs from being mounted on top of idmapped mounts.
Stacking filesystems need to be prevented from being mounted on top of
idmapped mounts until they have have been converted to handle this.

Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
0f16ff0f54
ecryptfs: do not mount on top of idmapped mounts
Prevent ecryptfs from being mounted on top of idmapped mounts.
Stacking filesystems need to be prevented from being mounted on top of
idmapped mounts until they have have been converted to handle this.

Link: https://lore.kernel.org/r/20210121131959.646623-28-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
a2d2329e30
ima: handle idmapped mounts
IMA does sometimes access the inode's i_uid and compares it against the
rules' fowner. Enable IMA to handle idmapped mounts by passing down the
mount's user namespace. We simply make use of the helpers we introduced
before. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
3cee6079f6
apparmor: handle idmapped mounts
The i_uid and i_gid are mostly used when logging for AppArmor. This is
broken in a bunch of places where the global root id is reported instead
of the i_uid or i_gid of the file. Nonetheless, be kind and log the
mapped inode if we're coming from an idmapped mount. If the initial user
namespace is passed nothing changes so non-idmapped mounts will see
identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-26-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the exisiting inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesnt't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner
1ab29965b3
exec: handle idmapped mounts
When executing a setuid binary the kernel will verify in bprm_fill_uid()
that the inode has a mapping in the caller's user namespace before
setting the callers uid and gid. Let bprm_fill_uid() handle idmapped
mounts. If the inode is accessed through an idmapped mount it is mapped
according to the mount's user namespace. Afterwards the checks are
identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:19 +01:00
Christian Brauner
435ac6214e
would_dump: handle idmapped mounts
When determining whether or not to create a coredump the vfs will verify
that the caller is privileged over the inode. Make the would_dump()
helper handle idmapped mounts by passing down the mount's user namespace
of the exec file. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:19 +01:00
Christian Brauner
0f5d220b42
ioctl: handle idmapped mounts
Enable generic ioctls to handle idmapped mounts by passing down the
mount's user namespace. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-22-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:19 +01:00
Christian Brauner
b816dd5dde
init: handle idmapped mounts
Enable the init helpers to handle idmapped mounts by passing down the
mount's user namespace. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-21-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:19 +01:00
Christian Brauner
9eccd12ce7
fcntl: handle idmapped mounts
Enable the setfl() helper to handle idmapped mounts by passing down the
mount's user namespace. If the initial user namespace is passed nothing
changes so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-20-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:19 +01:00