wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL
address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmE3uicACgkQMUZtbf5S
IrtJVA//XdE8qAmw1JukjyYC87JH2ale20eoZ6ERn7/09e4tdv3M6dOTI4YfrM6+
CMNP5MP2qit3IzY+lN0+yt9AAFH7k85z3MA8zLxsXN4z63OJcZvFv/G/OWy4Wp/0
vOo/DH+rF3LR+fZZvjJI+8Xi9/orsRpD12cwGmjGRxybh+XcnHKI/GvK2RgE6oBR
015RfBbbQBpzFQvESLnSwDzabN1XFEL1x/bz7N8ek3okfO/tab+f3E1tb6eYtTy+
jyDyOWpayd4xDttKNMUuxwS1q+/oAWOAq8PzkaF/ZG2sBH1Z4yZN9ZtsLNZmPG8N
5L1FEem/Nmgr54T9v/FhfiryhhGGysVfVgtQcCBkKRmVn1Kk2L6dFvtuanPtFFd3
llbi5PvCDJy3rbMmxKmyoM3T4jpMwWxQRZKsosw+k/WQfb8/SUOjgpY713V1Wx/P
S+2uadU4l9Ql9sF6X0IqZABnnt+j/BuDo6C6vVq7vyj0iQ9hEX9YxC0ybrAHOYpH
suHWKndodRfTxxVOg8xRNYwXyRLNbm1AP6LMDNKBlFUjwNSZ362qFX7W7DuXoRup
Rrnb8V1QFvM+pyFb2a0qNtBS68IXbjCdVQX5e8a5ELaAUnDPefNrfPN+/rrTLEtV
LnusmBF+02llVSYdr88t1e+LmzqS/aqXFy2ry4y6owjq20ld2O0=
=Zvuz
-----END PGP SIGNATURE-----
Merge tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes and stragglers from Jakub Kicinski:
"Networking stragglers and fixes, including changes from netfilter,
wireless and can.
Current release - regressions:
- qrtr: revert check in qrtr_endpoint_post(), fixes audio and wifi
- ip_gre: validate csum_start only on pull
- bnxt_en: fix 64-bit doorbell operation on 32-bit kernels
- ionic: fix double use of queue-lock, fix a sleeping in atomic
- can: c_can: fix null-ptr-deref on ioctl()
- cs89x0: disable compile testing on powerpc
Current release - new code bugs:
- bridge: mcast: fix vlan port router deadlock, consistently disable
BH
Previous releases - regressions:
- dsa: tag_rtl4_a: fix egress tags, only port 0 was working
- mptcp: fix possible divide by zero
- netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
- netfilter: socket: icmp6: fix use-after-scope
- stmmac: fix MAC not working when system resume back with WoL active
Previous releases - always broken:
- ip/ip6_gre: use the same logic as SIT interfaces when computing
v6LL address
- seg6: set fc_nlinfo in nh_create_ipv4, nh_create_ipv6
- mptcp: only send extra TCP acks in eligible socket states
- dsa: lantiq_gswip: fix maximum frame length
- stmmac: fix overall budget calculation for rxtx_napi
- bnxt_en: fix firmware version reporting via devlink
- renesas: sh_eth: add missing barrier to fix freeing wrong tx
descriptor
Stragglers:
- netfilter: conntrack: switch to siphash
- netfilter: refuse insertion if chain has grown too large
- ncsi: add get MAC address command to get Intel i210 MAC address"
* tag 'net-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
ieee802154: Remove redundant initialization of variable ret
net: stmmac: fix MAC not working when system resume back with WoL active
net: phylink: add suspend/resume support
net: renesas: sh_eth: Fix freeing wrong tx descriptor
bonding: 3ad: pass parameter bond_params by reference
cxgb3: fix oops on module removal
can: c_can: fix null-ptr-deref on ioctl()
can: rcar_canfd: add __maybe_unused annotation to silence warning
net: wwan: iosm: Unify IO accessors used in the driver
net: wwan: iosm: Replace io.*64_lo_hi() with regular accessors
net: qcom/emac: Replace strlcpy with strscpy
ip6_gre: Revert "ip6_gre: add validation for csum_start"
net: hns3: make hclgevf_cmd_caps_bit_map0 and hclge_cmd_caps_bit_map0 static
selftests/bpf: Test XDP bonding nest and unwind
bonding: Fix negative jump label count on nested bonding
MAINTAINERS: add VM SOCKETS (AF_VSOCK) entry
stmmac: dwmac-loongson:Fix missing return value
iwlwifi: fix printk format warnings in uefi.c
net: create netdev->dev_addr assignment helpers
bnxt_en: Fix possible unintended driver initiated error recovery
...
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as they
do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is disabled
- bugfixes and cleanups, especially with respect to 1) vCPU reset and
2) choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmE2CIAUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMyqwf+Ky2WoThuQ9Ra0r/m8pUTAx5+gsAf
MmG24rNLE+26X0xuBT9Q5+etYYRLrRTWJvo5cgHooz7muAYW6scR+ho5xzvLTAxi
DAuoijkXsSdGoFCp0OMUHiwG3cgY5N7feTEwLPAb2i6xr/l6SZyCP4zcwiiQbJ2s
UUD0i3rEoNQ02/hOEveud/ENxzUli9cmmgHKXR3kNgsJClSf1fcuLnhg+7EGMhK9
+c2V+hde5y0gmEairQWm22MLMRolNZ5NL4kjykiNh2M5q9YvbHe5+f/JmENlNZMT
bsUQT6Ry1ukuJ0V59rZvUw71KknPFzZ3d6HgW4pwytMq6EJKiISHzRbVnQ==
=FCAB
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual
PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as
they do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized
LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is
disabled
- bugfixes and cleanups, especially with respect to vCPU reset and
choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless
necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
KVM: Drop unused kvm_dirty_gfn_invalid()
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
KVM: MMU: mark role_regs and role accessors as maybe unused
KVM: MIPS: Remove a "set but not used" variable
x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
KVM: stats: Add VM stat for remote tlb flush requests
KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
kvm: x86: Increase MAX_VCPUS to 1024
kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
KVM: s390: index kvm->arch.idle_mask by vcpu_idx
KVM: s390: Enable specification exception interpretation
KVM: arm64: Trim guest debug exception handling
KVM: SVM: Add 5-level page table support for SVM
...
This moves the crash recovery worker to the freezable work queue to
avoid interaction with other drivers during suspend & resume. It fixes a
couple of typos in comments.
It adds support for handling the audio DSP on SDM660 and it fixes a race
between the Qualcomm wireless subsystem driver and the associated driver
for the RF chip.
-----BEGIN PGP SIGNATURE-----
iQJPBAABCAA5FiEEBd4DzF816k8JZtUlCx85Pw2ZrcUFAmE3cD4bHGJqb3JuLmFu
ZGVyc3NvbkBsaW5hcm8ub3JnAAoJEAsfOT8Nma3FjdkQAKTm0N2mjd9mpKMmOhPx
z9lsD/OcJnM7Rpo16jBzfMkJF90LFqSIjc0grUTaBXMMLtdSlA08QnkTqK1CLdQ8
haseuZ7zs6H9MycdoL9krCMAmSAcNttcbYW4/TBpilK4n6NNXZs1tgBkk3YUVRRx
fym1n2mpWr9kMAy6rpY/HUoLR5s3zNPGJ4K02uEqVVpueYAr1ZK72oC9EO/e5m1e
I6M4fhfT6Htypk8oCJUFcvvRqu1O03fpsJ7cwvbaE/L2+96ttje+Ux4P69x/9B08
oNfMA7sVVd68tblRcYO7hT9GN/W7FIhgLVWK8r3IPr5iDUTutUSNl2WBH1xXh7kT
baifEKtbIBYEuw+1zHE2topBs2P5nhfv9uOWfvOtV38ahtPuncUUGjzmTlf02LgP
hnxrJpaIsHktw/OgY9lq2apyGCvwTN/xpB9Ng0j5MaByhkatO+vpTIwAVi6OSoRN
G95/IXHXdY1C6a6rtVvhIXNkHIwFApPDD2ZeveYJs2/WfL05h9pj979Q60h3S6od
Ff8uQjZe10d7ogFH6R5NQdhCPdBV6BEtUFiQMrhpm1qrNk/TAX9dBnkMl9xSwfgy
dRYiHwEgn2fGe3K3yExI77MRdzChvxrPDZYXTCar3nqaLrXPx8qM1rc/RbkF/PXE
3FKpUS9XTQUsLdMSkGh8bZ3L
=EX0n
-----END PGP SIGNATURE-----
Merge tag 'rproc-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc
Pull remoteproc updates from Bjorn Andersson:
- move the crash recovery worker to the freezable work queue to avoid
interaction with other drivers during suspend & resume
- fix a couple of typos in comments
- add support for handling the audio DSP on SDM660
- fix a race between the Qualcomm wireless subsystem driver and the
associated driver for the RF chip
* tag 'rproc-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/andersson/remoteproc:
remoteproc: q6v5_pas: Add sdm660 ADSP PIL compatible
dt-bindings: remoteproc: qcom: adsp: Add SDM660 ADSP
remoteproc: use freezable workqueue for crash notifications
remoteproc: fix kernel doc for struct rproc_ops
remoteproc: fix an typo in fw_elf_get_class code comments
remoteproc: qcom: wcnss: Fix race with iris probe
- Add support for registering devices via MFD cells to Simple MFD (I2C)
- New Drivers
- Add support for Renesas Synchronization Management Unit (SMU)
- New Device Support
- Add support for N5010 to Intel M10 BMC
- Add support for Cannon Lake to Intel LPSS ACPI
- Add support for Samsung SSG{1,2} to ST-Ericsson's U8500 family
- Add support for TQMx110EB and TQMxE40x to TQ-Systems PLD TQMx86
- New Functionality
- Add support for GPIO to Intel LPC ICH
- Add support for Reset to Texas Instruments TPS65086
- Fix-ups
- Trivial, sorting, whitespace, renaming, etc; mt6360-core, db8500-prcmu-regs, tqmx86
- Device Tree fiddling; syscon, axp20x, qcom,pm8008, ti,tps65086, brcm,cru
- Use proper APIs for IRQ map resolution; ab8500-core, stmpe, tc3589x, wm8994-irq
- Pass 'supplied-from' property through axp288_fuel_gauge via swnode
- Remove unused file entry; MAINTAINERS
- Make interrupt line optional; tps65086
- Rename db8500-cpuidle driver symbol; db8500-prcmu
- Remove support for unused hardware; tqmx86
- Provide a standard LPC clock frequency for unknown boards; tqmx86
- Remove unused code; ti_am335x_tscadc
- Use of_iomap() instead of ioremap(); syscon
- Bug Fixes
- Clear GPIO IRQ resource flags when no IRQ is set; tqmx86
- Fix incorrect/misleading frequencies; db8500-prcmu
- Mitigate namespace clash with other GPIOBASE users
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEdrbJNaO+IJqU8IdIUa+KL4f8d2EFAmE3bf4ACgkQUa+KL4f8
d2F2YxAApt42Zm+DrwQvt6a5k8ZS3LVIfiMBqVZOdlcN9LcoqKsZKgnQRyqjjKul
rX4YcCLqfwv01FxBk37Xf1toTnbIvggKZjjEt7tf2ZHO1oxzuHnnODVrm6mxXEeX
inDK8fBvmevebq5K2VH5yEAZJ5my1/IjcBs8S/txeCbW8BYu/U6Bw0gMfuQjdW64
+3jgKk+o03uN9OhGKIp2eBNoF/RkdautpfK6Pyl3bPmKxCZ7BMRrGw11I+HjZLqz
AzfAmcQne0m8kQ8rvLYXyQrNu51xCcfxThrw8A4diTKqOXjxX0W9/OfgdcCo8uBa
OVz22DKkK+UyajvNnfMZLzVfj9HtZTRBPl13OZN7WA42oztLWT7IgEQq+MTr1fef
FOQJ1njiix6oXwKMdDR/Z9xoVfxQkcLIwCIw/wMC+kT/bKdZI0UtJok5iCH2Se0A
zcdHKqc3khUR+55A9Ie1JeaZAlCViFEXdpF2SQjqBQ6c6r94mMhJryTWDcDV6Thm
e9GwUXzG2J6u6/0mSHdeBNMDVbDetbWKTTRMMD5FlKv3d60ZvczutT7qjbkiol8S
OzS+YnjtwlMtPKpK/4+Wgfbsuu1KWOdHkIcVvEj9fdf2Kkqhsi2kXqq95GnkeZP5
vyZ1mt/sneX9LnHi3NCPQJcXj2wBZSmsnsN/pZJZhTFFu8UBTIU=
=7Zv8
-----END PGP SIGNATURE-----
Merge tag 'mfd-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"Core Frameworks:
- Add support for registering devices via MFD cells to Simple MFD (I2C)
New Drivers:
- Add support for Renesas Synchronization Management Unit (SMU)
New Device Support:
- Add support for N5010 to Intel M10 BMC
- Add support for Cannon Lake to Intel LPSS ACPI
- Add support for Samsung SSG{1,2} to ST-Ericsson's U8500 family
- Add support for TQMx110EB and TQMxE40x to TQ-Systems PLD TQMx86
New Functionality:
- Add support for GPIO to Intel LPC ICH
- Add support for Reset to Texas Instruments TPS65086
Fix-ups:
- Trivial, sorting, whitespace, renaming, etc; mt6360-core, db8500-prcmu-regs, tqmx86
- Device Tree fiddling; syscon, axp20x, qcom,pm8008, ti,tps65086, brcm,cru
- Use proper APIs for IRQ map resolution; ab8500-core, stmpe, tc3589x, wm8994-irq
- Pass 'supplied-from' property through axp288_fuel_gauge via swnode
- Remove unused file entry; MAINTAINERS
- Make interrupt line optional; tps65086
- Rename db8500-cpuidle driver symbol; db8500-prcmu
- Remove support for unused hardware; tqmx86
- Provide a standard LPC clock frequency for unknown boards; tqmx86
- Remove unused code; ti_am335x_tscadc
- Use of_iomap() instead of ioremap(); syscon
Bug Fixes:
- Clear GPIO IRQ resource flags when no IRQ is set; tqmx86
- Fix incorrect/misleading frequencies; db8500-prcmu
- Mitigate namespace clash with other GPIOBASE users"
* tag 'mfd-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (31 commits)
mfd: lpc_sch: Rename GPIOBASE to prevent build error
mfd: syscon: Use of_iomap() instead of ioremap()
dt-bindings: mfd: Add Broadcom CRU
mfd: ti_am335x_tscadc: Delete superfluous error message
mfd: tqmx86: Assume 24MHz LPC clock for unknown boards
mfd: tqmx86: Add support for TQ-Systems DMI IDs
mfd: tqmx86: Add support for TQMx110EB and TQMxE40x
mfd: tqmx86: Fix typo in "platform"
mfd: tqmx86: Remove incorrect TQMx90UC board ID
mfd: tqmx86: Clear GPIO IRQ resource when no IRQ is set
mfd: simple-mfd-i2c: Add support for registering devices via MFD cells
mfd/cpuidle: ux500: Rename driver symbol
mfd: tps65086: Add cell entry for reset driver
mfd: tps65086: Make interrupt line optional
dt-bindings: mfd: Convert tps65086.txt to YAML
MAINTAINERS: Adjust ARM/NOMADIK/Ux500 ARCHITECTURES to file renaming
mfd: db8500-prcmu: Handle missing FW variant
mfd: db8500-prcmu: Rename register header
mfd: axp20x: Add supplied-from property to axp288_fuel_gauge cell
mfd: Don't use irq_create_mapping() to resolve a mapping
...
- new driver: gpio-virtio allowing a guest VM running linux to access
GPIO lines provided by the host
- split the GPIO driver out of the rockchip pin control driver
- add support for a new model to gpio-aspeed-sgpio, refactor the driver
and use generic device property interfaces, improve property sanitization
- add ACPI support to gpio-tegra186
- improve the code setting the line names to support multiple GPIO banks
per device
- constify a bunch of OF functions in the core GPIO code and make the
declaration for one of the core OF functions we use consistent within its
header
- use software nodes in intel_quark_i2c_gpio
- add support for the gpio-line-names property in gpio-mt7621
- use the standard GPIO function for setting the GPIO names in gpio-brcmstb
- fix a bunch of leaks and other bugs in gpio-mpc8xxx
- use generic pm callbacks in gpio-ml-ioh
- improve resource management and PM handling in gpio-mlxbf2
- modernize and improve the gpio-dwapb driver
- coding style improvements in gpio-rcar
- documentation fixes and improvements
- update the MAINTAINERS entry for gpio-zynq
- minor tweaks in several drivers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEFp3rbAvDxGAT0sefEacuoBRx13IFAmE3ItgACgkQEacuoBRx
13LKqA//Q2E9QYJx0lLO0C15JKzJjqYGyVqSm73Huo8pKWRSKr9YuB4/tBPKz5Ya
pg/h7wbsk+mtQ0pqot/SEGVLo1rK6ZcPiCDkYkuaVsY9pS0zX7A/Sb2N7xKML+Nj
wTFuhSZH2byxH6QgUrX3RLMagW4p/owlDvFjZ6Z6Vh8Ulsnb0pdL3nYg5L017GMT
A1ySzbP79NK3LfOLTqdqgALv4EF2x+paolyEpI/Jv0naBYkIP4AcbOBQEVMpoCY/
XEcIdPvqMyPm4PdYSy3iCqtkf7jclDbV030SHlir2bKHjI79l8ARy0Tu6hvISRSG
8XMwt6ke40GYnPkESZTkWlqeVHYzli84FxYXYLnqFa/21c4qswHk/aZZq5h83fn8
7aonkEQQuHfQM00MvLu0mhtKXYdLbqv7jjd0CYChwxQSpu0iu7IQSWW8c2YmGvvt
vqfM8TdKyGNPAmSl2/enPKOr+LugG4rcgMehU9/p6QvHbB6y2SxC1MykldPOdC6d
53PeDeNP6XOp2s10zVPWh6P0rbrMaEtv/GZ143kUw9bhb1g3woX5SaS7W76cOXhE
kty6g1e8xNaKDZbJ++UAh7G9IGQdtz0xRCXDUHFUc89uCThc9RcHowNPtzMMnmgM
ucWJ81XnGDSDyzzG/f7uhcOtszWYCOtFmteooaMGB36/pH2CG9Y=
=y/RY
-----END PGP SIGNATURE-----
Merge tag 'gpio-updates-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio updates from Bartosz Golaszewski:
"We mostly have various improvements and refactoring all over the place
but also some interesting new features - like the virtio GPIO driver
that allows guest VMs to use host's GPIOs. We also have a new/old GPIO
driver for rockchip - this one has been split out of the pinctrl
driver.
Summary:
- new driver: gpio-virtio allowing a guest VM running linux to access
GPIO lines provided by the host
- split the GPIO driver out of the rockchip pin control driver
- add support for a new model to gpio-aspeed-sgpio, refactor the
driver and use generic device property interfaces, improve property
sanitization
- add ACPI support to gpio-tegra186
- improve the code setting the line names to support multiple GPIO
banks per device
- constify a bunch of OF functions in the core GPIO code and make the
declaration for one of the core OF functions we use consistent
within its header
- use software nodes in intel_quark_i2c_gpio
- add support for the gpio-line-names property in gpio-mt7621
- use the standard GPIO function for setting the GPIO names in
gpio-brcmstb
- fix a bunch of leaks and other bugs in gpio-mpc8xxx
- use generic pm callbacks in gpio-ml-ioh
- improve resource management and PM handling in gpio-mlxbf2
- modernize and improve the gpio-dwapb driver
- coding style improvements in gpio-rcar
- documentation fixes and improvements
- update the MAINTAINERS entry for gpio-zynq
- minor tweaks in several drivers"
* tag 'gpio-updates-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: (35 commits)
gpio: mpc8xxx: Use 'devm_gpiochip_add_data()' to simplify the code and avoid a leak
gpio: mpc8xxx: Fix a potential double iounmap call in 'mpc8xxx_probe()'
gpio: mpc8xxx: Fix a resources leak in the error handling path of 'mpc8xxx_probe()'
gpio: viperboard: remove platform_set_drvdata() call in probe
gpio: virtio: Add missing mailings lists in MAINTAINERS entry
gpio: virtio: Fix sparse warnings
gpio: remove the obsolete MX35 3DS BOARD MC9S08DZ60 GPIO functions
gpio: max730x: Use the right include
gpio: Add virtio-gpio driver
gpio: mlxbf2: Use DEFINE_RES_MEM_NAMED() helper macro
gpio: mlxbf2: Use devm_platform_ioremap_resource()
gpio: mlxbf2: Drop wrong use of ACPI_PTR()
gpio: mlxbf2: Convert to device PM ops
gpio: dwapb: Get rid of legacy platform data
mfd: intel_quark_i2c_gpio: Convert GPIO to use software nodes
gpio: dwapb: Read GPIO base from gpio-base property
gpio: dwapb: Unify ACPI enumeration checks in get_irq() and configure_irqs()
gpiolib: Deduplicate forward declaration in the consumer.h header
MAINTAINERS: update gpio-zynq.yaml reference
gpio: tegra186: Add ACPI support
...
Changes for kgdb/kdb this cycle are dominated by a change from
Sumit that removes as small (256K) private heap from kdb. This is
change I've hoped for ever since I discovered how few users of this
heap remained in the kernel, so many thanks to Sumit for hunting
these down. Other change is an incremental step towards SPDX headers.
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAmE2OuUACgkQfOMlXTn3
iKEcTQ/7BiMPnSzPeh7srNIG1a7/ig8CUtu+qMsrU50BUvCBzhas3+fGbzlZHV4p
GwCGTuRDWhVJjntpCkLYroSSF0h/AhN68pxqUx6EJUWEYMGuuMBuVtAdXYMSYbse
MW+Jw6Tec2fWPyw6lyWd0vwOKLzhf/pC8axbmGESzgdmZrsWeA4XEeF5XRHsHyJf
pL4GvBXN/mzH2TSQNQmSgda5VLnK3m6B66tlT17mywG8QTKwID8I4LXs9doNwqeW
6fkwJswAsCa+nO1GjMsRcA8HoUd7mtqLCYbjZzFVQdFJ6kF9eBKSPvgbaqr6oNvv
t6VBfm908+fM4/4yEQSuwVPT32IPe7I/Bk+aUqnSpPiXU8v75u0lIfB929O/DS07
X+NjqpXN7v6mEpLAFgWhqPyzTODw40OhpOi1uf9jzPB/OSNm9y1zs21FJQSx6KGa
AahMHrRoeZwlxNdD/2RBf9pQCSBWF5R1DyIye2iQX+e/jrvTXHMCK33tpcbcE9Cl
bMVqnnJ4Pnmul6hAAf9WlJduIrACV1Oq1iQrtXWE3wdFZdQuvqKvwHUK/Zko+O0f
cdW6neyW7Slo+8q2qXuEE0Bp+Ae61oZTrGIdfj29ZdXwp0+sBzcA5CFO7MKPJllV
yEq41Bxc3cgJPcsFMZWSaHcEAI5fq1iBmquVdCA0/vYbP9dsO1U=
=iMIF
-----END PGP SIGNATURE-----
Merge tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Changes for kgdb/kdb this cycle are dominated by a change from Sumit
that removes as small (256K) private heap from kdb. This is change
I've hoped for ever since I discovered how few users of this heap
remained in the kernel, so many thanks to Sumit for hunting these
down.
The other change is an incremental step towards SPDX headers"
* tag 'kgdb-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kernel: debug: Convert to SPDX identifier
kdb: Rename members of struct kdbtab_t
kdb: Simplify kdb_defcmd macro logic
kdb: Get rid of redundant kdb_register_flags()
kdb: Rename struct defcmd_set to struct kdb_macro
kdb: Get rid of custom debug heap allocator
This reverts commit 9857a17f20.
That commit was completely broken, and I should have caught on to it
earlier. But happily, the kernel test robot noticed the breakage fairly
quickly.
The breakage is because "try_get_page()" is about avoiding the page
reference count overflow case, but is otherwise the exact same as a
plain "get_page()".
In contrast, "try_get_compound_head()" is an entirely different beast,
and uses __page_cache_add_speculative() because it's not just about the
page reference count, but also about possibly racing with the underlying
page going away.
So all the commentary about how
"try_get_page() has fallen a little behind in terms of maintenance,
try_get_compound_head() handles speculative page references more
thoroughly"
was just completely wrong: yes, try_get_compound_head() handles
speculative page references, but the point is that try_get_page() does
not, and must not.
So there's no lack of maintainance - there are fundamentally different
semantics.
A speculative page reference would be entirely wrong in "get_page()",
and it's entirely wrong in "try_get_page()". It's not about
speculation, it's purely about "uhhuh, you can't get this page because
you've tried to increment the reference count too much already".
The reason the kernel test robot noticed this bug was that it hit the
VM_BUG_ON() in __page_cache_add_speculative(), which is all about
verifying that the context of any speculative page access is correct.
But since that isn't what try_get_page() is all about, the VM_BUG_ON()
tests things that are not correct to test for try_get_page().
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joakim Zhang reports that Wake-on-Lan with the stmmac ethernet driver broke
when moving the incorrect handling of mac link state out of mac_config().
This reason this breaks is because the stmmac's WoL is handled by the MAC
rather than the PHY, and phylink doesn't cater for that scenario.
This patch adds the necessary phylink code to handle suspend/resume events
according to whether the MAC still needs a valid link or not. This is the
barest minimum for this support.
Reported-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Tested-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE1EsEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppSqEADAaYVwbrNvdeCVNLiCiQb8u6mOP/y+wgIu
90aPaNEqpLKEx3hRkN0eOlLymkST2Z9iVXyOdwP0PHXnKygCd/mx9ZocRkO9uGs4
8vNUOK5eMWKtDg4BbWQDR8UxdL+la9udCZgLXWyoMQVij3apSIQlIWxenTZ0OVuJ
BOuxZ+T06NADkxMBOpB25TiOJkHSDvLOSfQySpJqJQKdpcOTFOqSEMfM9xjRUrju
hAxVgjwiovO0AacQ9yie+BgvHySPNggu6Ko63lrujA1JOceWQZXjAP1a3/V6Y0ke
kJ6cRZbhQLMCVDoEG8H+KxFwxJsfnryL3lEDCzs1wkaEksQiHB6SEetqjVz07M+C
S7FaENUwFnAcVTVOEK/geqotGOInUnIUcVCBdT8nLpeuco3cr81TYu1FF9Vv1cB9
O2N1o5dNhl91zbY5SlqjVKpBvJ2N62d4GPF6GcuVxHyqwd9ZtonQ3lB9eqP17Sqb
a6njGDmgUaPvlAF3xt6nPAm0a3AMgQivltvYyTwpS3AYfI6uRINI7K5kH0g4nL3F
CnX33dTVS25y4UBejxvzg2G6LDyat7DiMZadV3c6gN4rNatifAo8ySO6RKmbS8M2
IREF8Hxk81mPqGd1Svmv3E3MgQUngbPrx1zCe7Lh9Hygc65I06hW4gj760po2yVc
Q/dRtXfMcQ==
=U7lH
-----END PGP SIGNATURE-----
Merge tag 'libata-5.15-2021-09-05' of git://git.kernel.dk/linux-block
Pull libata fixes from Jens Axboe:
"Fixes for queued trim on certain Samsung SSDs, in conjunction with
certain ATI controllers"
* tag 'libata-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD.
libata: add ATA_HORKAGE_NO_NCQ_TRIM for Samsung 860 and 870 SSDs
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEz5eEQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmk1D/wML8Im2erR5s0PaWZgYxXlgEKrJDwJm/p+
2Uixrn/9kQAhwH+0kJnCiI+HwlL3LU+5/iAdeGtdYMcVaotPPmm5V3jfud8+RuAi
E+uIOdULXgQKj8pkiQ2h5mvYd0BxGkGH38gUqilSwFrY2HTpbfxreCHhYoQaE/7o
DiGNgbhJglSFIBuIgS4cfpLkI3FdaAmrCydZ9zaqEv/G/bx9aA9lwSbAJadhTbmt
Qc1vvbh2FB9YvgZX8qfaneyDKzQbwqTvKxCe2SOVMOp/X0feJym7WZUvrPr04EoZ
zBaLDkmn44re4iWPbide7+KQJ8NMQQDBiuxwF5WxdF3hrcsiwqmKgDtBEGWXFMeV
CUZ9Osrfb480UKsDExtxLhQqGz1JZqIPZdtDvSJb8MunPZtvTz27NNFyyb9aBrlX
WiwEHqAOE1W33buPCNyuYLGDVYis4/TkwF0NZpMwsyPdN0Iz/M8Z5F5BHhC7BYoP
U8KMsX3XvddxB113U+IMVqI/SuvT125U65brklQlQeLEHnH57ceII9mNGfNic6LR
bcIu7Fb5J1U5nAMeeLCSXsEYXs+peYgI1UOWXaWgSVixUAyU8H+OqsBVIl8eiMjr
TTbdIMmfWqENE3wBM709FQQLoMmGl1YjBkGmBXKZjNHcDrf9X56rimSxRD2i2okg
r2JczxQ5uQ==
=QoQg
-----END PGP SIGNATURE-----
Merge tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"As sometimes happens, two reports came in around the merge window open
that led to some fixes. Hence this one is a bit bigger than usual
followup fixes, but most of it will be going towards stable, outside
of the fixes that are addressing regressions from this merge window.
In detail:
- postgres is a heavy user of signals between tasks, and if we're
unlucky this can interfere with io-wq worker creation. Make sure
we're resilient against unrelated signal handling. This set of
changes also includes hardening against allocation failures, which
could previously had led to stalls.
- Some use cases that end up having a mix of bounded and unbounded
work would have starvation issues related to that. Split the
pending work lists to handle that better.
- Completion trace int -> unsigned -> long fix
- Fix issue with REGISTER_IOWQ_MAX_WORKERS and SQPOLL
- Fix regression with hash wait lock in this merge window
- Fix retry issued on block devices (Ming)
- Fix regression with links in this merge window (Pavel)
- Fix race with multi-shot poll and completions (Xiaoguang)
- Ensure regular file IO doesn't inadvertently skip completion
batching (Pavel)
- Ensure submissions are flushed after running task_work (Pavel)"
* tag 'for-5.15/io_uring-2021-09-04' of git://git.kernel.dk/linux-block:
io_uring: io_uring_complete() trace should take an integer
io_uring: fix possible poll event lost in multi shot mode
io_uring: prolong tctx_task_work() with flushing
io_uring: don't disable kiocb_done() CQE batching
io_uring: ensure IORING_REGISTER_IOWQ_MAX_WORKERS works with SQPOLL
io-wq: make worker creation resilient against signals
io-wq: get rid of FIXED worker flag
io-wq: only exit on fatal signals
io-wq: split bounded and unbounded work into separate lists
io-wq: fix queue stalling race
io_uring: don't submit half-prepared drain request
io_uring: fix queueing half-created requests
io-wq: ensure that hash wait lock is IRQ disabling
io_uring: retry in case of short read on block device
io_uring: IORING_OP_WRITE needs hash_reg_file set
io-wq: fix race between adding work and activating a free worker
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmEnfogPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDF9oQAINWHN1n30gsxcErMV8gH+XAyhDq2vTjkExQ
Qz5ddo4R5zeVkj0nkunFSK+W3xYz+W97X3I+IaiiHvk5D6dUatj37IyYlazX5iFT
7mbjTAqY7GRxfd6um7uK+CTRCApXY49GGkCVLGA5f+6mQ0JMVXaK9AKlsXKWUQLZ
JvLasUgKkseN6IEJWmPDNBdIeiKBTZloeZMdlM2vSm34HsuirSS5LmshdzJQzSk8
QSEqwXZX50afzJLNlB9Qa6V1tokjZVoYIBk0vAPO83tTh9HIyGL/PFAqBeq2rnWT
M19fFFbx5vizap4ICbpviLmZ5AOywCoBmbPBT79eMAJ53rOqHUJhU1y/3DoiVzxu
LJZI4wmGBQZVivOWOqyEZcNtTAagPLhyrLhMzYulBLwAjfFJmUHdSOxYtx+2Ysvr
SDIPN31FKWrvifTXTqJHDmaaXusi2CNZUOPzVSe2I14SbX+ZX2ny9DltlbRgPNuc
hGJagI5cZc0ngd4mAIzjjNmgBS2B+dSc8dOo71dRNJRLtQLiNHcAyQNJyFme+4xI
NpvpkvzxBAs8rG2X0YIR/Cz3W3yZoCYuQNcoPk7+F/bUTK47VocQCS+gLucHVLbT
H4286EV5n4nZ7E01oJ6uWnDnslPvrx9Sz2fxsrWYkBDR+xrz0EprrGsftFaILprz
Ic43uXfd
=LuHM
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.15
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
Add a new stat that counts the number of times a remote TLB flush is
requested, regardless of whether it kicks vCPUs out of guest mode. This
allows us to look at how often flushes are initiated.
Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
TLB flush implementation, so apply it there too.
Original-by: David Matlack <dmatlack@google.com>
Signed-off-by: Jing Zhang <jingzhangos@google.com>
Message-Id: <20210817002639.3856694-1-jingzhangos@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recent work on converting address list to a tree made it obvious
we need an abstraction around writing netdev->dev_addr. Without
such abstraction updating the main device address is invisible
to the core.
Introduce a number of helpers which for now just wrap memcpy()
but in the future can make necessary changes to the address
tree.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Simplifying the Kconfig use of FTRACE and TRACE_IRQFLAGS_SUPPORT
- bootconfig now can start histograms
- bootconfig supports group/all enabling
- histograms now can put values in linear size buckets
- execnames can be passed to synthetic events
- Introduction of "event probes" that attach to other events and
can retrieve data from pointers of fields, or record fields
as different types (a pointer to a string as a string instead
of just a hex number)
- Various fixes and clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYTJDixQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qnPLAP9XviWrZD27uFj6LU/Vp2umbq8la1aC
oW8o9itUGpLoHQD+OtsMpQXsWrxoNw/JD1OWCH4J0YN+TnZAUUG2E9e0twA=
=OZXG
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- simplify the Kconfig use of FTRACE and TRACE_IRQFLAGS_SUPPORT
- bootconfig can now start histograms
- bootconfig supports group/all enabling
- histograms now can put values in linear size buckets
- execnames can be passed to synthetic events
- introduce "event probes" that attach to other events and can retrieve
data from pointers of fields, or record fields as different types (a
pointer to a string as a string instead of just a hex number)
- various fixes and clean ups
* tag 'trace-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (35 commits)
tracing/doc: Fix table format in histogram code
selftests/ftrace: Add selftest for testing duplicate eprobes and kprobes
selftests/ftrace: Add selftest for testing eprobe events on synthetic events
selftests/ftrace: Add test case to test adding and removing of event probe
selftests/ftrace: Fix requirement check of README file
selftests/ftrace: Add clear_dynamic_events() to test cases
tracing: Add a probe that attaches to trace events
tracing/probes: Reject events which have the same name of existing one
tracing/probes: Have process_fetch_insn() take a void * instead of pt_regs
tracing/probe: Change traceprobe_set_print_fmt() to take a type
tracing/probes: Use struct_size() instead of defining custom macros
tracing/probes: Allow for dot delimiter as well as slash for system names
tracing/probe: Have traceprobe_parse_probe_arg() take a const arg
tracing: Have dynamic events have a ref counter
tracing: Add DYNAMIC flag for dynamic events
tracing: Replace deprecated CPU-hotplug functions.
MAINTAINERS: Add an entry for os noise/latency
tracepoint: Fix kerneldoc comments
bootconfig/tracing/ktest: Update ktest example for boot-time tracing
tools/bootconfig: Use per-group/all enable option in ftrace2bconf script
...
Pull MAP_DENYWRITE removal from David Hildenbrand:
"Remove all in-tree usage of MAP_DENYWRITE from the kernel and remove
VM_DENYWRITE.
There are some (minor) user-visible changes:
- We no longer deny write access to shared libaries loaded via legacy
uselib(); this behavior matches modern user space e.g. dlopen().
- We no longer deny write access to the elf interpreter after exec
completed, treating it just like shared libraries (which it often
is).
- We always deny write access to the file linked via /proc/pid/exe:
sys_prctl(PR_SET_MM_MAP/EXE_FILE) will fail if write access to the
file cannot be denied, and write access to the file will remain
denied until the link is effectivel gone (exec, termination,
sys_prctl(PR_SET_MM_MAP/EXE_FILE)) -- just as if exec'ing the file.
Cross-compiled for a bunch of architectures (alpha, microblaze, i386,
s390x, ...) and verified via ltp that especially the relevant tests
(i.e., creat07 and execve04) continue working as expected"
* tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux:
fs: update documentation of get_write_access() and friends
mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
mm: remove VM_DENYWRITE
binfmt: remove in-tree usage of MAP_DENYWRITE
kernel/fork: always deny write access to current MM exe_file
kernel/fork: factor out replacing the current MM exe_file
binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
In this cycle, we've addressed some performance issues such as lock contention,
misbehaving compress_cache, allowing extent_cache for compressed files, and new
sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
quickly, we also added an iostat which shows the IO latencies periodically. On
the stability side, we've found two memory leakage cases in the error path in
compression flow. And, we've also fixed various corner cases in fiemap, quota,
checkpoint=disable, zstd, and so on.
Enhancement:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fix:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code paths such
as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
=W67o
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this cycle, we've addressed some performance issues such as lock
contention, misbehaving compress_cache, allowing extent_cache for
compressed files, and new sysfs to adjust ra_size for fadvise.
In order to diagnose the performance issues quickly, we also added an
iostat which shows the IO latencies periodically.
On the stability side, we've found two memory leakage cases in the
error path in compression flow. And, we've also fixed various corner
cases in fiemap, quota, checkpoint=disable, zstd, and so on.
Enhancements:
- avoid long checkpoint latency by releasing nat_tree_lock
- collect and show iostats periodically
- support extent_cache for compressed files
- add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
- report f2fs GC status via sysfs
- add discard_unit=%s in mount option to handle zoned device
Bug fixes:
- fix two memory leakages when an error happens in the compressed IO flow
- fix commpress_cache to get the right LBA
- fix fiemap to deal with compressed case correctly
- fix wrong EIO returns due to SBI_NEED_FSCK
- fix missing writes when enabling checkpoint back
- fix quota deadlock
- fix zstd level mount option
In addition to the above major updates, we've cleaned up several code
paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
check, and typos"
* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
f2fs: should put a page beyond EOF when preparing a write
f2fs: deallocate compressed pages when error happens
f2fs: enable realtime discard iff device supports discard
f2fs: guarantee to write dirty data when enabling checkpoint back
f2fs: fix to unmap pages from userspace process in punch_hole()
f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
f2fs: fix to account missing .skipped_gc_rwsem
f2fs: adjust unlock order for cleanup
f2fs: Don't create discard thread when device doesn't support realtime discard
f2fs: rebuild nat_bits during umount
f2fs: introduce periodic iostat io latency traces
f2fs: separate out iostat feature
f2fs: compress: do sanity check on cluster
f2fs: fix description about main_blkaddr node
f2fs: convert S_IRUGO to 0444
f2fs: fix to keep compatibility of fault injection interface
f2fs: support fault injection for f2fs_kmem_cache_alloc()
f2fs: compress: allow write compress released file after truncate to zero
f2fs: correct comment in segment.h
f2fs: improve sbi status info in debugfs/f2fs/status
...
- New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
- Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers
-----BEGIN PGP SIGNATURE-----
iQIyBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmExP7AACgkQ18tUv7Cl
QOshTg/zBz7OfrS23CcLLgNidTJ6S7JOuj1DShG+YzsYXT8f9Nl1DadLM7yAEyok
6JZzC8rXYzJcmYztHZzRyTuzj1+tGGb0u/MrD0bBk42VEel6eOjH/Y9ybn12Gf/E
aqlcJh8hPx44U8oo5EFjRJsg2h28O06vywqhJz+sTbkqKN4hlAgMOo5ysAB+1thg
BrTlR84EKBw5QqxPJ1WPmq9tEyGebU9Yrj1p8f0Uf015IeRNeTOXx3NzmdPshphf
2yJvjumwEzqkcHXTJFDfP6ikIcGPPMNVAOK8DHb+vDGzNsOXW7dDM7GuWA3U8DlU
ZHvyyb05Wwe6Wwg8xwx90FEXcYZFfZbSKmI9z2uoOuGFzNG07zWzPDzRft+qrOvU
VMMwP9oEh71+qesmWTvqIbR2RjxqbCYlTcc8mBrD66DROi6jZ2jznraNC85sxG0Q
b8GE+2SnYr2Q25yehj2xrRlOXyiYNkeeYmIpIquEqH9o7cSyDNJhBWbzIv6x+ith
O/S06ZVKMc9X1nH5t5121XcHrSTMMVA/67WMyKfKMxWnrADAWPQALG+ttoTcbRu7
Txew3Jb+hB8+ZdHAqbPf1l1i+7USQl1CRHMw3GRvNjCL2qcjZb1R7eyJRSQQtUyw
q6SJRGe6Sn1FTUnn96Hv15Zy8VHx+q0cOL/EQVzL1RzJIXYcag==
=Ad/3
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"New Features:
- Better client responsiveness when server isn't replying
- Use refcount_t in sunrpc rpc_client refcount tracking
- Add srcaddr and dst_port to the sunrpc sysfs info files
- Add basic support for connection sharing between servers with multiple NICs`
Bugfixes and Cleanups:
- Sunrpc tracepoint cleanups
- Disconnect after ib_post_send() errors to avoid deadlocks
- Fix for tearing down rpcrdma_reps
- Fix a potential pNFS layoutget livelock loop
- pNFS layout barrier fixes
- Fix a potential memory corruption in rpc_wake_up_queued_task_set_status()
- Fix reconnection locking
- Fix return value of get_srcport()
- Remove rpcrdma_post_sends()
- Remove pNFS dead code
- Remove copy size restriction for inter-server copies
- Overhaul the NFS callback service
- Clean up sunrpc TCP socket shutdowns
- Always provide aligned buffers to RPC read layers"
* tag 'nfs-for-5.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (39 commits)
NFS: Always provide aligned buffers to the RPC read layers
NFSv4.1 add network transport when session trunking is detected
SUNRPC enforce creation of no more than max_connect xprts
NFSv4 introduce max_connect mount options
SUNRPC add xps_nunique_destaddr_xprts to xprt_switch_info in sysfs
SUNRPC keep track of number of transports to unique addresses
NFSv3: Delete duplicate judgement in nfs3_async_handle_jukebox
SUNRPC: Tweak TCP socket shutdown in the RPC client
SUNRPC: Simplify socket shutdown when not reusing TCP ports
NFSv4.2: remove restriction of copy size for inter-server copy.
NFS: Clean up the synopsis of callback process_op()
NFS: Extract the xdr_init_encode/decode() calls from decode_compound
NFS: Remove unused callback void decoder
NFS: Add a private local dispatcher for NFSv4 callback operations
SUNRPC: Eliminate the RQ_AUTHERR flag
SUNRPC: Set rq_auth_stat in the pg_authenticate() callout
SUNRPC: Add svc_rqst::rq_auth_stat
SUNRPC: Add dst_port to the sysfs xprt info file
SUNRPC: Add srcaddr as a file in sysfs
sunrpc: Fix return value of get_srcport()
...
syzbot found that forcing a big quantum attribute would crash hosts fast,
essentially using this:
tc qd replace dev eth0 root fq_codel quantum 4294967295
This is because fq_codel_dequeue() would have to loop
~2^31 times in :
if (flow->deficit <= 0) {
flow->deficit += q->quantum;
list_move_tail(&flow->flowchain, &q->old_flows);
goto begin;
}
SFQ max quantum is 2^19 (half a megabyte)
Lets adopt a max quantum of one megabyte for FQ_CODEL.
Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Protect nft_ct template with global mutex, from Pavel Skripkin.
2) Two recent commits switched inet rt and nexthop exception hashes
from jhash to siphash. If those two spots are problematic then
conntrack is affected as well, so switch voer to siphash too.
While at it, add a hard upper limit on chain lengths and reject
insertion if this is hit. Patches from Florian Westphal.
3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN,
from Benjamin Hesmans.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
netfilter: socket: icmp6: fix use-after-scope
netfilter: refuse insertion if chain has grown too large
netfilter: conntrack: switch to siphash
netfilter: conntrack: sanitize table size default settings
netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex
====================
Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It currently takes a long, and while that's normally OK, the io_uring
limit is an int. Internally in io_uring it's an int, but sometimes it's
passed as a long. That can yield confusing results where a completions
seems to generate a huge result:
ou-sqp-1297-1298 [001] ...1 788.056371: io_uring_complete: ring 000000000e98e046, user_data 0x0, result 4294967171, cflags 0
which is due to -ECANCELED being stored in an unsigned, and then passed
in as a long. Using the right int type, the trace looks correct:
iou-sqp-338-339 [002] ...1 15.633098: io_uring_complete: ring 00000000e0ac60cf, user_data 0x0, result -125, cflags 0
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmExXHoVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGAZwP/iHdEZzuQ4cz2uXUaV0fevj9jjPU
zJ8wrrNabAiT6f5x861DsARQSR4OSt3zN0tyBNgZwUdotbe7ED5GegrgIUBMWlML
QskhTEIZj7TexAX/20vx671gtzI3JzFg4c9BuriXCFRBvychSevdJPr65gMDOesL
vOJnXe+SGXG2+fPWi/PxrcOItNRcveqo2GiWHT3g0Cv/DJUulu81gEkz3hrufnMR
cjMeSkV0nJJcvI755OQBOUnEuigW64k4m2WxHPG24tU8cQOCqV6lqwOfNQBAn4+F
OoaCMyPQT9gvGYwGExQMCXGg0wbUt1qnxzOVoA2qFCwbo+MFhqjBvPXab6VJm7CE
mY3RrTtvxSqBdHI6EGcYeLjhycK9b+LLoJ1qc3S9FK8It6NoFFp4XV0R6ItPBls7
mWi9VSpyI6k0AwLq+bGXEHvaX/bnnf/vfqn8H+w6mRZdXjFV8EB2DiOSRX/OqjVG
RnvTtXzWWThLyXvWR3Jox4+7X6728oL7akLemoeZI6oTbJDm7dQgwpz5HbSyHXLh
d+gUF3Y/6lqxT5N9GSVDxpD1bEMh2I7nGQ4M7WGbGas/3yUemF8wbBqGQo4a+YeD
d9vGAUxDp2PQTtL2sjFo5Gd4PZEM9g7vwWzRvHe0o5NxKEXcBg25b8cD1hxrN9Y4
Y1AAnc0kLO+My3PC
=lw3M
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Add -s option (strict mode) to merge_config.sh to make it fail when
any symbol is redefined.
- Show a warning if a different compiler is used for building external
modules.
- Infer --target from ARCH for CC=clang to let you cross-compile the
kernel without CROSS_COMPILE.
- Make the integrated assembler default (LLVM_IAS=1) for CC=clang.
- Add <linux/stdarg.h> to the kernel source instead of borrowing
<stdarg.h> from the compiler.
- Add Nick Desaulniers as a Kbuild reviewer.
- Drop stale cc-option tests.
- Fix the combination of CONFIG_TRIM_UNUSED_KSYMS and CONFIG_LTO_CLANG
to handle symbols in inline assembly.
- Show a warning if 'FORCE' is missing for if_changed rules.
- Various cleanups
* tag 'kbuild-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kbuild: redo fake deps at include/ksym/*.h
kbuild: clean up objtool_args slightly
modpost: get the *.mod file path more simply
checkkconfigsymbols.py: Fix the '--ignore' option
kbuild: merge vmlinux_link() between ARCH=um and other architectures
kbuild: do not remove 'linux' link in scripts/link-vmlinux.sh
kbuild: merge vmlinux_link() between the ordinary link and Clang LTO
kbuild: remove stale *.symversions
kbuild: remove unused quiet_cmd_update_lto_symversions
gen_compile_commands: extract compiler command from a series of commands
x86: remove cc-option-yn test for -mtune=
arc: replace cc-option-yn uses with cc-option
s390: replace cc-option-yn uses with cc-option
ia64: move core-y in arch/ia64/Makefile to arch/ia64/Kbuild
sparc: move the install rule to arch/sparc/Makefile
security: remove unneeded subdir-$(CONFIG_...)
kbuild: sh: remove unused install script
kbuild: Fix 'no symbols' warning when CONFIG_TRIM_UNUSD_KSYMS=y
kbuild: Switch to 'f' variants of integrated assembler flag
kbuild: Shuffle blank line to improve comment meaning
...
Including:
- New DART IOMMU driver for Apple Silicon M1 chips.
- Optimizations for iommu_[map/unmap] performance
- Selective TLB flush support for the AMD IOMMU driver to make
it more efficient on emulated IOMMUs.
- Rework IOVA setup and default domain type setting to move more
code out of IOMMU drivers and to support runtime switching
between certain types of default domains.
- VT-d Updates from Lu Baolu:
- Update the virtual command related registers
- Enable Intel IOMMU scalable mode by default
- Preset A/D bits for user space DMA usage
- Allow devices to have more than 32 outstanding PRs
- Various cleanups
- ARM SMMU Updates from Will Deacon:
- SMMUv3: Minor optimisation to avoid zeroing struct members on CMD submission
- SMMUv3: Increased use of batched commands to reduce submission latency
- SMMUv3: Refactoring in preparation for ECMDQ support
- SMMUv2: Fix races when probing devices with identical StreamIDs
- SMMUv2: Optimise walk cache flushing for Qualcomm implementations
- SMMUv2: Allow deep sleep states for some Qualcomm SoCs with shared clocks
- Various smaller optimizations, cleanups, and fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmEyKYAACgkQK/BELZcB
GuOzAxAAnJ02PG07BnFFFGN2/o3eVON4LQUXquMePjcZ8A8oQf073jO/ybWNnpJK
5V+DHRg2CAugFHks/EwIrxFXAWZuStrcnk81d8t6T6ROQl47Zv1qshksUTnsDQnz
V7mQ1P/pcsBwCUf73aD9ncLmLkiuTVfHKoKfe3gHeQI+H+2Lw4ijzB8kIwUqhkHI
heJZLDmO87S2Mr7zlCmMQH5R550fHrTKSbUCx9QqFu3GgWsjkU+3u1S17xR1bEoW
hmhJhyAw+MLrSgdeG4U9o+6AcQuRELEHfVSq7PtDxQ6hEVziGYGY4Nk+YiEcXFiv
mu9qfEkaP/2QOKszvks+nhHrwDnJ9WLnEEskiEFwjsaFauIKsRscfYVUBTWeYXJT
9t/PVngigWLDhGO0NEPthQvJExvJJs1MQQ72CcA6dd0XdGpN+aRglIUWUJP/nQHd
doAx4/1YWnHVkWWUef8NgmVvlHdoXjA7vy4QGL9FYCqV6ImfhAkJYKJ99X6Ovlmk
gje/Kx+5wUPT2nXNbTkjalIylyUNpugMY4xD7K06VXjvRMUf2SbYNDQxYJaDDld6
nDt0F0NvEyrj7HO8egwIZbX3MOikhMGHur48yEyCTbm+9oHQffkODq1o4OfuxJh2
nq0G5Plln9CEmhQVwzibcPSNlYPe8AZbbXqQ9DrJFusEpYj+01c=
=zVaQ
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
- New DART IOMMU driver for Apple Silicon M1 chips
- Optimizations for iommu_[map/unmap] performance
- Selective TLB flush support for the AMD IOMMU driver to make it more
efficient on emulated IOMMUs
- Rework IOVA setup and default domain type setting to move more code
out of IOMMU drivers and to support runtime switching between certain
types of default domains
- VT-d Updates from Lu Baolu:
- Update the virtual command related registers
- Enable Intel IOMMU scalable mode by default
- Preset A/D bits for user space DMA usage
- Allow devices to have more than 32 outstanding PRs
- Various cleanups
- ARM SMMU Updates from Will Deacon:
SMMUv3:
- Minor optimisation to avoid zeroing struct members on CMD submission
- Increased use of batched commands to reduce submission latency
- Refactoring in preparation for ECMDQ support
SMMUv2:
- Fix races when probing devices with identical StreamIDs
- Optimise walk cache flushing for Qualcomm implementations
- Allow deep sleep states for some Qualcomm SoCs with shared clocks
- Various smaller optimizations, cleanups, and fixes
* tag 'iommu-updates-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (85 commits)
iommu/io-pgtable: Abstract iommu_iotlb_gather access
iommu/arm-smmu: Fix missing unlock on error in arm_smmu_device_group()
iommu/vt-d: Add present bit check in pasid entry setup helpers
iommu/vt-d: Use pasid_pte_is_present() helper function
iommu/vt-d: Drop the kernel doc annotation
iommu/vt-d: Allow devices to have more than 32 outstanding PRs
iommu/vt-d: Preset A/D bits for user space DMA usage
iommu/vt-d: Enable Intel IOMMU scalable mode by default
iommu/vt-d: Refactor Kconfig a bit
iommu/vt-d: Remove unnecessary oom message
iommu/vt-d: Update the virtual command related registers
iommu: Allow enabling non-strict mode dynamically
iommu: Merge strictness and domain type configs
iommu: Only log strictness for DMA domains
iommu: Expose DMA domain strictness via sysfs
iommu: Express DMA strictness via the domain type
iommu/vt-d: Prepare for multiple DMA domain types
iommu/arm-smmu: Prepare for multiple DMA domain types
iommu/amd: Prepare for multiple DMA domain types
iommu: Introduce explicit type for non-strict DMA domains
...
Pull swiotlb updates from Konrad Rzeszutek Wilk:
"A new feature called restricted DMA pools. It allows SWIOTLB to
utilize per-device (or per-platform) allocated memory pools instead of
using the global one.
The first big user of this is ARM Confidential Computing where the
memory for DMA operations can be set per platform"
* 'stable/for-linus-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb: (23 commits)
swiotlb: use depends on for DMA_RESTRICTED_POOL
of: restricted dma: Don't fail device probe on rmem init failure
of: Move of_dma_set_restricted_buffer() into device.c
powerpc/svm: Don't issue ultracalls if !mem_encrypt_active()
s390/pv: fix the forcing of the swiotlb
swiotlb: Free tbl memory in swiotlb_exit()
swiotlb: Emit diagnostic in swiotlb_exit()
swiotlb: Convert io_default_tlb_mem to static allocation
of: Return success from of_dma_set_restricted_buffer() when !OF_ADDRESS
swiotlb: add overflow checks to swiotlb_bounce
swiotlb: fix implicit debugfs declarations
of: Add plumbing for restricted DMA pool
dt-bindings: of: Add restricted DMA pool
swiotlb: Add restricted DMA pool initialization
swiotlb: Add restricted DMA alloc/free support
swiotlb: Refactor swiotlb_tbl_unmap_single
swiotlb: Move alloc_size to swiotlb_find_slots
swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing
swiotlb: Update is_swiotlb_active to add a struct device argument
swiotlb: Update is_swiotlb_buffer to add a struct device argument
...
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
Split off from prev patch in the series that implements the syscall.
Link: https://lkml.kernel.org/r/20210809185259.405936-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Jan Engelhardt <jengelh@inai.de>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.
memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.
Replace the calls to memblock_find_in_range() with an equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range() and make
memblock_find_in_range() private method of memblock.
This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().
Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI]
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv]
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Introduce multi-preference mempolicy", v7.
This patch series introduces the concept of the MPOL_PREFERRED_MANY
mempolicy. This mempolicy mode can be used with either the
set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED
interface, it allows an application to set a preference for nodes which
will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode,
it takes a set of nodes. Like the MPOL_BIND interface, it works over a
set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
OOM killer if those preferred nodes are not available.
Along with these patches are patches for libnuma, numactl, numademo, and
memhog. They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new
usage: `numactl -P 0,3,4`
The goal of the new mode is to enable some use-cases when using tiered memory
usage models which I've lovingly named.
1a. The Hare - The interconnect is fast enough to meet bandwidth and
latency requirements allowing preference to be given to all nodes with
"fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast
memory (or perhaps slow memory), but doesn't care which node it runs
on. The application can prefer a set of nodes and then xpu bind to
the local node (cpu, accelerator, etc). This reverses the nodes are
chosen today where the kernel attempts to use local memory to the CPU
whenever possible. This will attempt to use the local accelerator to
the memory.
2. The Tortoise - The administrator (or the application itself) is
aware it only needs slow memory, and so can prefer that.
Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fallback to another set of nodes if
binding fails to all nodes in the nodemask.
Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
preference.
> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> /* Mimic interleave policy, but have fallback *.
> const unsigned long nodes = 0xaa
> set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added
complexity is nod needed for expected usecases.
2. A flag for bind to allow falling back to other nodes. This
confuses the notion of binding and is less flexible than the current
solution.
3. Create flags or new modes that helps with some ordering. This
offers both a friendlier API as well as a solution for more customized
usage. It's unknown if it's worth the complexity to support this.
Here is sample code for how this might work:
> // Prefer specific nodes for some something wacky
> set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
This patch (of 5):
The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask. If more than one bit it set, bits after the first
are ignored.
This single node is generally OK for location-based NUMA where memory
being allocated will eventually be operated on by a single CPU. However,
in systems with multiple memory types, folks want to target a *type* of
memory instead of a location. For instance, someone might want some
high-bandwidth memory but do not care about the CPU next to which it is
allocated. Or, they want a cheap, high capacity allocation and want to
target all NUMA nodes which have persistent memory in volatile mode. In
both of these cases, the application wants to target a *set* of nodes, but
does not want strict MPOL_BIND behavior as that could lead to OOM killer
or SIGSEGV.
So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
requirement. This is not a pie-in-the-sky dream for an API. This was a
response to a specific ask of more than one group at Intel. Specifically:
1. There are existing libraries that target memory types such as
https://github.com/memkind/memkind. These are known to suffer from
SIGSEGV's when memory is low on targeted memory "kinds" that span more
than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
example of this.
2. Volatile-use persistent memory users want to have a memory policy
which is targeted at either "cheap and slow" (PMEM) or "expensive and
fast" (DRAM). However, they do not want to experience allocation
failures when the targeted type is unavailable.
3. Allocate-then-run. Generally, we let the process scheduler decide
on which physical CPU to run a task. That location provides a default
allocation policy, and memory availability is not generally considered
when placing tasks. For situations where memory is valuable and
constrained, some users want to allocate memory first, *then* allocate
close compute resources to the allocation. This is the reverse of the
normal (CPU) model. Accelerators such as GPUs that operate on
core-mm-managed memory are interested in this model.
A check is added in sanitize_mpol_flags() to not permit 'prefer_many'
policy to be used for now, and will be removed in later patch after all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]
Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>b
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The proactive compaction[1] gets triggered for every 500msec and run
compaction on the node for COMPACTION_HPAGE_ORDER (usually order-9) pages
based on the value set to sysctl.compaction_proactiveness. Triggering the
compaction for every 500msec in search of COMPACTION_HPAGE_ORDER pages is
not needed for all applications, especially on the embedded system
usecases which may have few MB's of RAM. Enabling the proactive
compaction in its state will endup in running almost always on such
systems.
Other side, proactive compaction can still be very much useful for getting
a set of higher order pages in some controllable manner(controlled by
using the sysctl.compaction_proactiveness). So, on systems where enabling
the proactive compaction always may proove not required, can trigger the
same from user space on write to its sysctl interface. As an example, say
app launcher decide to launch the memory heavy application which can be
launched fast if it gets more higher order pages thus launcher can prepare
the system in advance by triggering the proactive compaction from
userspace.
This triggering of proactive compaction is done on a write to
sysctl.compaction_proactiveness by user.
[1]https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=facdaa917c4d5a376d09d25865f5a863f906234a
[akpm@linux-foundation.org: tweak vm.rst, per Mike]
Link: https://lkml.kernel.org/r/1627653207-12317-1-git-send-email-charante@codeaurora.org
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The return value of kswapd_run() is unused now. Clean it up.
Link: https://lkml.kernel.org/r/20210717065911.61497-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can get memcg directly form vmpr instead of vmpr->memcg->css->memcg, so
add a new func helper vmpressure_to_memcg(). And no code will use
vmpressure_to_css(), so delete it.
Link: https://lkml.kernel.org/r/20210630112146.455103-1-suhui@zeku.com
Signed-off-by: Hui Su <suhui@zeku.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some method is obviously needed to enable reclaim-based migration.
Just like traditional autonuma, there will be some workloads that will
benefit like workloads with more "static" configurations where hot pages
stay hot and cold pages stay cold. If pages come and go from the hot and
cold sets, the benefits of this approach will be more limited.
The benefits are truly workload-based and *not* hardware-based. We do not
believe that there is a viable threshold where certain hardware
configurations should have this mechanism enabled while others do not.
To be conservative, earlier work defaulted to disable reclaim- based
migration and did not include a mechanism to enable it. This proposes add
a new sysfs file
/sys/kernel/mm/numa/demotion_enabled
as a method to enable it.
We are open to any alternative that allows end users to enable this
mechanism or disable it if workload harm is detected (just like
traditional autonuma).
Once this is enabled page demotion may move data to a NUMA node that does
not fall into the cpuset of the allocating process. This could be
construed to violate the guarantees of cpusets. However, since this is an
opt-in mechanism, the assumption is that anyone enabling it is content to
relax the guarantees.
Link: https://lkml.kernel.org/r/20210721063926.3024591-9-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-10-ying.huang@intel.com
Signed-off-by: Huang Ying <ying.huang@intel.com>
Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Account the number of demoted pages.
Add pgdemote_kswapd and pgdemote_direct VM counters showed in
/proc/vmstat.
[ daveh:
- __count_vm_events() a bit, and made them look at the THP
size directly rather than getting data from migrate_pages()
]
Link: https://lkml.kernel.org/r/20210721063926.3024591-5-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-6-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is mostly derived from a patch from Yang Shi:
https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/
Add code to the reclaim path (shrink_page_list()) to "demote" data to
another NUMA node instead of discarding the data. This always avoids the
cost of I/O needed to read the page back in and sometimes avoids the
writeout cost when the page is dirty.
A second pass through shrink_page_list() will be made if any demotions
fail. This essentially falls back to normal reclaim behavior in the case
that demotions fail. Previous versions of this patch may have simply
failed to reclaim pages which were eligible for demotion but were unable
to be demoted in practice.
For some cases, for example, MADV_PAGEOUT, the pages are always discarded
instead of demoted to follow the kernel API definition. Because
MADV_PAGEOUT is defined as freeing specified pages regardless in which
tier they are.
Note: This just adds the start of infrastructure for migration. It is
actually disabled next to the FIXME in migrate_demote_page_ok().
[dave.hansen@linux.intel.com: v11]
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210721063926.3024591-4-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Wei Xu <weixugc@google.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Under normal circumstances, migrate_pages() returns the number of pages
migrated. In error conditions, it returns an error code. When returning
an error code, there is no way to know how many pages were migrated or not
migrated.
Make migrate_pages() return how many pages are demoted successfully for
all cases, including when encountering errors. Page reclaim behavior will
depend on this in subsequent patches.
Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter]
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "userfaultfd: minor bug fixes".
Three unrelated bug fixes. The first two addresses possible issues (not
too theoretical ones), but I did not encounter them in practice.
The third patch addresses a test bug that causes the test to fail on my
system. It has been sent before as part of a bigger RFC.
This patch (of 3):
mmap_changing is currently a boolean variable, which is set and cleared
without any lock that protects against concurrent modifications.
mmap_changing is supposed to mark whether userfaultfd page-faults handling
should be retried since mappings are undergoing a change. However,
concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
mmap_changing to be false, although the remove event was still not read
(hence acknowledged) by the user.
Change mmap_changing to atomic_t and increase/decrease appropriately. Add
a debug assertion to see whether mmap_changing is negative.
Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
Fixes: df2cc96e77 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guillaume Morin reported hitting the following WARNING followed by GPF or
NULL pointer deference either in cgroups_destroy or in the kill_css path.:
percpu ref (css_release) <= 0 (-1) after switching to atomic
WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130
CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G O 5.10.60 #1
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130
Call Trace:
rcu_core+0x30f/0x530
rcu_core_si+0xe/0x10
__do_softirq+0x103/0x2a2
run_ksoftirqd+0x2b/0x40
smpboot_thread_fn+0x11a/0x170
kthread+0x10a/0x140
ret_from_fork+0x22/0x30
Upon further examination, it was discovered that the css structure was
associated with hugetlb reservations.
For private hugetlb mappings the vma points to a reserve map that
contains a pointer to the css. At mmap time, reservations are set up
and a reference to the css is taken. This reference is dropped in the
vma close operation; hugetlb_vm_op_close. However, if a vma is split no
additional reference to the css is taken yet hugetlb_vm_op_close will be
called twice for the split vma resulting in an underflow.
Fix by taking another reference in hugetlb_vm_op_open. Note that the
reference is only taken for the owner of the reserve map. In the more
common fork case, the pointer to the reserve map is cleared for
non-owning vmas.
Link: https://lkml.kernel.org/r/20210830215015.155224-1-mike.kravetz@oracle.com
Fixes: e9fe92ae0c ("hugetlb_cgroup: add reservation accounting for private mappings")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Suggested-by: Guillaume Morin <guillaume@morinfr.org>
Tested-by: Guillaume Morin <guillaume@morinfr.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline. But
if the page is not slab page all the effort is wasted in vain. Even
though it is a slab page, it is not guaranteed the page could be freed
at all.
However the side effect and cost is quite high. It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches. It could make the most
workingset gone in order to just offline a page. And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.
Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.
Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:
BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
in-flight: 409271:memory_failure_work_func
pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
workqueue mm_percpu_wq: flags=0x8
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
pending: vmstat_update
workqueue cgroup_destroy: flags=0x0
pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
pending: css_release_work_fn
There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1
Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
Workqueue: events memory_failure_work_func
RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
Call Trace:
__queue_work+0xd6/0x3c0
queue_work_on+0x1c/0x30
uncharge_batch+0x10e/0x110
mem_cgroup_uncharge_list+0x6d/0x80
release_pages+0x37f/0x3f0
__pagevec_release+0x1c/0x50
__invalidate_mapping_pages+0x348/0x380
inode_lru_isolate+0x10a/0x160
__list_lru_walk_one+0x7b/0x170
list_lru_walk_one+0x4a/0x60
prune_icache_sb+0x37/0x50
super_cache_scan+0x123/0x1a0
do_shrink_slab+0x10c/0x2c0
shrink_slab+0x1f1/0x290
drop_slab_node+0x4d/0x70
soft_offline_page+0x1ac/0x5b0
memory_failure_work_func+0x6a/0x90
process_one_work+0x19e/0x340
worker_thread+0x30/0x360
kthread+0x116/0x130
The lockup made the machine is quite unusable. And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.
But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:
soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.
Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: David Mackey <tdmackey@twitter.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bit 3
and 4 can be overlapped by sub-field for early NID, and can be
unexpectedly set on NUMA systems. There are a few non-critical issues
related to this:
- Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces
pfn_to_online_page() through the slow path, but doesn't actually break
the kernel.
- A kdump generation tool like makedumpfile uses this field to calculate
the physical address to read. So wrong bits can make the tool access to
wrong address and fail to create kdump. This can be avoided by the
tool, so it's not critical.
To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of
available bits of section flag field.
Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev
Fixes: 1f90a3477d ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Kazuhito Hagio <k-hagio-ab@nec.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wang Wensheng <wangwensheng4@huawei.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Kazu <k-hagio-ab@nec.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As the last users of __section_nr() are gone, let's remove unused function
__section_nr().
Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts
mem_section to section_nr could be costly since it iterates all section
roots to check if the given mem_section is in its range.
On the other hand, __nr_to_section() which converts section_nr to
mem_section can be done in O(1).
Let's pass section_nr instead of mem_section ptr to find_memory_block() in
order to reduce needless iterations.
Link: https://lkml.kernel.org/r/20210707150212.855-3-ohoono.kwon@samsung.com
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fault_in_pages_writeable() and fault_in_pages_readable() treat the size
parameter as unsigned, doing pointer math with the value, so make this
explicit and set it to be a size_t type which all callers currently treat
it as anyway.
This solves the issue where static checkers get nervous seeing pointer
arithmetic happening with a signed value.
Link: https://lkml.kernel.org/r/20210727111136.457638-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Jordy Zomer <jordy@pwning.systems>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.
The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages. Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:
1) an optimization to defer the cache flush for page cache pages not
mapped into userspace
2) additional flushing for mapped page cache pages if cache aliases
are possible
Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 2d146aa3aa ("mm: memcontrol: switch to rstat"), last user
of memcg_stat_item_in_bytes() is gone. And since commit fa40d1ee9f
("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only
the declaration of mem_cgroup_select_victim_node() is remained here.
Remove them.
Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We used to have per-cpu memcg and lruvec stats and the readers have to
traverse and sum the stats from each cpu. This summing was racy and may
expose transient negative values. So, an explicit check was added to
avoid such scenarios. Now these stats are moved to rstat infrastructure
and are no more per-cpu, so we can remove the fixup for transient negative
values.
Link: https://lkml.kernel.org/r/20210728012243.3369123-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At the moment memcg stats are read in four contexts:
1. memcg stat user interfaces
2. dirty throttling
3. page fault
4. memory reclaim
Currently the kernel flushes the stats for first two cases. Flushing the
stats for remaining two casese may have performance impact. Always
flushing the memcg stats on the page fault code path may negatively
impacts the performance of the applications. In addition flushing in the
memory reclaim code path, though treated as slowpath, can become the
source of contention for the global lock taken for stat flushing because
when system or memcg is under memory pressure, many tasks may enter the
reclaim path.
This patch uses following mechanisms to solve these challenges:
1. Periodically flush the stats from root memcg every 2 seconds. This
will time limit the out of sync stats.
2. Asynchronously flush the stats after fixed number of stat updates.
In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for
2 seconds.
3. For avoiding thundering herd to flush the stats particularly from
the memory reclaim context, introduce memcg local spinlock and let only
one flusher active at a time. This could have been done through
cgroup_rstat_lock lock but that lock is used by other subsystem and for
userspace reading memcg stats. So, it is better to keep flushers
introduced by this patch decoupled from cgroup_rstat_lock. However we
would have to use irqsafe version of rstat flush but that is fine as
this code path will be flushing for whole tree and do the work for
everyone. No one will be waiting for that worker.
[shakeelb@google.com: fix sleep-in-wrong context bug]
Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com
Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit 2d146aa3aa ("mm: memcontrol: switch to rstat") switched memcg
stats to rstat infrastructure but skipped the conversion of the lruvec
stats as such stats are read in the performance critical code paths and
flushing stats may have impacted the performances of the applications.
This patch converts the lruvec stats to rstat and later patches add
mechanisms to keep the performance impact to minimum.
The rstat conversion comes with the price i.e. memory cost. Effectively
this patch reverts the savings done by the commit f3344adf38 ("mm:
memcontrol: optimize per-lruvec stats counter memory usage"). However
this cost is justified due to negative impact of the inaccurate lruvec
stats on many heuristics. One such case is reported in [1].
The memory reclaim code is filled with plethora of heuristics and many of
those heuristics reads the lruvec stats. So, inaccurate stats can make
such heuristics ineffective. [1] reports the impact of inaccurate lruvec
stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can
impact the deactivation and aging anon heuristics as well.
[1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/
Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com
Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and
cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static
key check inline before calling the main body of the function. This
minimizes the memcg overhead in the pagefault and exit_mmap paths when
memcgs are disabled using cgroup_disable=memory command-line option. This
change results in ~1% overhead reduction when running PFT test [1]
comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory}
configuration on an 8-core ARM64 Android device.
[1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite
Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>