linux-xiaomi-chiron

Author	SHA1	Message	Date
Yang Shi	668e4147d8	mm/vmscan: add page demotion counter Account the number of demoted pages. Add pgdemote_kswapd and pgdemote_direct VM counters showed in /proc/vmstat. [ daveh: - __count_vm_events() a bit, and made them look at the THP size directly rather than getting data from migrate_pages() ] Link: https://lkml.kernel.org/r/20210721063926.3024591-5-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-6-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Dave Hansen	26aa2d199d	mm/migrate: demote pages during reclaim This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of discarding the data. This always avoids the cost of I/O needed to read the page back in and sometimes avoids the writeout cost when the page is dirty. A second pass through shrink_page_list() will be made if any demotions fail. This essentially falls back to normal reclaim behavior in the case that demotions fail. Previous versions of this patch may have simply failed to reclaim pages which were eligible for demotion but were unable to be demoted in practice. For some cases, for example, MADV_PAGEOUT, the pages are always discarded instead of demoted to follow the kernel API definition. Because MADV_PAGEOUT is defined as freeing specified pages regardless in which tier they are. Note: This just adds the start of infrastructure for migration. It is actually disabled next to the FIXME in migrate_demote_page_ok(). [dave.hansen@linux.intel.com: v11] Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210721063926.3024591-4-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-5-ying.huang@intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Wei Xu <weixugc@google.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Yang Shi	5ac95884a7	mm/migrate: enable returning precise migrate_pages() success count Under normal circumstances, migrate_pages() returns the number of pages migrated. In error conditions, it returns an error code. When returning an error code, there is no way to know how many pages were migrated or not migrated. Make migrate_pages() return how many pages are demoted successfully for all cases, including when encountering errors. Page reclaim behavior will depend on this in subsequent patches. Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Oscar Salvador <osalvador@suse.de> [optional parameter] Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Xu <weixugc@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Nadav Amit	a759a909d4	userfaultfd: change mmap_changing to atomic Patch series "userfaultfd: minor bug fixes". Three unrelated bug fixes. The first two addresses possible issues (not too theoretical ones), but I did not encounter them in practice. The third patch addresses a test bug that causes the test to fail on my system. It has been sent before as part of a bigger RFC. This patch (of 3): mmap_changing is currently a boolean variable, which is set and cleared without any lock that protects against concurrent modifications. mmap_changing is supposed to mark whether userfaultfd page-faults handling should be retried since mappings are undergoing a change. However, concurrent calls, for instance to madvise(MADV_DONTNEED), might cause mmap_changing to be false, although the remove event was still not read (hence acknowledged) by the user. Change mmap_changing to atomic_t and increase/decrease appropriately. Add a debug assertion to see whether mmap_changing is negative. Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com Fixes: `df2cc96e77` ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Mike Kravetz	09a26e8327	hugetlb: fix hugetlb cgroup refcounting during vma split Guillaume Morin reported hitting the following WARNING followed by GPF or NULL pointer deference either in cgroups_destroy or in the kill_css path.: percpu ref (css_release) <= 0 (-1) after switching to atomic WARNING: CPU: 23 PID: 130 at lib/percpu-refcount.c:196 percpu_ref_switch_to_atomic_rcu+0x127/0x130 CPU: 23 PID: 130 Comm: ksoftirqd/23 Kdump: loaded Tainted: G O 5.10.60 #1 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x127/0x130 Call Trace: rcu_core+0x30f/0x530 rcu_core_si+0xe/0x10 __do_softirq+0x103/0x2a2 run_ksoftirqd+0x2b/0x40 smpboot_thread_fn+0x11a/0x170 kthread+0x10a/0x140 ret_from_fork+0x22/0x30 Upon further examination, it was discovered that the css structure was associated with hugetlb reservations. For private hugetlb mappings the vma points to a reserve map that contains a pointer to the css. At mmap time, reservations are set up and a reference to the css is taken. This reference is dropped in the vma close operation; hugetlb_vm_op_close. However, if a vma is split no additional reference to the css is taken yet hugetlb_vm_op_close will be called twice for the split vma resulting in an underflow. Fix by taking another reference in hugetlb_vm_op_open. Note that the reference is only taken for the owner of the reserve map. In the more common fork case, the pointer to the reserve map is cleared for non-owning vmas. Link: https://lkml.kernel.org/r/20210830215015.155224-1-mike.kravetz@oracle.com Fixes: `e9fe92ae0c` ("hugetlb_cgroup: add reservation accounting for private mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Guillaume Morin <guillaume@morinfr.org> Suggested-by: Guillaume Morin <guillaume@morinfr.org> Tested-by: Guillaume Morin <guillaume@morinfr.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:16 -07:00
Yang Shi	d0505e9f7d	mm: hwpoison: don't drop slab caches for offlining non-LRU page In the current implementation of soft offline, if non-LRU page is met, all the slab caches will be dropped to free the page then offline. But if the page is not slab page all the effort is wasted in vain. Even though it is a slab page, it is not guaranteed the page could be freed at all. However the side effect and cost is quite high. It does not only drop the slab caches, but also may drop a significant amount of page caches which are associated with inode caches. It could make the most workingset gone in order to just offline a page. And the offline is not guaranteed to succeed at all, actually I really doubt the success rate for real life workload. Furthermore the worse consequence is the system may be locked up and unusable since the page cache release may incur huge amount of works queued for memcg release. Actually we ran into such unpleasant case in our production environment. Firstly, the workqueue of memory_failure_work_func is locked up as below: BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15 in-flight: 409271:memory_failure_work_func pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work workqueue mm_percpu_wq: flags=0x8 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 pending: vmstat_update workqueue cgroup_destroy: flags=0x0 pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072 pending: css_release_work_fn There were over 12K css_release_work_fn queued, and this caused a few lockups due to the contention of worker pool lock with IRQ disabled, for example: NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G L 5.10.32-t1.el7.twitter.x86_64 #1 Hardware name: TYAN F5AMT /z /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021 Workqueue: events memory_failure_work_func RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0 Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47 RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002 RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140 RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000 R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001 R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068 FS: 0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0 Call Trace: __queue_work+0xd6/0x3c0 queue_work_on+0x1c/0x30 uncharge_batch+0x10e/0x110 mem_cgroup_uncharge_list+0x6d/0x80 release_pages+0x37f/0x3f0 __pagevec_release+0x1c/0x50 __invalidate_mapping_pages+0x348/0x380 inode_lru_isolate+0x10a/0x160 __list_lru_walk_one+0x7b/0x170 list_lru_walk_one+0x4a/0x60 prune_icache_sb+0x37/0x50 super_cache_scan+0x123/0x1a0 do_shrink_slab+0x10c/0x2c0 shrink_slab+0x1f1/0x290 drop_slab_node+0x4d/0x70 soft_offline_page+0x1ac/0x5b0 memory_failure_work_func+0x6a/0x90 process_one_work+0x19e/0x340 worker_thread+0x30/0x360 kthread+0x116/0x130 The lockup made the machine is quite unusable. And it also made the most workingset gone, the reclaimabled slab caches were reduced from 12G to 300MB, the page caches were decreased from 17G to 4G. But the most disappointing thing is all the effort doesn't make the page offline, it just returns: soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 () It seems the aggressive behavior for non-LRU page didn't pay back, so it doesn't make too much sense to keep it considering the terrible side effect. Link: https://lkml.kernel.org/r/20210819054116.266126-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reported-by: David Mackey <tdmackey@twitter.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:15 -07:00
Naoya Horiguchi	01c8d337d1	mm/sparse: set SECTION_NID_SHIFT to 6 Currently SECTION_NID_SHIFT is set to 3, which is incorrect because bit 3 and 4 can be overlapped by sub-field for early NID, and can be unexpectedly set on NUMA systems. There are a few non-critical issues related to this: - Having SECTION_TAINT_ZONE_DEVICE set for wrong sections forces pfn_to_online_page() through the slow path, but doesn't actually break the kernel. - A kdump generation tool like makedumpfile uses this field to calculate the physical address to read. So wrong bits can make the tool access to wrong address and fail to create kdump. This can be avoided by the tool, so it's not critical. To fix it, set SECTION_NID_SHIFT to 6 which is the minimum number of available bits of section flag field. Link: https://lkml.kernel.org/r/20210707045548.810271-1-naoya.horiguchi@linux.dev Fixes: `1f90a3477d` ("mm: teach pfn_to_online_page() about ZONE_DEVICE section collisions") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Kazuhito Hagio <k-hagio-ab@nec.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Wang Wensheng <wangwensheng4@huawei.com> Cc: Rui Xiang <rui.xiang@huawei.com> Cc: Kazu <k-hagio-ab@nec.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:14 -07:00
Ohhoon Kwon	11e02d3729	mm: sparse: remove __section_nr() function As the last users of __section_nr() are gone, let's remove unused function __section_nr(). Link: https://lkml.kernel.org/r/20210707150212.855-4-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:14 -07:00
Ohhoon Kwon	fc1f5e980a	mm: sparse: pass section_nr to find_memory_block With CONFIG_SPARSEMEM_EXTREME enabled, __section_nr() which converts mem_section to section_nr could be costly since it iterates all section roots to check if the given mem_section is in its range. On the other hand, __nr_to_section() which converts section_nr to mem_section can be done in O(1). Let's pass section_nr instead of mem_section ptr to find_memory_block() in order to reduce needless iterations. Link: https://lkml.kernel.org/r/20210707150212.855-3-ohoono.kwon@samsung.com Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:14 -07:00
Greg Kroah-Hartman	e15710bf04	mm: change fault_in_pages_* to have an unsigned size parameter fault_in_pages_writeable() and fault_in_pages_readable() treat the size parameter as unsigned, doing pointer math with the value, so make this explicit and set it to be a size_t type which all callers currently treat it as anyway. This solves the issue where static checkers get nervous seeing pointer arithmetic happening with a signed value. Link: https://lkml.kernel.org/r/20210727111136.457638-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reported-by: Jordy Zomer <jordy@pwning.systems> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:13 -07:00
Christoph Hellwig	f358afc52c	mm: remove flush_kernel_dcache_page flush_kernel_dcache_page is a rather confusing interface that implements a subset of flush_dcache_page by not being able to properly handle page cache mapped pages. The only callers left are in the exec code as all other previous callers were incorrect as they could have dealt with page cache pages. Replace the calls to flush_kernel_dcache_page with calls to flush_dcache_page, which for all architectures does either exactly the same thing, can contains one or more of the following: 1) an optimization to defer the cache flush for page cache pages not mapped into userspace 2) additional flushing for mapped page cache pages if cache aliases are possible Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Alex Shi <alexs@kernel.org> Cc: Geoff Levand <geoff@infradead.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Nick Hu <nickhu@andestech.com> Cc: Paul Cercueil <paul@crapouillou.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:13 -07:00
Miaohe Lin	bec49c067c	mm, memcg: remove unused functions Since commit `2d146aa3aa` ("mm: memcontrol: switch to rstat"), last user of memcg_stat_item_in_bytes() is gone. And since commit `fa40d1ee9f` ("mm: vmscan: memcontrol: remove mem_cgroup_select_victim_node()"), only the declaration of mem_cgroup_select_victim_node() is remained here. Remove them. Link: https://lkml.kernel.org/r/20210807082835.61281-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alex Shi <alexs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:13 -07:00
Vasily Averin	55a68c8239	memcg: replace in_interrupt() by !in_task() in active_memcg() set_active_memcg() uses in_interrupt() check to select proper storage for cgroup: pointer on task struct or per-cpu pointer. It isn't fully correct: obsoleted in_interrupt() includes tasks with disabled BH. It's better to use '!in_task()' instead. Link: https://lkml.org/lkml/2021/7/26/487 Link: https://lkml.kernel.org/r/ed4448b0-4970-616f-7368-ef9dd3cb628d@virtuozzo.com Fixes: `37d5985c00` ("mm: kmem: prepare remote memcg charging infra for interrupt contexts") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:13 -07:00
Shakeel Butt	96e51ccf1a	memcg: cleanup racy sum avoidance code We used to have per-cpu memcg and lruvec stats and the readers have to traverse and sum the stats from each cpu. This summing was racy and may expose transient negative values. So, an explicit check was added to avoid such scenarios. Now these stats are moved to rstat infrastructure and are no more per-cpu, so we can remove the fixup for transient negative values. Link: https://lkml.kernel.org/r/20210728012243.3369123-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:13 -07:00
Shakeel Butt	aa48e47e39	memcg: infrastructure to flush memcg stats At the moment memcg stats are read in four contexts: 1. memcg stat user interfaces 2. dirty throttling 3. page fault 4. memory reclaim Currently the kernel flushes the stats for first two cases. Flushing the stats for remaining two casese may have performance impact. Always flushing the memcg stats on the page fault code path may negatively impacts the performance of the applications. In addition flushing in the memory reclaim code path, though treated as slowpath, can become the source of contention for the global lock taken for stat flushing because when system or memcg is under memory pressure, many tasks may enter the reclaim path. This patch uses following mechanisms to solve these challenges: 1. Periodically flush the stats from root memcg every 2 seconds. This will time limit the out of sync stats. 2. Asynchronously flush the stats after fixed number of stat updates. In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for 2 seconds. 3. For avoiding thundering herd to flush the stats particularly from the memory reclaim context, introduce memcg local spinlock and let only one flusher active at a time. This could have been done through cgroup_rstat_lock lock but that lock is used by other subsystem and for userspace reading memcg stats. So, it is better to keep flushers introduced by this patch decoupled from cgroup_rstat_lock. However we would have to use irqsafe version of rstat flush but that is fine as this code path will be flushing for whole tree and do the work for everyone. No one will be waiting for that worker. [shakeelb@google.com: fix sleep-in-wrong context bug] Link: https://lkml.kernel.org/r/20210716212137.1391164-2-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Shakeel Butt	7e1c0d6f58	memcg: switch lruvec stats to rstat The commit `2d146aa3aa` ("mm: memcontrol: switch to rstat") switched memcg stats to rstat infrastructure but skipped the conversion of the lruvec stats as such stats are read in the performance critical code paths and flushing stats may have impacted the performances of the applications. This patch converts the lruvec stats to rstat and later patches add mechanisms to keep the performance impact to minimum. The rstat conversion comes with the price i.e. memory cost. Effectively this patch reverts the savings done by the commit `f3344adf38` ("mm: memcontrol: optimize per-lruvec stats counter memory usage"). However this cost is justified due to negative impact of the inaccurate lruvec stats on many heuristics. One such case is reported in [1]. The memory reclaim code is filled with plethora of heuristics and many of those heuristics reads the lruvec stats. So, inaccurate stats can make such heuristics ineffective. [1] reports the impact of inaccurate lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats can impact the deactivation and aging anon heuristics as well. [1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/ Link: https://lkml.kernel.org/r/20210716212137.1391164-1-shakeelb@google.com Link: https://lkml.kernel.org/r/20210714013948.270662-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Suren Baghdasaryan	01c4b28cd2	mm, memcg: inline swap-related functions to improve disabled memcg config Inline mem_cgroup_try_charge_swap, mem_cgroup_uncharge_swap and cgroup_throttle_swaprate functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~1% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Suren Baghdasaryan	2c8d8f97ae	mm, memcg: inline mem_cgroup_{charge/uncharge} to improve disabled memcg config Inline mem_cgroup_{charge/uncharge} and mem_cgroup_uncharge_list functions functions to perform mem_cgroup_disabled static key check inline before calling the main body of the function. This minimizes the memcg overhead in the pagefault and exit_mmap paths when memcgs are disabled using cgroup_disable=memory command-line option. This change results in ~0.4% overhead reduction when running PFT test [1] comparing {CONFIG_MEMCG=n} against {CONFIG_MEMCG=y, cgroup_disable=memory} configuration on an 8-core ARM64 Android device. [1] https://lkml.org/lkml/2006/8/29/294 also used in mmtests suite Link: https://lkml.kernel.org/r/20210713010934.299876-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alex Shi <alexs@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Hugh Dickins	5e6e5a12a4	huge tmpfs: shmem_is_huge(vma, inode, index) Extend shmem_huge_enabled(vma) to shmem_is_huge(vma, inode, index), so that a consistent set of checks can be applied, even when the inode is accessed through read/write syscalls (with NULL vma) instead of mmaps (the index argument is seldom of interest, but required by mount option "huge=within_size"). Clean up and rearrange the checks a little. This then replaces the checks which shmem_fault() and shmem_getpage_gfp() were making, and eliminates the SGP_HUGE and SGP_NOHUGE modes. Replace a couple of 0s by explicit SHMEM_HUGE_NEVERs; and replace the obscure !shmem_mapping() symlink check by explicit S_ISLNK() - nothing else needs that symlink check, so leave it there in shmem_getpage_gfp(). Link: https://lkml.kernel.org/r/23a77889-2ddc-b030-75cd-44ca27fd4d1@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Hugh Dickins	acdd9f8e0f	huge tmpfs: SGP_NOALLOC to stop collapse_file() on race khugepaged's collapse_file() currently uses SGP_NOHUGE to tell shmem_getpage() not to try allocating a huge page, in the very unlikely event that a racing hole-punch removes the swapped or fallocated page as soon as i_pages lock is dropped. We want to consolidate shmem's huge decisions, removing SGP_HUGE and SGP_NOHUGE; but cannot quite persuade ourselves that it's okay to regress the protection in this case - Yang Shi points out that the huge page would remain indefinitely, charged to root instead of the intended memcg. collapse_file() should not even allocate a small page in this case: why proceed if someone is punching a hole? SGP_READ is almost the right flag here, except that it optimizes away from a fallocated page, with NULL to tell caller to fill with zeroes (like a hole); whereas collapse_file()'s sequence relies on using a cache page. Add SGP_NOALLOC just for this. There are too many consecutive "if (page"s there in shmem_getpage_gfp(): group it better; and fix the outdated "bring it back from swap" comment. Link: https://lkml.kernel.org/r/1355343b-acf-4653-ef79-6aee40214ac5@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Hugh Dickins	d144bf6205	huge tmpfs: fix split_huge_page() after FALLOC_FL_KEEP_SIZE A successful shmem_fallocate() guarantees that the extent has been reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used. But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to split huge pages and free their excess beyond i_size; and by other uses of split_huge_page() near i_size. It's sad to add a shmem inode field just for this, but I did not find a better way to keep the guarantee. A flag to say KEEP_SIZE has been used would be cheaper, but I'm averse to unclearable flags. The fallocend field is not perfect either (many disjoint ranges might be fallocated), but good enough; and gains another use later on. Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com Fixes: `779750d20b` ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
Sebastian Andrzej Siewior	bf11b9a8e9	shmem: use raw_spinlock_t for ->stat_lock Each CPU has SHMEM_INO_BATCH inodes available in `->ino_batch' which is per-CPU. Access here is serialized by disabling preemption. If the pool is empty, it gets reloaded from `->next_ino'. Access here is serialized by ->stat_lock which is a spinlock_t and can not be acquired with disabled preemption. One way around it would make per-CPU ino_batch struct containing the inode number a local_lock_t. Another solution is to promote ->stat_lock to a raw_spinlock_t. The critical sections are short. The mpol_put() must be moved outside of the critical section to avoid invoking the destructor with disabled preemption. Link: https://lkml.kernel.org/r/20210806142916.jdwkb5bx62q5fwfo@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
John Hubbard	3969b1a654	mm: delete unused get_kernel_page() get_kernel_page() was added in 2012 by [1]. It was used for a while for NFS, but then in 2014, a refactoring [2] removed all callers, and it has apparently not been used since. Remove get_kernel_page() because it has no callers. [1] commit `18022c5d86` ("mm: add get_kernel_page[s] for pinning of kernel addresses for I/O") [2] commit `91f79c43d1` ("new helper: iov_iter_get_pages_alloc()") Link: https://lkml.kernel.org/r/20210729221847.1165665-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
John Hubbard	9857a17f20	mm/gup: remove try_get_page(), call try_get_compound_head() directly try_get_page() is very similar to try_get_compound_head(), and in fact try_get_page() has fallen a little behind in terms of maintenance: try_get_compound_head() handles speculative page references more thoroughly. There are only two try_get_page() callsites, so just call try_get_compound_head() directly from those, and remove try_get_page() entirely. Also, seeing as how this changes try_get_compound_head() into a non-static function, provide some kerneldoc documentation for it. Link: https://lkml.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
John Hubbard	54d516b1d6	mm/gup: small refactoring: simplify try_grab_page() try_grab_page() does the same thing as try_grab_compound_head(..., refs=1, ...), just with a different API. So there is a lot of code duplication there. Change try_grab_page() to call try_grab_compound_head(), while keeping the API contract identical for callers. Also, now that try_grab_compound_head() always has a caller, remove the __maybe_unused annotation. Link: https://lkml.kernel.org/r/20210813044133.1536842-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:11 -07:00
Jing Yangyang	6de522d166	include/linux/buffer_head.h: fix boolreturn.cocci warnings ./include/linux/buffer_head.h:412:64-65:WARNING:return of 0/1 in function 'has_bh_in_lru' with return type bool Return statements in functions returning bool should use true/false instead of 1/0. Generated by: scripts/coccinelle/misc/boolreturn.cocci Link: https://lkml.kernel.org/r/20210824055828.58783-1-deng.changcheng@zte.com.cn Signed-off-by: Jing Yangyang <jing.yangyang@zte.com.cn> Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
Shakeel Butt	7490a2d248	writeback: memcg: simplify cgroup_writeback_by_id Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty pages for a memcg. However mem_cgroup_wb_stats() does a lot more than just get the number of dirty pages. Just directly get the number of dirty pages instead of calling mem_cgroup_wb_stats(). Also cgroup_writeback_by_id() is only called for best-effort dirty flushing, so remove the unused 'nr' parameter and no need to explicitly flush memcg stats. Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
Jan Kara	45a2966fd6	writeback: fix bandwidth estimate for spiky workload Michael Stapelberg has reported that for workload with short big spikes of writes (GCC linker seem to trigger this frequently) the write throughput is heavily underestimated and tends to steadily sink until it reaches zero. This has rather bad impact on writeback throttling (causing stalls). The problem is that writeback throughput estimate gets updated at most once per 200 ms. One update happens early after we submit pages for writeback (at that point writeout of only small fraction of pages is completed and thus observed throughput is tiny). Next update happens only during the next write spike (updates happen only from inode writeback and dirty throttling code) and if that is more than 1s after previous spike, we decide system was idle and just ignore whatever was written until this moment. Fix the problem by making sure writeback throughput estimate is also updated shortly after writeback completes to get reasonable estimate of throughput for spiky workloads. [jack@suse.cz: avoid division by 0 in wb_update_dirty_ratelimit()] Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+linux@google.com Link: https://lkml.kernel.org/r/20210713104716.22868-3-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Michael Stapelberg <stapelberg+linux@google.com> Tested-by: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
Jan Kara	fee468fdf4	writeback: reliably update bandwidth estimation Currently we trigger writeback bandwidth estimation from balance_dirty_pages() and from wb_writeback(). However neither of these need to trigger when the system is relatively idle and writeback is triggered e.g. from fsync(2). Make sure writeback estimates happen reliably by triggering them from do_writepages(). Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
Jan Kara	633a2abb9e	writeback: track number of inodes under writeback Patch series "writeback: Fix bandwidth estimates", v4. Fix estimate of writeback throughput when device is not fully busy doing writeback. Michael Stapelberg has reported that such workload (e.g. generated by linking) tends to push estimated throughput down to 0 and as a result writeback on the device is practically stalled. The first three patches fix the reported issue, the remaining two patches are unrelated cleanups of problems I've noticed when reading the code. This patch (of 4): Track number of inodes under writeback for each bdi_writeback structure. We will use this to decide whether wb does any IO and so we can estimate its writeback throughput. In principle we could use number of pages under writeback (WB_WRITEBACK counter) for this however normal percpu counter reads are too inaccurate for our purposes and summing the counter is too expensive. Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Michael Stapelberg <stapelberg+linux@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
Matthew Wilcox (Oracle)	4f3eaf452a	mm: report a more useful address for reclaim acquisition A recent lockdep report included these lines: [ 96.177910] 3 locks held by containerd/770: [ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x115/0x770 [ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at: get_swap_device+0x33/0x140 [ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 While it was not useful to that bug report to know where the reclaim lock had been acquired, it might be useful under other circumstances. Allow the caller of __fs_reclaim_acquire to specify the instruction pointer to use. Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:10 -07:00
David Hildenbrand	592ca09be8	fs: update documentation of get_write_access() and friends As VM_DENYWRITE does no longer exists, let's spring-clean the documentation of get_write_access() and friends. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:02 +02:00
David Hildenbrand	6128b3af2a	mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() Let's also remove masking off MAP_DENYWRITE from ksys_mmap_pgoff(): the last in-tree occurrence of MAP_DENYWRITE is now in LEGACY_MAP_MASK, which accepts the flag e.g., for MAP_SHARED_VALIDATE; however, the flag is ignored throughout the kernel now. Add a comment to LEGACY_MAP_MASK stating that MAP_DENYWRITE is ignored. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
David Hildenbrand	8d0920bde5	mm: remove VM_DENYWRITE All in-tree users of MAP_DENYWRITE are gone. MAP_DENYWRITE cannot be set from user space, so all users are gone; let's remove it. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
David Hildenbrand	fe69d560b5	kernel/fork: always deny write access to current MM exe_file We want to remove VM_DENYWRITE only currently only used when mapping the executable during exec. During exec, we already deny_write_access() the executable, however, after exec completes the VMAs mapped with VM_DENYWRITE effectively keeps write access denied via deny_write_access(). Let's deny write access when setting or replacing the MM exe_file. With this change, we can remove VM_DENYWRITE for mapping executables. Make set_mm_exe_file() return an error in case deny_write_access() fails; note that this should never happen, because exec code does a deny_write_access() early and keeps write access denied when calling set_mm_exe_file. However, it makes the code easier to read and makes set_mm_exe_file() and replace_mm_exe_file() look more similar. This represents a minor user space visible change: sys_prctl(PR_SET_MM_MAP/EXE_FILE) can now fail if the file is already opened writable. Also, after sys_prctl(PR_SET_MM_MAP/EXE_FILE) the file cannot be opened writable. Note that we can already fail with -EACCES if the file doesn't have execute permissions. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
David Hildenbrand	35d7bdc860	kernel/fork: factor out replacing the current MM exe_file Let's factor the main logic out into replace_mm_exe_file(), such that all mm->exe_file logic is contained in kernel/fork.c. While at it, perform some simple cleanups that are possible now that we're simplifying the individual functions. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: David Hildenbrand <david@redhat.com>	2021-09-03 18:42:01 +02:00
Kate Hsuan	7a8526a5cd	libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI for Samsung 860 and 870 SSD. Many users are reporting that the Samsung 860 and 870 SSD are having various issues when combined with AMD/ATI (vendor ID 0x1002) SATA controllers and only completely disabling NCQ helps to avoid these issues. Always disabling NCQ for Samsung 860/870 SSDs regardless of the host SATA adapter vendor will cause I/O performance degradation with well behaved adapters. To limit the performance impact to ATI adapters, introduce the ATA_HORKAGE_NO_NCQ_ON_ATI flag to force disable NCQ only for these adapters. Also, two libata.force parameters (noncqati and ncqati) are introduced to disable and enable the NCQ for the system which equipped with ATI SATA adapter and Samsung 860 and 870 SSDs. The user can determine NCQ function to be enabled or disabled according to the demand. After verifying the chipset from the user reports, the issue appears on AMD/ATI SB7x0/SB8x0/SB9x0 SATA Controllers and does not appear on recent AMD SATA adapters. The vendor ID of ATI should be 0x1002. Therefore, ATA_HORKAGE_NO_NCQ_ON_AMD was modified to ATA_HORKAGE_NO_NCQ_ON_ATI. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=201693 Signed-off-by: Kate Hsuan <hpa@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20210903094411.58749-1-hpa@redhat.com Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2021-09-03 08:06:02 -06:00
Linus Torvalds	a9c9a6f741	SCSI misc on 20210902 This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade. Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com> -----BEGIN PGP SIGNATURE----- iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYTD/TiYcamFtZXMuYm90 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdUkAQCjb3Ux 4K9438mMelHlzM4er1S1IJ0WNnvObaVMNO9LBwD+JUz+rHsrKvuEX9j3g3C3u6JH hC3BUEW8f2LLnujWanQ= =lC5o -----END PGP SIGNATURE----- Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (ufs, qla2xxx, target, smartpqi, lpfc, mpt3sas). The core change causing the most churn was replacing the command request field request with a macro, allowing us to offset map to it and remove the redundant field; the same was also done for the tag field. The most impactful change is the final removal of scsi_ioctl, which has been deprecated for over a decade" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits) scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1 scsi: ufs: ufs-exynos: Fix static checker warning scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Use the proper SCSI midlayer interfaces for PI scsi: lpfc: Copyright updates for 14.0.0.1 patches scsi: lpfc: Update lpfc version to 14.0.0.1 scsi: lpfc: Add bsg support for retrieving adapter cmf data scsi: lpfc: Add cmf_info sysfs entry scsi: lpfc: Add debugfs support for cm framework buffers scsi: lpfc: Add support for maintaining the cm statistics buffer scsi: lpfc: Add rx monitoring statistics scsi: lpfc: Add support for the CM framework scsi: lpfc: Add cmfsync WQE support scsi: lpfc: Add support for cm enablement buffer scsi: lpfc: Add cm statistics buffer support scsi: lpfc: Add EDC ELS support scsi: lpfc: Expand FPIN and RDF receive logging scsi: lpfc: Add MIB feature enablement support scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware scsi: fc: Add EDC ELS definition ...	2021-09-02 15:09:46 -07:00
Linus Torvalds	23852bec53	RDMA v5.15 merge window Pull Request - Various cleanup and small features for rtrs - kmap_local_page() conversions - Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns - Cache the IB subnet prefix - Rework how CRC is calcuated in rxe - Clean reference counting in iwpm's netlink - Pull object allocation and lifecycle for user QPs to the uverbs core code - Several small hns features and continued general code cleanups - Fix the scatterlist confusion of orig_nents/nents introduced in an earlier patch creating the append operation -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmEudRgACgkQOG33FX4g mxraJA//c6bMxrrTVrzmrtrkyYD4tYWE8RDfgvoyZtleZnnEOJeunCQWakQrpJSv ukSnOGCA3PtnmRMdV54f/11YJ/7otxOJodSO7jWsIoBrqG/lISAdX8mn2iHhrvJ0 dIaFEFPLy0WqoMLCJVIYIupR0IStVHb/mWx0uYL4XnnoYKyt7f7K5JMZpNWMhDN2 ieJw0jfrvEYm8pipWuxUvB16XARlzAWQrjqLpMRI+jFRpbDVBY21dz2/LJvOJPrA LcQ+XXsV/F659ibOAGm6bU4BMda8fE6Lw90B/gmhSswJ205NrdziF5cNYHP0QxcN oMjrjSWWHc9GEE7MTipC2AH8e36qob16Q7CK+zHEJ+ds7R6/O/8XmED1L8/KFpNA FGqnjxnxsl1y27mUegfj1Hh8PfoDp2oVq0lmpEw0CYo4cfVzHSMRrbTR//XmW628 Ie/mJddpFK4oLk+QkSNjSLrnxOvdTkdA58PU0i84S5eUVMNm41jJDkxg2J7vp0Zn sclZsclhUQ9oJ5Q2so81JMWxu4JDn7IByXL0ULBaa6xwQTiVEnyvSxSuPlflhLRW 0vI2ylATYKyWkQqyX7VyWecZJzwhwZj5gMMWmoGsij8bkZhQ/VaQMaesByzSth+h NV5UAYax4GqyOQ/tg/tqT6e5nrI1zof87H64XdTCBpJ7kFyQ/oA= =ZwOe -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "This is quite a small cycle, no major series stands out. The HNS and rxe drivers saw the most activity this cycle, with rxe being broken for a good chunk of time. The significant deleted line count is due to a SPDX cleanup series. Summary: - Various cleanup and small features for rtrs - kmap_local_page() conversions - Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns - Cache the IB subnet prefix - Rework how CRC is calcuated in rxe - Clean reference counting in iwpm's netlink - Pull object allocation and lifecycle for user QPs to the uverbs core code - Several small hns features and continued general code cleanups - Fix the scatterlist confusion of orig_nents/nents introduced in an earlier patch creating the append operation" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (90 commits) RDMA/mlx5: Relax DCS QP creation checks RDMA/hns: Delete unnecessary blank lines. RDMA/hns: Encapsulate the qp db as a function RDMA/hns: Adjust the order in which irq are requested and enabled RDMA/hns: Remove RST2RST error prints for hw v1 RDMA/hns: Remove dqpn filling when modify qp from Init to Init RDMA/hns: Fix QP's resp incomplete assignment RDMA/hns: Fix query destination qpn RDMA/hfi1: Convert to SPDX identifier IB/rdmavt: Convert to SPDX identifier RDMA/hns: Bugfix for incorrect association between dip_idx and dgid RDMA/hns: Bugfix for the missing assignment for dip_idx RDMA/hns: Bugfix for data type of dip_idx RDMA/hns: Fix incorrect lsn field RDMA/irdma: Remove the repeated declaration RDMA/core/sa_query: Retry SA queries RDMA: Use the sg_table directly and remove the opencoded version from umem lib/scatterlist: Fix wrong update of orig_nents lib/scatterlist: Provide a dedicated function to support table append RDMA/hns: Delete unused hns bitmap interface ...	2021-09-02 14:47:21 -07:00
Linus Torvalds	75d6e7d9ce	Nothing changed in the clk framework core this time around. We did get some updates to the basic clk types to use determine_rate for the divider type and add a power of two fractional divider flag though. Otherwise, this is a collection of clk driver updates. More than half the diffstat is in the Qualcomm clk driver where we add a bunch of data to describe clks on various SoCs and fix bugs. The other big new thing in here is the Mediatek MT8192 clk driver. That's been under review for a while and it's nice to see that it's finally upstream. Beyond that it's the usual set of minor fixes and tweaks to clk drivers. There are some non-clk driver bits in here which have all been acked by the respective maintainers. New Drivers: - Support video, gpu, display clks on qcom sc7280 SoCs - GCC clks on qcom MSM8953, SM4250/6115, and SM6350 SoCs - Multimedia clks (MMCC) on qcom MSM8994/MSM8992 - RPMh clks on qcom SM6350 SoCs - Support for Mediatek MT8192 SoCs - Add display (DU and DSI) clocks on Renesas R-Car V3U - Add I2C, DMAC, USB, sound (SSIF-2), GPIO, CANFD, and ADC clocks and resets on Renesas RZ/G2L Updates: - Support the SD/OE pin on IDT VersaClock 5 and 6 clock generators - Add power of two flag to fractional divider clk type - Migrate some clk drivers to clk_divider_ops.determine_rate - Migrate to clk_parent_data in gcc-sdm660 - Fix CLKOUT clocks on i.MX8MM and i.MX8MN by using imx_clk_hw_mux2 - Switch from .round_rate to .determine_rate in clk-divider-gate - Fix clock tree update for TF-A controlled clocks for all i.MX8M - Add missing M7 core clock for i.MX8MN - YAML conversion of rk3399 clock controller binding - Removal of GRF dependency for the rk3328/rk3036 pll types - Drop CLK_IS_CRITICAL flag from Tegra fuse clk - Make CLK_R9A06G032 Kconfig symbol invisible - Convert various DT bindings to YAML -----BEGIN PGP SIGNATURE----- iQJFBAABCAAvFiEE9L57QeeUxqYDyoaDrQKIl8bklSUFAmExEooRHHNib3lkQGtl cm5lbC5vcmcACgkQrQKIl8bklSXXBhAAvhHm4fcm3fRjNdfImd+jDEl8XSvg+w43 adSnmVxbYM6ZVNOiJ4CJWHbj0hOY/PJnsQYWbV0xXvXW+zXva6p495MMHHOGSi2o lMgZVMvj5UAwu304ZC9Xfn31dwo8XdGrltp4JqIcI2NEBMh1/PlZW22esT+jDiWN 3SWFD3M7lu88xTREyiEu11FY3z/KiGzbGlqYcbivx1X0sHVnBRbl4qcqZway+BmQ 95Ma4YWwhvDGYc+ypKH2EPxs/LikHXj05nMooigy65DOQ5wrM4L1eWkwmVUf6h+e t4x7sAVysLnkihzdH5r2pw6CcAIom76v8w0+maSfk+jINUu1LeGVuat1eXSesFTu 49o+uTKRghkUe/Qh6r+7lbo8AZXQq+wUsLTYRuaWT/mSb+svAtJaUWAru8tJnMlH oK6OehcQwz4nGhH0HnBK1jCVdtgckxPBw8F/GYN9rYhsccIe0XmFjX1rzMM3s8De PLl6QO7Xzd+xb/FwAU8+S1WpKFdPU6ILTUnI2Ma3Mn/gfjZEZHvWAdTjo4oZGEsw +N4n924ArptbeSLRrlNUtqx4BVDL5yo54xS5gefNpmD5yezO7aoUtN0aGcBq+01p Qw0N5hKtcdsNYLBEFSvBGcZZmErMZbPwMXHWiUwNymXBDzJKgj5d+ks+1vJ3iCNW R5r9hvATJPQ= =Rrqg -----END PGP SIGNATURE----- Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk updates from Stephen Boyd: "Nothing changed in the clk framework core this time around. We did get some updates to the basic clk types to use determine_rate for the divider type and add a power of two fractional divider flag though. Otherwise, this is a collection of clk driver updates. More than half the diffstat is in the Qualcomm clk driver where we add a bunch of data to describe clks on various SoCs and fix bugs. The other big new thing in here is the Mediatek MT8192 clk driver. That's been under review for a while and it's nice to see that it's finally upstream. Beyond that it's the usual set of minor fixes and tweaks to clk drivers. There are some non-clk driver bits in here which have all been acked by the respective maintainers. New Drivers: - Support video, gpu, display clks on qcom sc7280 SoCs - GCC clks on qcom MSM8953, SM4250/6115, and SM6350 SoCs - Multimedia clks (MMCC) on qcom MSM8994/MSM8992 - RPMh clks on qcom SM6350 SoCs - Support for Mediatek MT8192 SoCs - Add display (DU and DSI) clocks on Renesas R-Car V3U - Add I2C, DMAC, USB, sound (SSIF-2), GPIO, CANFD, and ADC clocks and resets on Renesas RZ/G2L Updates: - Support the SD/OE pin on IDT VersaClock 5 and 6 clock generators - Add power of two flag to fractional divider clk type - Migrate some clk drivers to clk_divider_ops.determine_rate - Migrate to clk_parent_data in gcc-sdm660 - Fix CLKOUT clocks on i.MX8MM and i.MX8MN by using imx_clk_hw_mux2 - Switch from .round_rate to .determine_rate in clk-divider-gate - Fix clock tree update for TF-A controlled clocks for all i.MX8M - Add missing M7 core clock for i.MX8MN - YAML conversion of rk3399 clock controller binding - Removal of GRF dependency for the rk3328/rk3036 pll types - Drop CLK_IS_CRITICAL flag from Tegra fuse clk - Make CLK_R9A06G032 Kconfig symbol invisible - Convert various DT bindings to YAML" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (128 commits) dt-bindings: clock: samsung: fix header path in example clk: tegra: fix old-style declaration clk: qcom: Add SM6350 GCC driver MAINTAINERS: clock: include S3C and S5P in Samsung SoC clock entry dt-bindings: clock: samsung: convert S5Pv210 AudSS to dtschema dt-bindings: clock: samsung: convert Exynos AudSS to dtschema dt-bindings: clock: samsung: convert Exynos4 to dtschema dt-bindings: clock: samsung: convert Exynos3250 to dtschema dt-bindings: clock: samsung: convert Exynos542x to dtschema dt-bindings: clock: samsung: add bindings for Exynos external clock dt-bindings: clock: samsung: convert Exynos5250 to dtschema clk: vc5: Add properties for configuring SD/OE behavior clk: vc5: Use dev_err_probe dt-bindings: clk: vc5: Add properties for configuring the SD/OE pin dt-bindings: clock: brcm,iproc-clocks: fix armpll properties clk: zynqmp: Fix kernel-doc format clk: at91: clk-generated: Limit the requested rate to our range clk: ralink: avoid to set 'CLK_IS_CRITICAL' flag for gates clk: zynqmp: Fix a memory leak clk: zynqmp: Check the return type ...	2021-09-02 14:17:24 -07:00
Linus Torvalds	7ba88a2a09	platform-drivers-x86 for v5.15-1 Highlights: - Move all the Intel drivers into their own subdir(s) (mostly Kate's work) - New meraki-mx100 platform driver - Asus WMI driver enhancements, including /sys/firmware/acpi/platform_profile support - New BIOS SAR driver for Intel M.2 WWAM modems - Alder Lake support for the Intel PMC driver - A whole bunch of cleanups + fixes all over the place The following is an automated git shortlog grouped by driver: BIOS SAR driver for Intel M.2 Modem: - BIOS SAR driver for Intel M.2 Modem ISST: - use semi-colons instead of commas - Fix optimization with use of numa Replace deprecated CPU-hotplug functions.: - Replace deprecated CPU-hotplug functions. Update Mario Limonciello's email address in the docs: - Update Mario Limonciello's email address in the docs acer-wmi: - Add Turbo Mode support for Acer PH315-53 add meraki-mx100 platform driver: - add meraki-mx100 platform driver asus-nb-wmi: - Add tablet_mode_sw=lid-flip quirk for the TP200s - Allow configuring SW_TABLET_MODE method with a module option asus-wmi: - Fix "unsigned 'retval' is never less than zero" smatch warning - Delete impossible condition - Add support for platform_profile - Add egpu enable method - Add dgpu disable method - Add panel overdrive functionality dell-smbios: - Remove unused dmi_system_id table dell-smbios-wmi: - Add missing kfree in error-exit from run_smbios_call - Avoid false-positive memcpy() warning dell-smo8800: - Convert to be a platform driver dual_accel_detect: - Use the new i2c_acpi_client_count() helper gigabyte-wmi: - add support for B450M S2H V2 - add support for X570 GAMING X hp_accel: - Convert to be a platform driver - Remove _INI method call i2c: - acpi: Add an i2c_acpi_client_count() helper function i2c-multi-instantiate: - Use the new i2c_acpi_client_count() helper ideapad-laptop: - Fix Legion 5 Fn lock LED intel-hid: - Move to intel sub-directory intel-rst: - Move to intel sub-directory intel-smartconnect: - Move to intel sub-directory intel-uncore-frequency: - Move to intel sub-directory intel-vbtn: - Move to intel sub-directory intel-wmi-sbl-fw-update: - Move to intel sub-directory intel-wmi-thunderbolt: - Move to intel sub-directory intel_atomisp2: - Move to intel sub-directory intel_bxtwc_tmu: - Move to intel sub-directory intel_cht_int33fe: - Use the new i2c_acpi_client_count() helper intel_chtdc_ti_pwrbtn: - Move to intel sub-directory intel_int0002_vgpio: - Move to intel sub-directory intel_mrfld_pwrbtn: - Move to intel sub-directory intel_oaktrail: - Move to intel sub-directory intel_pmc_core: - Move to intel sub-directory - Prevent possibile overflow intel_pmt_telemetry: - Ignore zero sized entries intel_punit_ipc: - Move to intel sub-directory intel_scu_ipc: - Fix doc of intel_scu_ipc_dev_command_with_size() intel_speed_select_if: - Move to intel sub-directory intel_telemetry: - Move to intel sub-directory intel_turbo_max_3: - Move to intel sub-directory lg-laptop: - Use correct event for keyboard backlight FN-key - Use correct event for touchpad toggle FN-key - Support for battery charge limit on newer models platform/mellanox: - mlxbf-pmc: fix kernel-doc notation platform/surface: - aggregator: Use y instead of objs in Makefile - surface3_power: Use i2c_acpi_get_i2c_resource() helper platform/x86/intel: - pmc/core: Add GBE Package C10 fix for Alder Lake PCH - pmc/core: Add Alder Lake low power mode support for pmc core - pmc/core: Add Latency Tolerance Reporting (LTR) support to Alder Lake - pmc/core: Add Alderlake support to pmc core driver - int3472: Use y instead of objs in Makefile - pmt: Use y instead of objs in Makefile - int33fe: Use y instead of objs in Makefile - Move Intel PMT drivers to new subfolder thermal/drivers/intel: - Move intel_menlow to thermal drivers think-lmi: - add debug_cmd -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmEw1KAUHGhkZWdvZWRl QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9wk7Qf/dsaaDgx7aC6DfKzdcMgfeLIdTaGm a6svNXM2t/JFdvhjYzxA+4QlQgco7zkN06iRlWbObEonSUsHlGlwEOHX60VgcopO qJaqnznmfdXUocFTnA+5acJXabNaw7xkKHS0K61UWgk+mm6aMuygpKxULnNTa4X+ p3HoU6uXFckpoA/Jstzo5UfegNYhg11bflNd7XN4F3rMCbbNHAsWlf4oVr2YsEHa wECW+1e8wZl4BInUzoXQhilRoybJWXWJ8sLsvQfDXLs9aNoLdDqu9p0MuXEW5QqE wNt26SNNAP2L49BD6kaJszV5Ry/jNSEtVkWwkrzHbGTZoOEyMYas8pm6Uw== =iK1W -----END PGP SIGNATURE----- Merge tag 'platform-drivers-x86-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver updates from Hans de Goede: "Highlights: - Move all the Intel drivers into their own subdir(s) (mostly Kate's work) - New meraki-mx100 platform driver - Asus WMI driver enhancements, including support for /sys/firmware/acpi/platform_profile - New BIOS SAR driver for Intel M.2 WWAM modems - Alder Lake support for the Intel PMC driver - A whole bunch of cleanups + fixes all over the place" * tag 'platform-drivers-x86-v5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (65 commits) platform/x86: dell-smbios-wmi: Add missing kfree in error-exit from run_smbios_call platform/x86: dell-smbios-wmi: Avoid false-positive memcpy() warning platform/x86: ISST: use semi-colons instead of commas platform/x86: asus-wmi: Fix "unsigned 'retval' is never less than zero" smatch warning platform/x86: asus-wmi: Delete impossible condition platform/x86: hp_accel: Convert to be a platform driver platform/x86: hp_accel: Remove _INI method call platform/mellanox: mlxbf-pmc: fix kernel-doc notation platform/x86/intel: pmc/core: Add GBE Package C10 fix for Alder Lake PCH platform/x86/intel: pmc/core: Add Alder Lake low power mode support for pmc core platform/x86/intel: pmc/core: Add Latency Tolerance Reporting (LTR) support to Alder Lake platform/x86/intel: pmc/core: Add Alderlake support to pmc core driver platform/x86: intel-wmi-thunderbolt: Move to intel sub-directory platform/x86: intel-wmi-sbl-fw-update: Move to intel sub-directory platform/x86: intel-vbtn: Move to intel sub-directory platform/x86: intel_oaktrail: Move to intel sub-directory platform/x86: intel_int0002_vgpio: Move to intel sub-directory platform/x86: intel-hid: Move to intel sub-directory platform/x86: intel_atomisp2: Move to intel sub-directory platform/x86: intel_speed_select_if: Move to intel sub-directory ...	2021-09-02 13:49:39 -07:00
Xiubo Li	d095559ce4	ceph: flush mdlog before umounting Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2021-09-02 22:49:16 +02:00
Linus Torvalds	89b6b8cd92	VFIO update for v5.15-rc1 - Fix dma-valid return WAITED implementation (Anthony Yznaga) - SPDX license cleanups (Cai Huoqing) - Split vfio-pci-core from vfio-pci and enhance PCI driver matching to support future vendor provided vfio-pci variants (Yishai Hadas, Max Gurtovoy, Jason Gunthorpe) - Replace duplicated reflck with core support for managing first open, last close, and device sets (Jason Gunthorpe, Max Gurtovoy, Yishai Hadas) - Fix non-modular mdev support and don't nag about request callback support (Christoph Hellwig) - Add semaphore to protect instruction intercept handler and replace open-coded locks in vfio-ap driver (Tony Krowiak) - Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe) -----BEGIN PGP SIGNATURE----- iQJPBAABCAA5FiEEQvbATlQL0amee4qQI5ubbjuwiyIFAmEvwWkbHGFsZXgud2ls bGlhbXNvbkByZWRoYXQuY29tAAoJECObm247sIsi+1UP/3CRizghroINVYR+cJ99 Tjz7lB/wlzxmRfX+SL4NAVe1SSB2VeCgU4B0PF6kywELLS8OhCO3HXYXVsz244fW Gk5UIns86+TFTrfCOMpwYBV0P86zuaa1ZnvCnkhMK1i2pTZ+oX8hUH1Yj5clHuU+ YgC7JfEuTIAX73q2bC/llLvNE9ke1QCoDX3+HAH87ttqutnRWcnnq56PTEqwe+EW eMA+glB1UG6JAqXxoJET4155arNOny1/ZMprfBr3YXZTiXDF/lSzuMyUtbp526Sf hsvlnqtE6TCdfKbog0Lxckl+8E9NCq8jzFBKiZhbccrQv3vVaoP6dOsPWcT35Kp1 IjzMLiHIbl4wXOL+Xap/biz3LCM5BMdT/OhW5LUC007zggK71ndRvb9F8ptW83Bv 0Uh9DNv7YIQ0su3JHZEsJ3qPFXQXceP199UiADOGSeV8U1Qig3YKsHUDMuALfFvN t+NleeJ4qCWao+W4VCfyDfKurVnMj/cThXiDEWEeq5gMOO+6YKBIFWJVKFxUYDbf MgGdg0nQTUECuXKXxLD4c1HAWH9xi207OnLvhW1Icywp20MsYqOWt0vhg+PRdMBT DK6STxP18aQxCaOuQN9Vf81LjhXNTeg+xt3mMyViOZPcKfX6/wAC9qLt4MucJDdw FBfOz2UL2F56dhAYT+1vHoUM =nzK7 -----END PGP SIGNATURE----- Merge tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio Pull VFIO updates from Alex Williamson: - Fix dma-valid return WAITED implementation (Anthony Yznaga) - SPDX license cleanups (Cai Huoqing) - Split vfio-pci-core from vfio-pci and enhance PCI driver matching to support future vendor provided vfio-pci variants (Yishai Hadas, Max Gurtovoy, Jason Gunthorpe) - Replace duplicated reflck with core support for managing first open, last close, and device sets (Jason Gunthorpe, Max Gurtovoy, Yishai Hadas) - Fix non-modular mdev support and don't nag about request callback support (Christoph Hellwig) - Add semaphore to protect instruction intercept handler and replace open-coded locks in vfio-ap driver (Tony Krowiak) - Convert vfio-ap to vfio_register_group_dev() API (Jason Gunthorpe) * tag 'vfio-v5.15-rc1' of git://github.com/awilliam/linux-vfio: (37 commits) vfio/pci: Introduce vfio_pci_core.ko vfio: Use kconfig if XX/endif blocks instead of repeating 'depends on' vfio: Use select for eventfd PCI / VFIO: Add 'override_only' support for VFIO PCI sub system PCI: Add 'override_only' field to struct pci_device_id vfio/pci: Move module parameters to vfio_pci.c vfio/pci: Move igd initialization to vfio_pci.c vfio/pci: Split the pci_driver code out of vfio_pci_core.c vfio/pci: Include vfio header in vfio_pci_core.h vfio/pci: Rename ops functions to fit core namings vfio/pci: Rename vfio_pci_device to vfio_pci_core_device vfio/pci: Rename vfio_pci_private.h to vfio_pci_core.h vfio/pci: Rename vfio_pci.c to vfio_pci_core.c vfio/ap_ops: Convert to use vfio_register_group_dev() s390/vfio-ap: replace open coded locks for VFIO_GROUP_NOTIFY_SET_KVM notification s390/vfio-ap: r/w lock for PQAP interception handler function pointer vfio/type1: Fix vfio_find_dma_valid return vfio-pci/zdev: Remove repeated verbose license text vfio: platform: reset: Convert to SPDX identifier vfio: Remove struct vfio_device_ops open/release ...	2021-09-02 13:41:33 -07:00
Mike Galbraith	15eb7c888e	locking/rwsem: Add missing __init_rwsem() for PREEMPT_RT `730633f0b7` became the first direct caller of __init_rwsem() vs the usual init_rwsem(), exposing PREEMPT_RT's lack thereof. Add it. [ tglx: Move it out of line ] Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/50a936b7d8f12277d6ec7ed2ef0421a381056909.camel@gmx.de	2021-09-02 22:07:17 +02:00
Bjorn Helgaas	6e129176c3	Merge branch 'remotes/lorenzo/pci/endpoint' - Add max-virtual-functions to endpoint binding (Kishon Vijay Abraham I) - Add pci_epf_add_vepf() API to add virtual function to endpoint (Kishon Vijay Abraham I) - Add pci_epf_vepf_link() to link virtual function to endpoint physical function (Kishon Vijay Abraham I) - Add virtual function number to pci_epc_ops endpoint ops interfaces (Kishon Vijay Abraham I) - Simplify register base address computation for endpoint BAR configuration (Kishon Vijay Abraham I) - Add support to configure virtual functions in cadence endpoint driver (Kishon Vijay Abraham I) - Add SR-IOV configuration to endpoint test driver (Kishon Vijay Abraham I) - Document configfs usage to create virtual functions for endpoints (Kishon Vijay Abraham I) * remotes/lorenzo/pci/endpoint: Documentation: PCI: endpoint/pci-endpoint-cfs: Guide to use SR-IOV misc: pci_endpoint_test: Populate sriov_configure ops to configure SR-IOV device PCI: cadence: Add support to configure virtual functions PCI: cadence: Simplify code to get register base address for configuring BAR PCI: endpoint: Add virtual function number in pci_epc ops PCI: endpoint: Add support to link a physical function to a virtual function PCI: endpoint: Add support to add virtual function in endpoint core dt-bindings: PCI: pci-ep: Add binding to specify virtual function	2021-09-02 14:56:51 -05:00
Bjorn Helgaas	a1e4ca8eb9	Merge branch 'remotes/lorenzo/pci/hyper-v' - Add domain_nr in struct pci_host_bridge (Boqun Feng) - Use host bridge MSI domain for root buses if present (Boqun Feng) - Allow ARM64 virtual host bridge with no ACPI companion (e.g., Hyper-V) (Boqun Feng) - Make Hyper-V enumeration more generic (Arnd Bergmann) - Set Hyper-V domain_nr at probe-time (Boqun Feng) - Set up Hyper-V MSI domain at bridge probe-time (Boqun Feng) - Enable Hyper-V bridge probing on ARM64 (Boqun Feng) * remotes/lorenzo/pci/hyper-v: PCI: hv: Turn on the host bridge probing on ARM64 PCI: hv: Set up MSI domain at bridge probing time PCI: hv: Set ->domain_nr of pci_host_bridge at probing time PCI: hv: Generify PCI probing arm64: PCI: Support root bridge preparation for Hyper-V arm64: PCI: Restructure pcibios_root_bridge_prepare() PCI: Support populating MSI domains of root buses via bridges PCI: Introduce domain_nr in pci_host_bridge	2021-09-02 14:56:47 -05:00
Bjorn Helgaas	739c4747a2	Merge branch 'pci/misc' - Add pci_numachip_init() declaration (Krzysztof Wilczyński) - Allocate pci_dev_str_match_path() string atomically (Dan Carpenter) - Drop error message when Precision Time Measurement supported but not enabled (Jakub Kicinski) - Correct the pci_iomap.h header guard #endif comment (Jonathan Cameron) - Add schedule point in proc_bus_pci_read() (Krzysztof Wilczyński) - Make saved capability state private to core (Bjorn Helgaas) - Sync __pci_register_driver() stub for CONFIG_PCI=n (Andy Shevchenko) - Convert sta2x11 from PCI-DMA-API to generic DMA-API (Christophe JAILLET) * pci/misc: x86/PCI: sta2x11: switch from 'pci_' to 'dma_' API PCI: Sync __pci_register_driver() stub for CONFIG_PCI=n PCI: Make saved capability state private to core PCI: Add schedule point in proc_bus_pci_read() PCI: Correct the pci_iomap.h header guard #endif comment PCI/PTM: Remove error message at boot PCI: Fix pci_dev_str_match_path() alloc while atomic bug x86/PCI: Add pci_numachip_init() declaration # Conflicts: # include/linux/pci.h	2021-09-02 14:56:44 -05:00
Bjorn Helgaas	74797618e2	Merge branch 'pci/vpd' - Check Resource Item Names against those defined for type (Bjorn Helgaas) - Treat initial 0xff as missing EEPROM (Heiner Kallweit) - Reject resource tags with invalid size (Bjorn Helgaas) - Don't check Large Resource Item Names for validity (Bjorn Helgaas) - Allow access to valid parts of VPD if some is invalid (Bjorn Helgaas) - Remove pci_vpd_size() old_size argument (Heiner Kallweit) - Make pci_vpd_wait() uninterruptible (Heiner Kallweit) - Remove struct pci_vpd.flag (Heiner Kallweit) - Remove struct pci_vpd_ops (Heiner Kallweit) - Remove struct pci_vpd.valid member (Heiner Kallweit) - Embed struct pci_vpd in struct pci_dev (Heiner Kallweit) - Determine VPD size in pci_vpd_init() (Heiner Kallweit) - Treat invalid VPD like missing VPD capability (Heiner Kallweit) - Add pci_vpd_alloc() to allocate buffer and read VPD into it (Heiner Kallweit) - Add pci_vpd_find_ro_info_keyword() (Heiner Kallweit) - Add pci_vpd_check_csum() (Heiner Kallweit) - Add pci_vpd_find_id_string() (Heiner Kallweit) - Read VPD with pci_vpd_alloc() (bnx2x, bnxt, sfc, sfc falcon, tg3 drivers) (Heiner Kallweit) - Search VPD with pci_vpd_find_ro_info_keyword() (bnx2, bnx2x, bnxt, cxgb4, cxlflash SCSI, sfc, sfc falcon, tg3 drivers) (Heiner Kallweit) - Search VPD with pci_vpd_find_id_string() (cxgb4 driver) (Heiner Kallweit) - Validate VPD checksum with pci_vpd_check_csum() (cxgb4, tg3 drivers) (Heiner Kallweit) - Replace open-coded byte swapping with swab32s() in bnx2 (Heiner Kallweit) - Remove unused vpd_param member ec (Heiner Kallweit) - Stop exporting pci_vpd_find_tag(), pci_vpd_find_info_keyword() (Heiner Kallweit) - Move several VPD defines and inlines to internal PCI core (Heiner Kallweit) * pci/vpd: PCI/VPD: Use unaligned access helpers PCI/VPD: Clean up public VPD defines and inline functions cxgb4: Use pci_vpd_find_id_string() to find VPD ID string PCI/VPD: Add pci_vpd_find_id_string() PCI/VPD: Include post-processing in pci_vpd_find_tag() PCI/VPD: Stop exporting pci_vpd_find_info_keyword() PCI/VPD: Stop exporting pci_vpd_find_tag() scsi: cxlflash: Search VPD with pci_vpd_find_ro_info_keyword() cxgb4: Search VPD with pci_vpd_find_ro_info_keyword() cxgb4: Remove unused vpd_param member ec cxgb4: Validate VPD checksum with pci_vpd_check_csum() bnxt: Search VPD with pci_vpd_find_ro_info_keyword() bnxt: Read VPD with pci_vpd_alloc() bnx2x: Search VPD with pci_vpd_find_ro_info_keyword() bnx2x: Read VPD with pci_vpd_alloc() bnx2: Replace open-coded byte swapping with swab32s() bnx2: Search VPD with pci_vpd_find_ro_info_keyword() sfc: falcon: Search VPD with pci_vpd_find_ro_info_keyword() sfc: falcon: Read VPD with pci_vpd_alloc() tg3: Search VPD with pci_vpd_find_ro_info_keyword() tg3: Validate VPD checksum with pci_vpd_check_csum() tg3: Read VPD with pci_vpd_alloc() sfc: Search VPD with pci_vpd_find_ro_info_keyword() sfc: Read VPD with pci_vpd_alloc() PCI/VPD: Add pci_vpd_check_csum() PCI/VPD: Add pci_vpd_find_ro_info_keyword() PCI/VPD: Add pci_vpd_alloc() PCI/VPD: Treat invalid VPD like missing VPD capability PCI/VPD: Determine VPD size in pci_vpd_init() PCI/VPD: Embed struct pci_vpd in struct pci_dev PCI/VPD: Remove struct pci_vpd.valid member PCI/VPD: Remove struct pci_vpd_ops PCI/VPD: Reorder pci_read_vpd(), pci_write_vpd() PCI/VPD: Remove struct pci_vpd.flag PCI/VPD: Make pci_vpd_wait() uninterruptible PCI/VPD: Remove pci_vpd_size() old_size argument PCI/VPD: Allow access to valid parts of VPD if some is invalid PCI/VPD: Don't check Large Resource Item Names for validity PCI/VPD: Reject resource tags with invalid size PCI/VPD: Treat initial 0xff as missing EEPROM PCI/VPD: Check Resource Item Names against those valid for type PCI/VPD: Correct diagnostic for VPD read failure	2021-09-02 14:56:44 -05:00
Bjorn Helgaas	1295d187ab	Merge branch 'pci/virtualization' - Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms (Wasim Khan) - Add ACS quirks for Cavium multi-function devices (George Cherian) - Enforce pci=noats with Transaction Blocking (Alex Williamson) * pci/virtualization: PCI/ACS: Enforce pci=noats with Transaction Blocking PCI: Add ACS quirks for Cavium multi-function devices PCI: Add ACS quirks for NXP LX2xx0 and LX2xx2 platforms	2021-09-02 14:56:43 -05:00
Bjorn Helgaas	9045f63e67	Merge branch 'pci/resource' - Refactor pci_ioremap_bar() and pci_ioremap_wc_bar() (Krzysztof Wilczyński) - Optimize pci_resource_len() to reduce kernel size (Zhen Lei) * pci/resource: PCI: Optimize pci_resource_len() to reduce kernel size PCI: Refactor pci_ioremap_bar() and pci_ioremap_wc_bar()	2021-09-02 14:56:43 -05:00

1 2 3 4 5 ...

80168 commits