Commit c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.
Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.
Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.
To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.
Fixes: c0d0381ade ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.
Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.
The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.
To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.
[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no compact_defer_limit. It should be compact_defer_shift in
use. and add compact_order_failed explanation.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Proactive compaction uses per-node/zone "fragmentation score" which is
always in range [0, 100], so use unsigned type of these scores as well as
for related constants.
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. Linux kernel currently does on-demand compaction as
we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system. Such data suggests that a more proactive compaction can
help us allocate a large fraction of memory as hugepages keeping
allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x64_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As t he benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. kcompactd threads consume 100% of one of
the CPUs while it tries to bring a node's score within thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred maximum number of times
with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch implements workingset detection for anonymous LRU. All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache. This patch implements an infrastructure to store the shadow
entry in the swapcache.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In current implementation, newly created or swap-in anonymous page is
started on active list. Growing active list results in rebalancing
active/inactive list so old pages on active list are demoted to inactive
list. Hence, the page on active list isn't protected at all.
Following is an example of this situation.
Assume that 50 hot pages on active list. Numbers denote the number of
pages on active/inactive list (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue. Like as file LRU, newly created or
swap-in anonymous pages will be inserted to the inactive list. They are
promoted to active list if enough reference happens. This simple
modification changes the above example as following.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on active list would be protected.
Note that, this implementation has a drawback that the page cannot be
promoted and will be swapped-out if re-access interval is greater than the
size of inactive list but less than the size of total(active+inactive).
To solve this potential issue, following patch will apply workingset
detection similar to the one that's already applied to file LRU.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements. But we ignore the mempolicy of MPOL_BIND
case. If someone mmap hugetlb succeeds, but the subsequent memory
allocation may fail due to mempolicy restrictions and receives the SIGBUS
signal. This can be reproduced by the follow steps.
1) Compile the test case.
cd tools/testing/selftests/vm/
gcc map_hugetlb.c -o map_hugetlb
2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
system. Each node will pre-allocate one huge page.
echo 2 > /proc/sys/vm/nr_hugepages
3) Run test case(mmap 4MB). We receive the SIGBUS signal.
numactl --membind=3D0 ./map_hugetlb 4
With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".
[akpm@linux-foundation.org: include sched.h for `current']
Reported-by: Jianchao Guo <guojianchao@bytedance.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Baoquan He <bhe@redhat.com>
Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.
A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. Byte-sized vmstat infra
created for slabs can be perfectly reused for percpu case.
[guro@fb.com: v3]
Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Untangle the header spaghetti which causes build failures in various
situations caused by the lockdep additions to seqcount to validate that
the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict per
CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored and
write_seqcount_begin() has a lockdep assertion to validate that the
lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API is
unchanged and determines the type at compile time with the help of
_Generic which is possible now that the minimal GCC version has been
moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs which
have been addressed already independent of this.
While generaly useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if the
writers are serialized by an associated lock, which leads to the well
known reader preempts writer livelock. RT prevents this by storing the
associated lock pointer independent of lockdep in the seqcount and
changing the reader side to block on the lock when a reader detects
that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and initializers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
81ppn2KlvzUK8g==
=7Gj+
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Thomas Gleixner:
"A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in
various situations caused by the lockdep additions to seqcount to
validate that the write side critical sections are non-preemptible.
- The seqcount associated lock debug addons which were blocked by the
above fallout.
seqcount writers contrary to seqlock writers must be externally
serialized, which usually happens via locking - except for strict
per CPU seqcounts. As the lock is not part of the seqcount, lockdep
cannot validate that the lock is held.
This new debug mechanism adds the concept of associated locks.
sequence count has now lock type variants and corresponding
initializers which take a pointer to the associated lock used for
writer serialization. If lockdep is enabled the pointer is stored
and write_seqcount_begin() has a lockdep assertion to validate that
the lock is held.
Aside of the type and the initializer no other code changes are
required at the seqcount usage sites. The rest of the seqcount API
is unchanged and determines the type at compile time with the help
of _Generic which is possible now that the minimal GCC version has
been moved up.
Adding this lockdep coverage unearthed a handful of seqcount bugs
which have been addressed already independent of this.
While generally useful this comes with a Trojan Horse twist: On RT
kernels the write side critical section can become preemtible if
the writers are serialized by an associated lock, which leads to
the well known reader preempts writer livelock. RT prevents this by
storing the associated lock pointer independent of lockdep in the
seqcount and changing the reader side to block on the lock when a
reader detects that a writer is in the write side critical section.
- Conversion of seqcount usage sites to associated types and
initializers"
* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
locking/seqlock, headers: Untangle the spaghetti monster
locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
x86/headers: Remove APIC headers from <asm/smp.h>
seqcount: More consistent seqprop names
seqcount: Compress SEQCNT_LOCKNAME_ZERO()
seqlock: Fold seqcount_LOCKNAME_init() definition
seqlock: Fold seqcount_LOCKNAME_t definition
seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
hrtimer: Use sequence counter with associated raw spinlock
kvm/eventfd: Use sequence counter with associated spinlock
userfaultfd: Use sequence counter with associated spinlock
NFSv4: Use sequence counter with associated spinlock
iocost: Use sequence counter with associated spinlock
raid5: Use sequence counter with associated spinlock
vfs: Use sequence counter with associated spinlock
timekeeping: Use sequence counter with associated raw spinlock
xfrm: policy: Use sequence counters with associated lock
netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
netfilter: conntrack: Use sequence counter with associated spinlock
sched: tasks: Use sequence counter with associated spinlock
...
- Make sure transactions won't be started recursively in gfs2_block_zero_range.
(Bug introduced in 5.4 when switching to iomap_zero_range.)
- Fix a glock holder refcount leak introduced in the iopen glock locking
scheme rework merged in 5.8.
- A few other small improvements (debugging, stack usage, comment fixes).
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl8xkmAUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToB0w/9FANsxraTsNxxZUQuP2uyPiB/TRxY
Vutp63Pe+eE0GNxgpfLWT+QgO99BQlGZ8WmDJ49ex0dZXZY4lv4L7JoGTV5VwAyh
CFqRcWQkAb7zVvOxAeKfyXT0CwhS7O0avvlubSpEHfQgPkGQa3q3vu18vg1jHfB3
OsFjZGCoyiJKmNFX005euhebnStabaGeP0+8uSz/5xHS+NQUbB3jPlgycUMvTVBe
Lhl052rVQzlULNUCkehKnaBxzN0/8K55iFOaOSd2kzZ5BMfRabvskZEyJFf4T75p
VJyElekk5mGV0k/FpGL03Me1oqMddDLpAFCGh9Hw51o02i8wZ3RderfQHfUmojku
5/lLEcEJV+oC7gJR7IsGRme71De9y+uLnKvod+ayBw+9us1ZbEm4zJY7hLLGyq03
piuFEAo0UGUmxGn8/s1RBT0lMYKjEDGIGjockaz/XzEQMSip27JZq4ATnA5CHaSy
8q4PFflxEaXWJtPEjiY7DW1xQhYW/3cEDfd7kjSBw/GUxk1ILXRMnzCOsjrYuWfH
ykH0gzVq7/uJln/otYa39RBdt/GQXJCzhvODeizF6B36seVX9oBBEc6TtohxREwt
aIpupz6iGQ6GXddg1Sq6p2w3xyW8e8bG4YislEyiCyxpdOjvcYdM6cVxF7UBgtgu
fwEbjgGO34pAO1E=
=+Nig
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Make sure transactions won't be started recursively in
gfs2_block_zero_range (bug introduced in 5.4 when switching to
iomap_zero_range)
- Fix a glock holder refcount leak introduced in the iopen glock
locking scheme rework merged in 5.8.
- A few other small improvements (debugging, stack usage, comment
fixes).
* tag 'gfs2-for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: When gfs2_dirty_inode gets a glock error, dump the glock
gfs2: Never call gfs2_block_zero_range with an open transaction
gfs2: print details on transactions that aren't properly ended
gfs2: Fix inaccurate comment
fs: Fix typo in comment
gfs2: Fix refcount leak in gfs2_glock_poke
gfs2: Pass glock holder to gfs2_file_direct_{read,write}
gfs2: Add some flags missing from glock output
Pull input updates from Dmitry Torokhov:
- an update to Elan touchpad controller driver supporting newer ICs
with enhanced precision reports and a new firmware update process
- an update to EXC3000 touch controller supporting additional parts
- assorted driver fixups
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (27 commits)
Input: exc3000 - add support to query model and fw_version
Input: exc3000 - add reset gpio support
Input: exc3000 - add EXC80H60 and EXC80H84 support
dt-bindings: touchscreen: Convert EETI EXC3000 touchscreen to json-schema
Input: sentelic - fix error return when fsp_reg_write fails
Input: alps - remove redundant assignment to variable ret
Input: ims-pcu - return error code rather than -ENOMEM
Input: elan_i2c - add ic type 0x15
Input: atmel_mxt_ts - only read messages in mxt_acquire_irq() when necessary
Input: uinput - fix typo in function name documentation
Input: ati_remote2 - add missing newlines when printing module parameters
Input: psmouse - add a newline when printing 'proto' by sysfs
Input: synaptics-rmi4 - drop a duplicated word
Input: elan_i2c - add support for high resolution reports
Input: elan_i2c - do not constantly re-query pattern ID
Input: elan_i2c - add firmware update info for ICs 0x11, 0x13, 0x14
Input: elan_i2c - handle firmware updated on newer ICs
Input: elan_i2c - add support for different firmware page sizes
Input: elan_i2c - fix detecting IAP version on older controllers
Input: elan_i2c - handle devices with patterns above 1
...
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8oBt0ACgkQM2qzM29m
f5dTFQ/9H72E6gr1onsia0/Py0CO8F9qzLgmUBl1vVYAh2/vPqUL1ypxrC5OYrAy
TOqESTsJvmGluCFc/77XUTD7NvJY3znIWim49okwDiyee4Y14ZfRhhCxyyA6Z94E
FjJQb5TbF1Mti4X3dN8Gn7O1Y/BfTjDAAXnXGlTA1xoLcxM5idWIj+G8x0bPmeDb
2fTbgsoETu6MpS2/L6mraXVh3d5ESOJH+73YvpBl0AhYPzlNASJZMLtHtd+A/JbO
IPkMP/7UA5DuJtWGeuQ4I4D5bQNpNWMfN6zhwtih4IV5bkRC7vyAOLG1R7w9+Ufq
58cxPiorMcsg1cHnXG0Z6WVtbMEdWTP/FzmJdE5RC7DEJhmmSUG/R0OmgDcsDZET
GovPARho01yp80GwTjCIctDHRRFRL4pdPfr8PjVHetSnx9+zoRUT+D70Zeg/KSy2
99gmCxqSY9BZeHoiVPEX/HbhXrkuDjUSshwl98OAzOFmv6kbwtLntgFbWlBdE6dB
mqOxBb73zEoZ5P9GA2l2ShU3GbzMzDebHBb9EyomXHZrLejoXeUNA28VJ+8vPP5S
IVHnEwOkdJrNe/7cH4jd/B0NR6f8Da/F9kmkLiG2GNPMqQ8bnVhxTUtZkcAE+fd4
f34qLxsoht70wSSfISjBs7hP5KxEM1lOAf0w0RpycPUKJNV1FB0=
=OEpF
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6
Pull NFS server updates from Chuck Lever:
"Highlights:
- Support for user extended attributes on NFS (RFC 8276)
- Further reduce unnecessary NFSv4 delegation recalls
Notable fixes:
- Fix recent krb5p regression
- Address a few resource leaks and a rare NULL dereference
Other:
- De-duplicate RPC/RDMA error handling and other utility functions
- Replace storage and display of kernel memory addresses by tracepoints"
* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
svcrdma: CM event handler clean up
svcrdma: Remove transport reference counting
svcrdma: Fix another Receive buffer leak
SUNRPC: Refresh the show_rqstp_flags() macro
nfsd: netns.h: delete a duplicated word
SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
nfsd: avoid a NULL dereference in __cld_pipe_upcall()
nfsd4: a client's own opens needn't prevent delegations
nfsd: Use seq_putc() in two functions
svcrdma: Display chunk completion ID when posting a rw_ctxt
svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
svcrdma: Introduce Send completion IDs
svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
svcrdma: Introduce Receive completion IDs
svcrdma: Introduce infrastructure to support completion IDs
svcrdma: Add common XDR encoders for RDMA and Read segments
svcrdma: Add common XDR decoders for RDMA and Read segments
SUNRPC: Add helpers for decoding list discriminators symbolically
svcrdma: Remove declarations for functions long removed
svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
...
* Spelling
* http to https updates
NAND core changes:
* Drop useless 'depends on' in Kconfig
* Add an extra level in the Kconfig hierarchy
* Trivial spellings
* Dynamic allocation of the interface configurations
* Dropping the default ONFI timing mode
* Various cleanup (types, structures, naming, comments)
* Hide the chip->data_interface indirection
* Add the generic rb-gpios property
* Add the ->choose_interface_config() hook
* Introduce nand_choose_best_sdr_timings()
* Use default values for tPROG_max and tBERS_max
* Avoid redefining tR_max and tCCS_min
* Add a helper to find the closest ONFI mode
* bcm63xx MTD parsers: simplify CFE detection
Raw NAND controller drivers changes:
* fsl-upm: Deprecation of specific DT properties
* fsl_upm: Driver rework and cleanup in favor of ->exec_op()
* Ingenic: Cleanup ARRAY_SIZE() vs sizeof() use
* brcmnand: ECC error handling on EDU transfers
* brcmnand: Don't default to EDU transfers
* qcom: Set BAM mode only if not set already
* qcom: Avoid write to unavailable register
* gpio: Driver rework in favor of ->exec_op()
* tango: ->exec_op() conversion
* mtk: ->exec_op() conversion
Raw NAND chip drivers changes:
* toshiba: Implement ->choose_interface_config() for TH58NVG2S3HBAI4
* toshiba: Implement ->choose_interface_config() for TC58NVG0S3E
* toshiba: Implement ->choose_interface_config() for TC58TEG5DCLTA00
* hynix: Implement ->choose_interface_config() for H27UCG8T2ATR-BC
SPI NOR core changes:
* Disable Quad Mode in spi_nor_restore().
* Don't abort BFPT parsing when QER reserved value is used.
* Add support/update capabilities for few flashes.
* Drop s70fl01gs flash: it does not support RDSR(05h) which
is critical for erase/write.
* Merge the SPIMEM DTR bits in spi-nor/next to avoid conflicts
during the release cycle.
SPI NOR controller drivers changes:
* Move the cadence-quadspi driver to spi-mem. The series was
taken through the SPI tree. Merge it also in spi-nor/next
to avoid conflicts during the release cycle.
* intel-spi:
- Add new PCI IDs.
- Ignore the Write Disable command, the controller doesn't
support it.
- Fix performance regression.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEE9HuaYnbmDhq/XIDIJWrqGEe9VoQFAl8vJtMACgkQJWrqGEe9
VoRdGAf/Y5m5BwmLilkEYpffyxi7dVR6XOKPLU5EJXkS3dPvH9398zchbHOdedCZ
OzJIfh6Iv+qbkgS2g0lAAT+SAfOfG9plubvSdkjrHXl4eZXRnR/49RF5LAEju7sz
Uw1HdRcawyEi5uI9yYS0tCeVMIUJq+5x7VibH+82yOIdSPc60c7FDc5ih/nVKj/a
Pn9LOzGzkdndcE1b3FcF2Uk/T1YOJx3Yt5ngALlPpJxaDZmQSHtYPuuz8DfUbamf
uj3CkpqYRyT18CzuFvtuba6LyF+donXNJgvl6ivW7dlRSPzSMnDQu7J5bpNhUfcd
p/ZdzX1Jxle4theDm0J9ALsSSM5g2w==
=RiY8
-----END PGP SIGNATURE-----
Merge tag 'mtd/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd updates from Miquel Raynal:
"MTD core changes:
- Spelling
- http to https updates
NAND core changes:
- Drop useless 'depends on' in Kconfig
- Add an extra level in the Kconfig hierarchy
- Trivial spellings
- Dynamic allocation of the interface configurations
- Dropping the default ONFI timing mode
- Various cleanup (types, structures, naming, comments)
- Hide the chip->data_interface indirection
- Add the generic rb-gpios property
- Add the ->choose_interface_config() hook
- Introduce nand_choose_best_sdr_timings()
- Use default values for tPROG_max and tBERS_max
- Avoid redefining tR_max and tCCS_min
- Add a helper to find the closest ONFI mode
- bcm63xx MTD parsers: simplify CFE detection
Raw NAND controller drivers changes:
- fsl-upm: Deprecation of specific DT properties
- fsl_upm: Driver rework and cleanup in favor of ->exec_op()
- Ingenic: Cleanup ARRAY_SIZE() vs sizeof() use
- brcmnand: ECC error handling on EDU transfers
- brcmnand: Don't default to EDU transfers
- qcom: Set BAM mode only if not set already
- qcom: Avoid write to unavailable register
- gpio: Driver rework in favor of ->exec_op()
- tango: ->exec_op() conversion
- mtk: ->exec_op() conversion
Raw NAND chip drivers changes:
- toshiba: Implement ->choose_interface_config() for TH58NVG2S3HBAI4,
TC58NVG0S3E, and TC58TEG5DCLTA00
- hynix: Implement ->choose_interface_config() for H27UCG8T2ATR-BC
SPI NOR core changes:
- Disable Quad Mode in spi_nor_restore().
- Don't abort BFPT parsing when QER reserved value is used.
- Add support/update capabilities for few flashes.
- Drop s70fl01gs flash: it does not support RDSR(05h) which is
critical for erase/write.
- Merge the SPIMEM DTR bits in spi-nor/next to avoid conflicts during
the release cycle.
SPI NOR controller drivers changes:
- Move the cadence-quadspi driver to spi-mem. The series was taken
through the SPI tree. Merge it also in spi-nor/next to avoid
conflicts during the release cycle.
- intel-spi:
- Add new PCI IDs.
- Ignore the Write Disable command, the controller doesn't support
it.
- Fix performance regression"
* tag 'mtd/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (79 commits)
MTD: pfow.h: drop a duplicated word
MTD: mtd-abi.h: drop a duplicated word
mtd: rawnand: omap_elm: Replace HTTP links with HTTPS ones
mtd: Replace HTTP links with HTTPS ones
mtd: hyperbus: Replace HTTP links with HTTPS ones
mtd: revert "spi-nor: intel: provide a range for poll_timout"
mtd: spi-nor: update read capabilities for w25q64 and s25fl064k
mtd: spi-nor: micron: Add SPI_NOR_DUAL_READ flag on mt25qu02g
mtd: spi-nor: macronix: Add support for mx66u2g45g
mtd: spi-nor: intel-spi: Simulate WRDI command
mtd: spi-nor: Disable the flash quad mode in spi_nor_restore()
mtd: spi-nor: Add capability to disable flash quad mode
mtd: spi-nor: spansion: Remove s70fl01gs from flash_info
mtd: spi-nor: sfdp: do not make invalid quad enable fatal
dt-bindings: mtd: fsl-upm-nand: Deprecate chip-delay and fsl, upm-wait-flags
mtd: rawnand: stm32_fmc2: get resources from parent node
mtd: rawnand: stm32_fmc2: use regmap APIs
memory: stm32-fmc2-ebi: add STM32 FMC2 EBI controller driver
dt-bindings: memory-controller: add STM32 FMC2 EBI controller documentation
dt-bindings: mtd: update STM32 FMC2 NAND controller documentation
...
Pull misc vfs updates from Al Viro:
"No common topic whatsoever in those, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: define inode flags using bit numbers
iov_iter: Move unnecessary inclusion of crypto/hash.h
dlmfs: clean up dlmfs_file_{read,write}() a bit
- The biggest news in that the tracing ring buffer can now time events that
interrupted other ring buffer events. Before this change, if an interrupt
came in while recording another event, and that interrupt also had an
event, those events would all have the same time stamp as the event it
interrupted. Now, with the new design, those events will have a unique time
stamp and rightfully display the time for those events that were recorded
while interrupting another event.
- Bootconfig how has an "override" operator that lets the users have a
default config, but then add options to override the default.
- A fix was made to properly filter function graph tracing to the ftrace
PIDs. This came in at the end of the -rc cycle, and needs to be backported.
- Several clean ups, performance updates, and minor fixes as well.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXy3GOBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qphsAP9ci1jtrC2+cMBMCNKb/AFpA/nDaKsD
hpsDzvD0YPOmCAEA9QbZset8wUNG49R4FexP7egQ8Ad2S6Oa5f60jWleDQY=
=lH+q
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- The biggest news in that the tracing ring buffer can now time events
that interrupted other ring buffer events.
Before this change, if an interrupt came in while recording another
event, and that interrupt also had an event, those events would all
have the same time stamp as the event it interrupted.
Now, with the new design, those events will have a unique time stamp
and rightfully display the time for those events that were recorded
while interrupting another event.
- Bootconfig how has an "override" operator that lets the users have a
default config, but then add options to override the default.
- A fix was made to properly filter function graph tracing to the
ftrace PIDs. This came in at the end of the -rc cycle, and needs to
be backported.
- Several clean ups, performance updates, and minor fixes as well.
* tag 'trace-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (39 commits)
tracing: Add trace_array_init_printk() to initialize instance trace_printk() buffers
kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
tracing: Use trace_sched_process_free() instead of exit() for pid tracing
bootconfig: Fix to find the initargs correctly
Documentation: bootconfig: Add bootconfig override operator
tools/bootconfig: Add testcases for value override operator
lib/bootconfig: Add override operator support
kprobes: Remove show_registers() function prototype
tracing/uprobe: Remove dead code in trace_uprobe_register()
kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler
ftrace: Fix ftrace_trace_task return value
tracepoint: Use __used attribute definitions from compiler_attributes.h
tracepoint: Mark __tracepoint_string's __used
trace : Have tracing buffer info use kvzalloc instead of kzalloc
tracing: Remove outdated comment in stack handling
ftrace: Do not let direct or IPMODIFY ftrace_ops be added to module and set trampolines
ftrace: Setup correct FTRACE_FL_REGS flags for module
tracing/hwlat: Honor the tracing_cpumask
tracing/hwlat: Drop the duplicate assignment in start_kthread()
tracing: Save one trace_event->type by using __TRACE_LAST_TYPE
...
As trace_array_printk() used with not global instances will not add noise to
the main buffer, they are OK to have in the kernel (unlike trace_printk()).
This require the subsystem to create their own tracing instance, and the
trace_array_printk() only writes into those instances.
Add trace_array_init_printk() to initialize the trace_printk() buffers
without printing out the WARNING message.
Reported-by: Sean Paul <sean@poorly.run>
Reviewed-by: Sean Paul <sean@poorly.run>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
framework we just have some minor tweaks and a debugfs feature, so not much to
see there. The driver updates are fairly well split between AT91 and Qualcomm
clk support. Adding those two drivers together equals about 50% of the
diffstat. Otherwise, the big amount of work this time was on supporting
Broadcom's Raspberry Pi firmware clks. See below for some more highlights.
Core:
- Document clk_hw_round_rate() so it gets some more use
- Remove unused __clk_get_flags()
- Add a prepare/enable debugfs feature similar to rate setting
New Drivers:
- Add support for SAMA7G5 SoC clks
- Enable CPU clks on Qualcomm IPQ6018 SoCs
- Enable CPU clks on Qualcomm MSM8996 SoCs
- GPU clk support for Qualcomm SM8150 and SM8250 SoCs
- Audio clks on Qualcomm SC7180 SoCs
- Microchip Sparx5 DPLL clk
- Add support for the new Renesas RZ/G2H (R8A774E1) SoC
Updates:
- Make defines for bcm63xx-gate clks to use in DT
- Support BCM2711 SoC firmware clks
- Add HDMI clks for BCM2711 SoCs
- Add RTC related clks on Ingenic SoCs
- Support USB PHY clks on Ingenic SoCs
- Support gate clks on BCM6318 SoCs
- RMU and DMAC/GPIO clock support for Actions Semi S500 SoCs
- Use poll_timeout functions in Rockchip clk driver
- Support Rockchip rk3288w SoC variant
- Mark mac_lbtest critical on Rockchip rk3188
- Add CAAM clock support for i.MX vf610 driver
- Add MU root clock support for i.MX imx8mp driver
- Amlogic g12: add neural network accelerator clock sources
- Amlogic meson8: remove critical flag for main PLL divider
- Amlogic meson8: add video decoder clock gates
- Convert one more Renesas DT binding to json-schema
- Enhance critical clock handling on Renesas platforms to only consider
clocks that were enabled at boot time
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE9L57QeeUxqYDyoaDrQKIl8bklSUFAl8tspERHHNib3lkQGtl
cm5lbC5vcmcACgkQrQKIl8bklSX0og//TXp134IBXVZ2Z5Wca4J41itv7tPGJFsX
EslQ/eMm6xhGEqqrA6BVsQ228JF9rUmg1fbvdl82UPK7Zyp9P2fo6CMC65ngDTky
B1rLZOaKitER9JdL9DmmnIR772pI//rAMdIeVC/Jn5RR7OpP4lgY6+qvK0pJjVFl
8aK3uHY+UWVFtqZxuJEAAyZBq1+bREqmVwNC1my6kmMIf0j0KwcGhrZgASWWtjQK
4TRmroehLC4FBqnJ78Y3E6UAOBlz6C7XnP38qJge2672Ef7QhXZU7AGrGiTtxc4h
5fB6MNMF+5QGel54qR1eH+JxaEoKsAjLaX1VBr7hAHrGIQ26dBFHFPsPvWDA+VVR
4bwXJKz2objAWyqlMVM/cVV3q6uDixuScdrw2aAiojmV7ZvdZXflacdmZuS5v7e4
sh86llN+lF0YrViYz33z+up0risCAi089xVo7Z99VgyLe2DR8TE+4/d6DQTFRNxl
m66m6mlB9pPPgIV028SAf/zmzyoVWEarwdEAQPjJC6KVRbnR9mFlhSiTQMmCsSvU
zHRxVcInc+8qIJY6VJ552UFu/JAan/AFl5knb+kW21a7hM8p81H9HuSnS4aHDDq4
yETPU3XSMYz8ARHX4lKo1UIB6+k9iF2CWdCP+gXHNP+QYSAsRBIJsonjul9lLUCi
8i6mUGkS0OU=
=QLRg
-----END PGP SIGNATURE-----
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk updates from Stephen Boyd:
"It looks like a smaller batch of clk updates this time around.
In the core framework we just have some minor tweaks and a debugfs
feature, so not much to see there. The driver updates are fairly well
split between AT91 and Qualcomm clk support. Adding those two drivers
together equals about 50% of the diffstat.
Otherwise, the big amount of work this time was on supporting
Broadcom's Raspberry Pi firmware clks.
Highlights:
Core:
- Document clk_hw_round_rate() so it gets some more use
- Remove unused __clk_get_flags()
- Add a prepare/enable debugfs feature similar to rate setting
New Drivers:
- Add support for SAMA7G5 SoC clks
- Enable CPU clks on Qualcomm IPQ6018 SoCs
- Enable CPU clks on Qualcomm MSM8996 SoCs
- GPU clk support for Qualcomm SM8150 and SM8250 SoCs
- Audio clks on Qualcomm SC7180 SoCs
- Microchip Sparx5 DPLL clk
- Add support for the new Renesas RZ/G2H (R8A774E1) SoC
Updates:
- Make defines for bcm63xx-gate clks to use in DT
- Support BCM2711 SoC firmware clks
- Add HDMI clks for BCM2711 SoCs
- Add RTC related clks on Ingenic SoCs
- Support USB PHY clks on Ingenic SoCs
- Support gate clks on BCM6318 SoCs
- RMU and DMAC/GPIO clock support for Actions Semi S500 SoCs
- Use poll_timeout functions in Rockchip clk driver
- Support Rockchip rk3288w SoC variant
- Mark mac_lbtest critical on Rockchip rk3188
- Add CAAM clock support for i.MX vf610 driver
- Add MU root clock support for i.MX imx8mp driver
- Amlogic g12: add neural network accelerator clock sources
- Amlogic meson8: remove critical flag for main PLL divider
- Amlogic meson8: add video decoder clock gates
- Convert one more Renesas DT binding to json-schema
- Enhance critical clock handling on Renesas platforms to only
consider clocks that were enabled at boot time"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (79 commits)
clk: qcom: gcc: Make disp gpll0 branch aon for sc7180/sdm845
ipq806x: gcc: add support for child probe
clk: qcom: msm8996: Make symbol 'cpu_msm8996_clks' static
clk: qcom: ipq8074: Add correct index for PCIe clocks
clk: <linux/clk-provider.h>: drop a duplicated word
clk: renesas: cpg-mssr: Add r8a774e1 support
dt-bindings: clock: renesas,cpg-mssr: Document r8a774e1
clk: Drop duplicate selection in Kconfig
clk: qcom: smd: Add support for MSM8992/4 rpm clocks
clk: qcom: ipq8074: Add missing clocks for pcie
dt-bindings: clock: qcom: ipq8074: Add missing bindings for PCIe
Replace HTTP links with HTTPS ones: Common CLK framework
clk: qcom: Add CPU clock driver for msm8996
dt-bindings: clk: qcom: Add bindings for CPU clock for msm8996
soc: qcom: Separate kryo l2 accessors from PMU driver
clk: meson: meson8b: add the vclk2_en gate clock
clk: meson: meson8b: add the vclk_en gate clock
clk: qcom: Fix return value check in apss_ipq6018_probe()
clk: bcm: dvp: Add missing module informations
clk: meson: meson8b: Drop CLK_IS_CRITICAL from fclk_div2
...
Pull fdpick coredump update from Al Viro:
"Switches fdpic coredumps away from original aout dumping primitives to
the same kind of regset use as regular elf coredumps do"
* 'work.fdpic' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
[elf-fdpic] switch coredump to regsets
[elf-fdpic] use elf_dump_thread_status() for the dumper thread as well
[elf-fdpic] move allocation of elf_thread_status into elf_dump_thread_status()
[elf-fdpic] coredump: don't bother with cyclic list for per-thread objects
kill elf_fpxregs_t
take fdpic-related parts of elf_prstatus out
unexport linux/elfcore.h
- Add adaptive voltage scaling (AVS) support to the brcmstb cpufreq
driver and clean it up (Florian Fainelli, Markus Mayer).
- Add a new Tegra cpufreq driver and clean up the existing one (Jon
Hunter, Sumit Gupta).
- Add bandwidth level support to the Qcom cpufreq driver along with
OPP changes (Sibi Sankar).
- Clean up the sti, cpufreq-dt, ap806, CPPC cpufreq drivers (Viresh
Kumar, Lee Jones, Ivan Kokshaysky, Sven Auhagen, Xin Hao).
- Make schedutil the default governor for ARM (Valentin Schneider).
- Fix dependency issues for the imx cpufreq driver (Walter Lozano).
- Clean up cached_resolved_idx handlihng in the cpufreq core (Viresh
Kumar).
- Fix the intel_pstate driver to use the correct maximum frequency
value when MSR_TURBO_RATIO_LIMIT is 0 (Srinivas Pandruvada).
- Provide kenrneldoc comments for multiple runtime PM helpers and
improve the pm_runtime_get_if_active() kerneldoc (Rafael Wysocki).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl8tlk4SHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxyl8QAIHknuudPrtt1yiZM2dpKCwi1fpdZjWL
GkGNlS4I1AkzMaVnGdsaiJb8ek1aukcl3w3vkj3tMCfPuBLT3P5f5mNagtwPgwdG
ZpNoU+OJUK1zBeVaYH8OL0Vb8dQQjOqk3RUx8MmnHkIRo4EpixxFoEOuo+6eKGZZ
G6KG5u3r+ZrY3nWmoBEVJ8ytM/ovDS0uv/j+uPR5qB1GCGuQJuW4ngA/0CgIBClS
Rk+/r7enmhGPylFp74UJD6S1rVhypLzEAX7JfqfQB3T0918ZTkYFuaQpb7JJqwj7
5rbyZX0xWjVMoypW7JaWDctcywdQ9aslWLHo0rmEdZCKKDDT5bPduXhb+4HAKqFg
j62eCbQzkz7swk1jPDMPuDLFVweAqKEoU2OOram9rGrzevOPvm2t1awNSiLDhmxx
TQL6COs1rbwOPuBT23NUa5jTc7us8xYQh13bI4zKrf1pGxju1s+QwOe7HbnQixuA
TngK62e3fl5qxSaq4yTKITX2y2/SxAIjgqy5FXTC7aU0rob3YVtSjMISZxJmKKN8
vbkkF+ZeGJn7TJycwurq7HgDCbggUopM8upaPj4+BVLZamiHcNEUP4V2/Q2jVQ1s
9/wbMPetycGnM0fd0drtcV0TxVz/cGWAkKfE12lWM8rxUEMqeblVTiOXYaYM2w8V
Dlxm6D4as+gE
=70pc
-----END PGP SIGNATURE-----
Merge tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates plus a cpufreq core
cleanup, an ARM-wide change to make schedutil the default scaling
governor, an intel_pstate driver fix and some runtime PM changes
regarding kerneldoc comments.
Specifics:
- Add adaptive voltage scaling (AVS) support to the brcmstb cpufreq
driver and clean it up (Florian Fainelli, Markus Mayer).
- Add a new Tegra cpufreq driver and clean up the existing one (Jon
Hunter, Sumit Gupta).
- Add bandwidth level support to the Qcom cpufreq driver along with
OPP changes (Sibi Sankar).
- Clean up the sti, cpufreq-dt, ap806, CPPC cpufreq drivers (Viresh
Kumar, Lee Jones, Ivan Kokshaysky, Sven Auhagen, Xin Hao).
- Make schedutil the default governor for ARM (Valentin Schneider).
- Fix dependency issues for the imx cpufreq driver (Walter Lozano).
- Clean up cached_resolved_idx handlihng in the cpufreq core (Viresh
Kumar).
- Fix the intel_pstate driver to use the correct maximum frequency
value when MSR_TURBO_RATIO_LIMIT is 0 (Srinivas Pandruvada).
- Provide kenrneldoc comments for multiple runtime PM helpers and
improve the pm_runtime_get_if_active() kerneldoc (Rafael Wysocki)"
* tag 'pm-5.9-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (22 commits)
cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0
PM: runtime: Improve kerneldoc of pm_runtime_get_if_active()
PM: runtime: Add kerneldoc comments to multiple helpers
cpufreq: make schedutil the default for arm and arm64
cpufreq: cached_resolved_idx can not be negative
cpufreq: Add Tegra194 cpufreq driver
dt-bindings: arm: Add NVIDIA Tegra194 CPU Complex binding
cpufreq: imx: Select NVMEM_IMX_OCOTP
cpufreq: sti-cpufreq: Fix some formatting and misspelling issues
cpufreq: tegra186: Simplify probe return path
cpufreq: CPPC: Reuse caps variable in few routines
cpufreq: ap806: fix cpufreq driver needs ap cpu clk
cpufreq: cppc: Reorder code and remove apply_hisi_workaround variable
cpufreq: dt: fix oops on armada37xx
cpufreq: brcmstb-avs-cpufreq: send S2_ENTER / S2_EXIT commands to AVS
cpufreq: brcmstb-avs-cpufreq: Support polling AVS firmware
cpufreq: brcmstb-avs-cpufreq: more flexible interface for __issue_avs_command()
cpufreq: qcom: Disable fast switch when scaling DDR/L3
cpufreq: qcom: Update the bandwidth levels on frequency change
OPP: Add and export helper to set bandwidth
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAl8tEbwACgkQCF8+vY7k
4RUvgw//bIP9Jisg0wfUtjm34cRIKZ13PfMRYlaMKcz4Q2YVIcOCJN+xJ2kUo5M9
D78q91g0u90OjsYIDJe/P8oKxluwr0RgXkHKsgywA+OiTIvJIEFxuvn7uiNMHFCJ
BgU7inSZ39odgtrSbvqNAzOQgEqjx38I1NZathkRO1fr775Q5ZOhLn0fH1JroMsC
mgfVB76p9R/UjYlYZLHumQPyxkAfN02xXmuXAqiIGBVLi/57eyuhT4qYC8FAbse6
rYhiV3RHcZiiV0aec2HPP0BqRo5D3uIQW6Qv6FcSK8UDGpeBL8tX62Y47rqiPZED
RExDlAHnTZgUSgquXzc+OtOolfWPctFAg6uLVHm7VVnzDOkJLQUrKWj18hEqLu87
8H9BTo+DhDys1MUhJfzItTRsvcjke0SJUWsVhF8CWAI0PpnBH/inwLQhq2jcpsJb
x2VxjnXJ1iyLw6zD4tJpMWhQwbpHSVjQ2cGK7w0ppMqytWxK9KJCROCrJup0T3EY
lidp0gzymGZQxtpaUFksSAlsdjKUglrh3d1xCryZt5uepvMIFiPp+Vn6Ixmq+S8U
UKBhW15EhVbAQdJLw5O0vsxQ/4AVSIbGXUpV6BA9mnMw9ZbDL62qB6iumCykByZC
UyEvm/+MLLg1Y9JhyHmJzi6VI9gGZG9ayTpswhUlG4EGQDZ+U0w=
=SYqr
-----END PGP SIGNATURE-----
Merge tag 'media/v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
- Legacy soc_camera driver was removed from staging
- New I2C sensor related drivers: dw9768, ch7322, max9271, rdacm20
- TI vpe driver code was re-organized and had new features added
- Added Xilinx MIPI CSI-2 Rx Subsystem driver
- Added support for Infrared Toy and IR Droid devices
- Lots of random driver fixes, new features and cleanups
* tag 'media/v5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (318 commits)
media: camss: fix memory leaks on error handling paths in probe
media: davinci: vpif_capture: fix potential double free
media: radio: remove redundant assignment to variable retval
media: allegro: fix potential null dereference on header
media: mtk-mdp: Fix a refcounting bug on error in init
media: allegro: fix an error pointer vs NULL check
media: meye: fix missing pm_mchip_mode field
media: cafe-driver: use generic power management
media: saa7164: use generic power management
media: v4l2-dev/ioctl: Fix document for VIDIOC_QUERYCAP
media: v4l2: Correct kernel-doc inconsistency
media: v4l2: Correct kernel-doc inconsistency
media: dvbdev.h: keep * together with the type
media: v4l2-subdev.h: keep * together with the type
media: videobuf2: Print videobuf2 buffer state by name
media: colorspaces-details.rst: fix V4L2_COLORSPACE_JPEG description
media: tw68: use generic power management
media: meye: use generic power management
media: cx88: use generic power management
media: cx25821: use generic power management
...
Core:
- Support out of order dma completion
- Support for repeating transaction
New controllers:
- Support for Actions S700 DMA engine
- Renesas R8A774E1, r8a7742 controller binding
- New driver for Xilinx DPDMA controller
Others:
- Support of out of order dma completion in idxd driver
- W=1 warning cleanup of subsystem
- Updates to ti-k3-dma, dw, idxd drivers
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+vs47OPLdNbVcHzyfBQHDyUjg0cFAl8s6voACgkQfBQHDyUj
g0f7Aw/+NjqyWAMZ4WpP6p2AN+5Evs7MY0fhhJMkU7ShbQlBM1GKrrNpMhaOaMw2
KB7xWvsfMnpKhxcq5LL2ymMnzJgJHVi0Zp9aRwNQXmJfHyCTDoqv54ljd5ADaL/O
XLBLBWc6h5WbAsWmpiovb/EQ58RAU/bvlPD7gntK9Y8n5ha32c+jFnOg+Fd3uINl
x9uSHKUOWFVRvIJgOrFcFwl2eT0erFcme7WyCWuNfSFDZlJqOdfVf1TfTVcfyAYY
8r6VWPOyiAc97SPN1hVYMUqqTtRAEDlsPRfeyvUm2pnRJnbyJdHbvbA0l/OMvzH5
3q5SBXz6NgoZsO6GPiSEV679K0nsuZOCqfevNb6+UQUrO7f5JyEbwGTrWju6F3fg
UVTENto8XW7KCE+oTkJBgZ6utbDtK5dpoKghX59lN3nKogqzGi3JUlgTtlSIF+AY
CnmESWM37f1jw1Ew58gmSYRFfKQV2fLwcAePnaV4HaNV70uFoYnhPvVenSvgYeky
24D8O5fzzhRHsSqUPTLTZ/u4cGJtOiBzQWdWcUXig/mfHKpu9i4nejHmuA2x64l0
oFc3nKwd7XrGVg2l4XMx1T0x69+1dlc0eEkZ7lRGzZgDCMKeHEsLOBGaid+bMO09
4IMzxoQxINui6l8csX5ctbRdXfUFZKZaZU36RxQeysidLE6QDGk=
=OfZv
-----END PGP SIGNATURE-----
Merge tag 'dmaengine-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
"Core:
- Support out of order dma completion
- Support for repeating transaction
New controllers:
- Support for Actions S700 DMA engine
- Renesas R8A774E1, r8a7742 controller binding
- New driver for Xilinx DPDMA controller
Other:
- Support of out of order dma completion in idxd driver
- W=1 warning cleanup of subsystem
- Updates to ti-k3-dma, dw, idxd drivers"
* tag 'dmaengine-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (68 commits)
dmaengine: dw: Don't include unneeded header to platform data header
dmaengine: Actions: Add support for S700 DMA engine
dmaengine: Actions: get rid of bit fields from dma descriptor
dt-bindings: dmaengine: convert Actions Semi Owl SoCs bindings to yaml
dmaengine: idxd: add missing invalid flags field to completion
dmaengine: dw: Initialize max_sg_burst capability
dmaengine: dw: Introduce max burst length hw config
dmaengine: dw: Initialize min and max burst DMA device capability
dmaengine: dw: Set DMA device max segment size parameter
dmaengine: dw: Take HC_LLP flag into account for noLLP auto-config
dmaengine: Introduce DMA-device device_caps callback
dmaengine: Introduce max SG burst capability
dmaengine: Introduce min burst length capability
dt-bindings: dma: dw: Add max burst transaction length property
dt-bindings: dma: dw: Convert DW DMAC to DT binding
dmaengine: ti: k3-udma: Query throughput level information from hardware
dmaengine: ti: k3-udma: Use defines for capabilities register parsing
dmaengine: xilinx: dpdma: Fix kerneldoc warning
dmaengine: xilinx: dpdma: add missing kernel doc
dmaengine: xilinx: dpdma: remove comparison of unsigned expression
...
Merge misc updates from Andrew Morton:
- a few MM hotfixes
- kthread, tools, scripts, ntfs and ocfs2
- some of MM
Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
mm: vmscan: consistent update to pgrefill
mm/vmscan.c: fix typo
khugepaged: khugepaged_test_exit() check mmget_still_valid()
khugepaged: retract_page_tables() remember to test exit
khugepaged: collapse_pte_mapped_thp() protect the pmd lock
khugepaged: collapse_pte_mapped_thp() flush the right range
mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
mm: thp: replace HTTP links with HTTPS ones
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
mm/page_alloc.c: skip setting nodemask when we are in interrupt
mm/page_alloc: fallbacks at most has 3 elements
mm/page_alloc: silence a KASAN false positive
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
mm/page_alloc.c: simplify pageblock bitmap access
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
mm/shuffle: remove dynamic reconfiguration
mm/memory_hotplug: document why shuffle_zone() is relevant
mm/page_alloc: remove nr_free_pagecache_pages()
mm: remove vm_total_pages
...
Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.
First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.
To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.
Fixes: d7fefcc8de (mm/cma: add PF flag to force non cma alloc)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After previous cleanup, the end_bitidx is not necessary any more.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to commit e58469bafd ("mm: page_alloc: use word-based accesses for
get/set pageblock bitmaps"), pageblock bitmap is accessed with word-based
access. This operation could be simplified a little.
Intuitively, if we want to get a bit range [start_idx, end_idx] in a word,
we can do like this:
mask = (1 << (end_bitidx - start_bitidx + 1)) - 1;
ret = (word >> start_idx) & mask;
And also if we want to set a bit range [start_idx, end_idx] with flags, we
can do the same by just shift start_bitidx.
By doing so we reduce some instructions for these two helper functions:
Before Patched
set_pfnblock_flags_mask 209 198(-5%)
get_pfnblock_flags_mask 101 87(-13%)
Since the syntax is changed a little, we need to check the whole 4-bit
migrate_type instead of part of it.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-3-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We already have the definition of PB_migratetype_bits and current
NR_MIGRATETYPE_BITS looks like a cyclic definition.
Just use PB_migratetype_bits is enough.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/20200623124201.8199-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and
the name does not really help to understand what's going on. Let's
open-code it instead and add a comment.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global variable "vm_total_pages" is a relic from older days. There is
only a single user that reads the variable - build_all_zonelists() - and
the first thing it does is update it.
Use a local variable in build_all_zonelists() instead and remove the
global variable.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When CONFIG_EFI is not enabled, we might get an undefined reference to
efi_enter_virtual_mode() error, if this efi_enabled() call isn't inlined
into start_kernel(). This happens in particular, if start_kernel() is
annodated with __no_sanitize_address.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Elena Petrova <lenaptr@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Link: http://lkml.kernel.org/r/6514652d3a32d3ed33d6eb5c91d0af63bf0d1a0c.1596544734.git.andreyknvl@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kasan_unpoison_stack_above_sp_to() is defined in kasan code but never
used. The function was introduced as part of the commit:
commit 9f7d416c36 ("kprobes: Unpoison stack in jprobe_return() for KASAN")
... where it was necessary because x86's jprobe_return() would leave
stale shadow on the stack, and was an oddity in that regard.
Since then, jprobes were removed entirely, and as of commit:
commit 80006dbee6 ("kprobes/x86: Remove jprobe implementation")
... there have been no callers of this function.
Remove the declaration and the implementation.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20200706143505.23299-1-vincenzo.frascino@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "kasan: memorize and print call_rcu stack", v8.
This patchset improves KASAN reports by making them to have call_rcu()
call stack information. It is useful for programmers to solve
use-after-free or double-free memory issue.
The KASAN report was as follows(cleaned up slightly):
BUG: KASAN: use-after-free in kasan_rcu_reclaim+0x58/0x60
Freed by task 0:
kasan_save_stack+0x24/0x50
kasan_set_track+0x24/0x38
kasan_set_free_info+0x18/0x20
__kasan_slab_free+0x10c/0x170
kasan_slab_free+0x10/0x18
kfree+0x98/0x270
kasan_rcu_reclaim+0x1c/0x60
Last call_rcu():
kasan_save_stack+0x24/0x50
kasan_record_aux_stack+0xbc/0xd0
call_rcu+0x8c/0x580
kasan_rcu_uaf+0xf4/0xf8
Generic KASAN will record the last two call_rcu() call stacks and print up
to 2 call_rcu() call stacks in KASAN report. it is only suitable for
generic KASAN.
This feature considers the size of struct kasan_alloc_meta and
kasan_free_meta, we try to optimize the structure layout and size, lets it
get better memory consumption.
[1]https://bugzilla.kernel.org/show_bug.cgi?id=198437
[2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ
This patch (of 4):
This feature will record the last two call_rcu() call stacks and prints up
to 2 call_rcu() call stacks in KASAN report.
When call_rcu() is called, we store the call_rcu() call stack into slub
alloc meta-data, so that the KASAN report can print rcu stack.
[1]https://bugzilla.kernel.org/show_bug.cgi?id=198437
[2]https://groups.google.com/forum/#!searchin/kasan-dev/better$20stack$20traces$20for$20rcu%7Csort:date/kasan-dev/KQsjT_88hDE/7rNUZprRBgAJ
[walter-zh.wu@mediatek.com: build fix]
Link: http://lkml.kernel.org/r/20200710162401.23816-1-walter-zh.wu@mediatek.com
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Link: http://lkml.kernel.org/r/20200710162123.23713-1-walter-zh.wu@mediatek.com
Link: http://lkml.kernel.org/r/20200601050847.1096-1-walter-zh.wu@mediatek.com
Link: http://lkml.kernel.org/r/20200601050927.1153-1-walter-zh.wu@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
functions that call memory_present() for each region in memblock.memory:
sparse_memory_present_with_active_regions() and membocks_present().
Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in the most cases they are called
one after the other.
Mark the regions from memblock.memory as present during sparce_init() by
making sparse_init() call memblocks_present(), make memblocks_present()
and memory_present() functions static and remove redundant
sparse_memory_present_with_active_regions() function.
Also remove no longer required HAVE_MEMORY_PRESENT configuration option.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/mremap: cleanup move_page_tables() a little", v5.
move_page_tables() tries to move page table by PMD or PTE.
The root reason is if it tries to move PMD, both old and new range should
be PMD aligned. But current code calculate old range and new range
separately. This leads to some redundant check and calculation.
This cleanup tries to consolidate the range check in one place to reduce
some extra range handling.
This patch (of 3):
old_end is passed to these two functions to check whether there is enough
space to do the move, while this check is done before invoking these
functions.
These two functions only would be invoked when extent meets the
requirement and there is one check before invoking these functions:
if (extent > old_end - old_addr)
extent = old_end - old_addr;
This implies (old_end - old_addr) won't fail the check in these two
functions.
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Link: http://lkml.kernel.org/r/20200710092835.56368-1-richard.weiyang@linux.alibaba.com
Link: http://lkml.kernel.org/r/20200710092835.56368-2-richard.weiyang@linux.alibaba.com
Link: http://lkml.kernel.org/r/20200708095028.41706-1-richard.weiyang@linux.alibaba.com
Link: http://lkml.kernel.org/r/20200708095028.41706-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current split between do_mmap() and do_mmap_pgoff() was introduced in
commit 1fcfd8db7f ("mm, mpx: add "vm_flags_t vm_flags" arg to
do_mmap_pgoff()") to support MPX.
The wrapper function do_mmap_pgoff() always passed 0 as the value of the
vm_flags argument to do_mmap(). However, MPX support has subsequently
been removed from the kernel and there were no more direct callers of
do_mmap(); all calls were going via do_mmap_pgoff().
Simplify the code by removing do_mmap_pgoff() and changing all callers to
directly call do_mmap(), which now no longer takes a vm_flags argument.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20200727194109.1371462-1-pcc@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are many instances where vmemap allocation is often switched between
regular memory and device memory just based on whether altmap is available
or not. vmemmap_alloc_block_buf() is used in various platforms to
allocate vmemmap mappings. Lets also enable it to handle altmap based
device memory allocation along with existing regular memory allocations.
This will help in avoiding the altmap based allocation switch in many
places. To summarize there are two different methods to call
vmemmap_alloc_block_buf().
vmemmap_alloc_block_buf(size, node, NULL) /* Allocate from system RAM */
vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */
This converts altmap_alloc_block_buf() into a static function, drops it's
entry from the header and updates Documentation/vm/memory-model.rst.
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jia He <justin.he@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yu Zhao <yuzhao@google.com>
Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "arm64: Enable vmemmap mapping from device memory", v4.
This series enables vmemmap backing memory allocation from device memory
ranges on arm64. But before that, it enables vmemmap_populate_basepages()
and vmemmap_alloc_block_buf() to accommodate struct vmem_altmap based
alocation requests.
This patch (of 3):
vmemmap_populate_basepages() is used across platforms to allocate backing
memory for vmemmap mapping. This is used as a standard default choice or
as a fallback when intended huge pages allocation fails. This just
creates entire vmemmap mapping with base pages (PAGE_SIZE).
On arm64 platforms, vmemmap_populate_basepages() is called instead of the
platform specific vmemmap_populate() when ARM64_SWAPPER_USES_SECTION_MAPS
is not enabled as in case for ARM64_16K_PAGES and ARM64_64K_PAGES configs.
At present vmemmap_populate_basepages() does not support allocating from
driver defined struct vmem_altmap while trying to create vmemmap mapping
for a device memory range. It prevents ARM64_16K_PAGES and
ARM64_64K_PAGES configs on arm64 from supporting device memory with
vmemap_altmap request.
This enables vmem_altmap support in vmemmap_populate_basepages() unlocking
device memory allocation for vmemap mapping on arm64 platforms with 16K or
64K base page configs.
Each architecture should evaluate and decide on subscribing device memory
based base page allocation through vmemmap_populate_basepages(). Hence
lets keep it disabled on all archs in order to preserve the existing
semantics. A subsequent patch enables it on arm64.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Jia He <justin.he@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Link: http://lkml.kernel.org/r/1594004178-8861-1-git-send-email-anshuman.khandual@arm.com
Link: http://lkml.kernel.org/r/1594004178-8861-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':
94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
for the percpu counter.
So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
add a sysctl handler to adjust it when the policy is reconfigured.
Benchmark with the same testcase in [1] shows 53% improvement on a 8C/16T
desktop, and 2097%(20X) on a 4S/72C/144T server. We tested with test
platforms in 0day (server, desktop and laptop), and 80%+ platforms shows
improvements with that test. And whether it shows improvements depends on
if the test mmap size is bigger than the batch number computed.
And if the lift is 16X, 1/3 of the platforms will show improvements,
though it should help the mmap/unmap usage generally, as Michal Hocko
mentioned:
: I believe that there are non-synthetic worklaods which would benefit from
: a larger batch. E.g. large in memory databases which do large mmaps
: during startups from multiple threads.
[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu_counter's accuracy is related to its batch size. For a
percpu_counter with a big batch, its deviation could be big, so when the
counter's batch is runtime changed to a smaller value for better accuracy,
there could also be requirment to reduce the big deviation.
So add a percpu-counter sync function to be run on each CPU.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tim Chen <tim.c.chen@intel.com>
Link: http://lkml.kernel.org/r/1594389708-60781-4-git-send-email-feng.tang@intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The functions are only used in two source files, so there is no need for
them to be in the global <linux/mm.h> header. Move them to the new
<linux/pgalloc-track.h> header and include it only where needed.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200609120533.25867-1-joro@8bytes.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_protected currently is both used to set effective low and min
and return a mem_cgroup_protection based on the result. As a user, this
can be a little unexpected: it appears to be a simple predicate function,
if not for the big warning in the comment above about the order in which
it must be executed.
This change makes it so that we separate the state mutations from the
actual protection checks, which makes it more obvious where we need to be
careful mutating internal state, and where we are simply checking and
don't need to worry about that.
[mhocko@suse.com - don't check protection on root memcgs]
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.
This series contains a fix for a edge case in my earlier protection
calculation patches, and a patch to make the area overall a little more
robust to hopefully help avoid this in future.
This patch (of 2):
A cgroup can have both memory protection and a memory limit to isolate it
from its siblings in both directions - for example, to prevent it from
being shrunk below 2G under high pressure from outside, but also from
growing beyond 4G under low pressure.
Commit 9783aa9917 ("mm, memcg: proportional memory.{low,min} reclaim")
implemented proportional scan pressure so that multiple siblings in excess
of their protection settings don't get reclaimed equally but instead in
accordance to their unprotected portion.
During limit reclaim, this proportionality shouldn't apply of course:
there is no competition, all pressure is from within the cgroup and should
be applied as such. Reclaim should operate at full efficiency.
However, mem_cgroup_protected() never expected anybody to look at the
effective protection values when it indicated that the cgroup is above its
protection. As a result, a query during limit reclaim may return stale
protection values that were calculated by a previous reclaim cycle in
which the cgroup did have siblings.
When this happens, reclaim is unnecessarily hesitant and potentially slow
to meet the desired limit. In theory this could lead to premature OOM
kills, although it's not obvious this has occurred in practice.
Workaround the problem by special casing reclaim roots in
mem_cgroup_protection. These memcgs are never participating in the
reclaim protection because the reclaim is internal.
We have to ignore effective protection values for reclaim roots because
mem_cgroup_protected might be called from racing reclaim contexts with
different roots. Calculation is relying on root -> leaf tree traversal
therefore top-down reclaim protection invariants should hold. The only
exception is the reclaim root which should have effective protection set
to 0 but that would be problematic for the following setup:
Let's have global and A's reclaim in parallel:
|
A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
|\
| C (low = 1G, usage = 2.5G)
B (low = 1G, usage = 0.5G)
for A reclaim we have
B.elow = B.low
C.elow = C.low
For the global reclaim
A.elow = A.low
B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
C.elow = min(C.usage, C.low)
With the effective values resetting we have A reclaim
A.elow = 0
B.elow = B.low
C.elow = C.low
and global reclaim could see the above and then
B.elow = C.elow = 0 because children_low_usage > A.elow
Which means that protected memcgs would get reclaimed.
In future we would like to make mem_cgroup_protected more robust against
racing reclaim contexts but that is likely more complex solution than this
simple workaround.
[hannes@cmpxchg.org - large part of the changelog]
[mhocko@suse.com - workaround explanation]
[chris@chrisdown.name - retitle]
Fixes: 9783aa9917 ("mm, memcg: proportional memory.{low,min} reclaim")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently memcg_kmem_enabled() is optimized for the kernel memory
accounting being off. It was so for a long time, and arguably the reason
behind was that the kernel memory accounting was initially an opt-in
feature. However, now it's on by default on both cgroup v1 and cgroup v2,
and it's on for all cgroups. So let's switch over to
static_branch_likely() to reflect this fact.
Unlikely there is a significant performance difference, as the cost of a
memory allocation and its accounting significantly exceeds the cost of a
jump. However, the conversion makes the code look more logically.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200707173612.124425-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the kernel stack is being accounted per-zone. There is no need
to do that. In addition due to being per-zone, memcg has to keep a
separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
node_stat_item. In addition localize the kernel stack stats updates to
account_kernel_stack().
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>