mm/gup: track FOLL_PIN pages
Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.
As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().
Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section. (That limitation will be
removed in a following patch.)
The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".
Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:
bool page_maybe_dma_pinned(struct page *page);
What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
[1] Some slow progress on get_user_pages() (Apr 2, 2019):
https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
[jhubbard@nvidia.com: add kerneldoc]
Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit is contained in:
parent
94202f126f
commit
3faa52c03f
5 changed files with 379 additions and 104 deletions
|
|
@ -1001,6 +1001,8 @@ static inline void get_page(struct page *page)
|
|||
page_ref_inc(page);
|
||||
}
|
||||
|
||||
bool __must_check try_grab_page(struct page *page, unsigned int flags);
|
||||
|
||||
static inline __must_check bool try_get_page(struct page *page)
|
||||
{
|
||||
page = compound_head(page);
|
||||
|
|
@ -1029,29 +1031,79 @@ static inline void put_page(struct page *page)
|
|||
__put_page(page);
|
||||
}
|
||||
|
||||
/**
|
||||
* unpin_user_page() - release a gup-pinned page
|
||||
* @page: pointer to page to be released
|
||||
/*
|
||||
* GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
|
||||
* the page's refcount so that two separate items are tracked: the original page
|
||||
* reference count, and also a new count of how many pin_user_pages() calls were
|
||||
* made against the page. ("gup-pinned" is another term for the latter).
|
||||
*
|
||||
* Pages that were pinned via pin_user_pages*() must be released via either
|
||||
* unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
|
||||
* that eventually such pages can be separately tracked and uniquely handled. In
|
||||
* particular, interactions with RDMA and filesystems need special handling.
|
||||
* With this scheme, pin_user_pages() becomes special: such pages are marked as
|
||||
* distinct from normal pages. As such, the unpin_user_page() call (and its
|
||||
* variants) must be used in order to release gup-pinned pages.
|
||||
*
|
||||
* unpin_user_page() and put_page() are not interchangeable, despite this early
|
||||
* implementation that makes them look the same. unpin_user_page() calls must
|
||||
* be perfectly matched up with pin*() calls.
|
||||
* Choice of value:
|
||||
*
|
||||
* By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
|
||||
* counts with respect to pin_user_pages() and unpin_user_page() becomes
|
||||
* simpler, due to the fact that adding an even power of two to the page
|
||||
* refcount has the effect of using only the upper N bits, for the code that
|
||||
* counts up using the bias value. This means that the lower bits are left for
|
||||
* the exclusive use of the original code that increments and decrements by one
|
||||
* (or at least, by much smaller values than the bias value).
|
||||
*
|
||||
* Of course, once the lower bits overflow into the upper bits (and this is
|
||||
* OK, because subtraction recovers the original values), then visual inspection
|
||||
* no longer suffices to directly view the separate counts. However, for normal
|
||||
* applications that don't have huge page reference counts, this won't be an
|
||||
* issue.
|
||||
*
|
||||
* Locking: the lockless algorithm described in page_cache_get_speculative()
|
||||
* and page_cache_gup_pin_speculative() provides safe operation for
|
||||
* get_user_pages and page_mkclean and other calls that race to set up page
|
||||
* table entries.
|
||||
*/
|
||||
static inline void unpin_user_page(struct page *page)
|
||||
{
|
||||
put_page(page);
|
||||
}
|
||||
#define GUP_PIN_COUNTING_BIAS (1U << 10)
|
||||
|
||||
void unpin_user_page(struct page *page);
|
||||
void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
|
||||
bool make_dirty);
|
||||
|
||||
void unpin_user_pages(struct page **pages, unsigned long npages);
|
||||
|
||||
/**
|
||||
* page_maybe_dma_pinned() - report if a page is pinned for DMA.
|
||||
*
|
||||
* This function checks if a page has been pinned via a call to
|
||||
* pin_user_pages*().
|
||||
*
|
||||
* For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
|
||||
* because it means "definitely not pinned for DMA", but true means "probably
|
||||
* pinned for DMA, but possibly a false positive due to having at least
|
||||
* GUP_PIN_COUNTING_BIAS worth of normal page references".
|
||||
*
|
||||
* False positives are OK, because: a) it's unlikely for a page to get that many
|
||||
* refcounts, and b) all the callers of this routine are expected to be able to
|
||||
* deal gracefully with a false positive.
|
||||
*
|
||||
* For more information, please see Documentation/vm/pin_user_pages.rst.
|
||||
*
|
||||
* @page: pointer to page to be queried.
|
||||
* @Return: True, if it is likely that the page has been "dma-pinned".
|
||||
* False, if the page is definitely not dma-pinned.
|
||||
*/
|
||||
static inline bool page_maybe_dma_pinned(struct page *page)
|
||||
{
|
||||
/*
|
||||
* page_ref_count() is signed. If that refcount overflows, then
|
||||
* page_ref_count() returns a negative value, and callers will avoid
|
||||
* further incrementing the refcount.
|
||||
*
|
||||
* Here, for that overflow case, use the signed bit to count a little
|
||||
* bit higher via unsigned math, and thus still get an accurate result.
|
||||
*/
|
||||
return ((unsigned int)page_ref_count(compound_head(page))) >=
|
||||
GUP_PIN_COUNTING_BIAS;
|
||||
}
|
||||
|
||||
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
|
||||
#define SECTION_IN_PAGE_FLAGS
|
||||
#endif
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue