mm/gup: track FOLL_PIN pages

Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
implemented via overloading of page->_refcount: pins are added by adding
GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
indication of pinning, and it can have false positives (and that's OK).
Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
details.

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under the
"TODO: for 1GB and larger huge pages" section.  (That limitation will be
removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a new
function call:

   bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

[jhubbard@nvidia.com: add kerneldoc]
  Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
[imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
  Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
[akpm@linux-foundation.org: fix put_compound_head defined but not used]
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit is contained in:
John Hubbard 2020-04-01 21:05:29 -07:00 committed by Linus Torvalds
parent 94202f126f
commit 3faa52c03f
5 changed files with 379 additions and 104 deletions

View file

@ -1001,6 +1001,8 @@ static inline void get_page(struct page *page)
page_ref_inc(page);
}
bool __must_check try_grab_page(struct page *page, unsigned int flags);
static inline __must_check bool try_get_page(struct page *page)
{
page = compound_head(page);
@ -1029,29 +1031,79 @@ static inline void put_page(struct page *page)
__put_page(page);
}
/**
* unpin_user_page() - release a gup-pinned page
* @page: pointer to page to be released
/*
* GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
* the page's refcount so that two separate items are tracked: the original page
* reference count, and also a new count of how many pin_user_pages() calls were
* made against the page. ("gup-pinned" is another term for the latter).
*
* Pages that were pinned via pin_user_pages*() must be released via either
* unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
* that eventually such pages can be separately tracked and uniquely handled. In
* particular, interactions with RDMA and filesystems need special handling.
* With this scheme, pin_user_pages() becomes special: such pages are marked as
* distinct from normal pages. As such, the unpin_user_page() call (and its
* variants) must be used in order to release gup-pinned pages.
*
* unpin_user_page() and put_page() are not interchangeable, despite this early
* implementation that makes them look the same. unpin_user_page() calls must
* be perfectly matched up with pin*() calls.
* Choice of value:
*
* By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
* counts with respect to pin_user_pages() and unpin_user_page() becomes
* simpler, due to the fact that adding an even power of two to the page
* refcount has the effect of using only the upper N bits, for the code that
* counts up using the bias value. This means that the lower bits are left for
* the exclusive use of the original code that increments and decrements by one
* (or at least, by much smaller values than the bias value).
*
* Of course, once the lower bits overflow into the upper bits (and this is
* OK, because subtraction recovers the original values), then visual inspection
* no longer suffices to directly view the separate counts. However, for normal
* applications that don't have huge page reference counts, this won't be an
* issue.
*
* Locking: the lockless algorithm described in page_cache_get_speculative()
* and page_cache_gup_pin_speculative() provides safe operation for
* get_user_pages and page_mkclean and other calls that race to set up page
* table entries.
*/
static inline void unpin_user_page(struct page *page)
{
put_page(page);
}
#define GUP_PIN_COUNTING_BIAS (1U << 10)
void unpin_user_page(struct page *page);
void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
bool make_dirty);
void unpin_user_pages(struct page **pages, unsigned long npages);
/**
* page_maybe_dma_pinned() - report if a page is pinned for DMA.
*
* This function checks if a page has been pinned via a call to
* pin_user_pages*().
*
* For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
* because it means "definitely not pinned for DMA", but true means "probably
* pinned for DMA, but possibly a false positive due to having at least
* GUP_PIN_COUNTING_BIAS worth of normal page references".
*
* False positives are OK, because: a) it's unlikely for a page to get that many
* refcounts, and b) all the callers of this routine are expected to be able to
* deal gracefully with a false positive.
*
* For more information, please see Documentation/vm/pin_user_pages.rst.
*
* @page: pointer to page to be queried.
* @Return: True, if it is likely that the page has been "dma-pinned".
* False, if the page is definitely not dma-pinned.
*/
static inline bool page_maybe_dma_pinned(struct page *page)
{
/*
* page_ref_count() is signed. If that refcount overflows, then
* page_ref_count() returns a negative value, and callers will avoid
* further incrementing the refcount.
*
* Here, for that overflow case, use the signed bit to count a little
* bit higher via unsigned math, and thus still get an accurate result.
*/
return ((unsigned int)page_ref_count(compound_head(page))) >=
GUP_PIN_COUNTING_BIAS;
}
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif