Commit graph

57824 commits

Author SHA1 Message Date
Stephen Smalley
5d72801538 lsm_audit: update my email address
Update my email address since epoch.ncsc.mil no longer exists.
MAINTAINERS and CREDITS are already correct.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2017-08-17 15:33:39 -04:00
Jan Kara
bc8230ee8e quota: Convert dqio_mutex to rwsem
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.

Signed-off-by: Jan Kara <jack@suse.cz>
2017-08-17 18:52:48 +02:00
Linus Torvalds
c8c03f1858 pty: fix the cached path of the pty slave file descriptor in the master
Christian Brauner reported that if you use the TIOCGPTPEER ioctl() to
get a slave pty file descriptor, the resulting file descriptor doesn't
look right in /proc/<pid>/fd/<fd>.  In particular, he wanted to use
readlink() on /proc/self/fd/<fd> to get the pathname of the slave pty
(basically implementing "ptsname{_r}()").

The reason for that was that we had generated the wrong 'struct path'
when we create the pty in ptmx_open().

In particular, the dentry was correct, but the vfsmount pointed to the
mount of the ptmx node. That _can_ be correct - in case you use
"/dev/pts/ptmx" to open the master - but usually is not.  The normal
case is to use /dev/ptmx, which then looks up the pts/ directory, and
then the vfsmount of the ptmx node is obviously the /dev directory, not
the /dev/pts/ directory.

We actually did have the right vfsmount available, but in the wrong
place (it gets looked up in 'devpts_acquire()' when we get a reference
to the pts filesystem), and so ptmx_open() used the wrong mnt pointer.

The end result of this confusion was that the pty worked fine, but when
if you did TIOCGPTPEER to get the slave side of the pty, end end result
would also work, but have that dodgy 'struct path'.

And then when doing "d_path()" on to get the pathname, the vfsmount
would not match the root of the pts directory, and d_path() would return
an empty pathname thinking that the entry had escaped a bind mount into
another mount.

This fixes the problem by making devpts_acquire() return the vfsmount
for the pts filesystem, allowing ptmx_open() to trivially just use the
right mount for the pts dentry, and create the proper 'struct path'.

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17 09:10:48 -07:00
Paul E. McKenney
656e7c0c0a Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', 'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEAD
doc.2017.08.17a: Documentation updates.
fixes.2017.08.17a: RCU fixes.
hotplug.2017.07.25b: CPU-hotplug updates.
misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts).
spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait().
srcu.2017.07.27c: SRCU updates.
torture.2017.07.24c: Torture-test updates.
2017-08-17 08:10:04 -07:00
Paul E. McKenney
d3a024abbc locking: Remove spin_unlock_wait() generic definitions
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes spin_unlock_wait() and related
definitions from core code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-17 08:08:58 -07:00
Joerg Roedel
fa94cf84af Merge branch 'core' into arm/tegra 2017-08-17 16:31:27 +02:00
Luis R. Rodriguez
352eee1242 swait: Add idle variants which don't contribute to load average
There are cases where folks are using an interruptible swait when
using kthreads. This is rather confusing given you'd expect
interruptible waits to be -- interruptible, but kthreads are not
interruptible ! The reason for such practice though is to avoid
having these kthreads contribute to the system load average.

When systems are idle some kthreads may spend a lot of time blocking if
using swait_event_timeout(). This would contribute to the system load
average. On systems without preemption this would mean the load average
of an idle system is bumped to 2 instead of 0. On systems with PREEMPT=y
this would mean the load average of an idle system is bumped to 3
instead of 0.

This adds proper API using TASK_IDLE to make such goals explicit and
avoid confusion.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17 07:26:07 -07:00
Paul E. McKenney
ccdd29ffff rcu: Create reasonable API for do_exit() TASKS_RCU processing
Currently, the exit-time support for TASKS_RCU is open-coded in do_exit().
This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish()
APIs for do_exit() use.  This has the benefit of confining the use of the
tasks_rcu_exit_srcu variable to one file, allowing it to become static.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-08-17 07:26:05 -07:00
Paul E. McKenney
7e42776d5e rcu: Drive TASKS_RCU directly off of PREEMPT
The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched
is used instead.  This commit therefore makes synchronize_rcu_tasks()
and call_rcu_tasks() available always, but mapped to synchronize_sched()
and call_rcu_sched(), respectively, when !PREEMPT.  This approach also
allows some #ifdefs to be removed from rcutorture.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 07:26:04 -07:00
Boqun Feng
52fa5bc5cb locking/lockdep: Explicitly initialize wq_barrier::done::map
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:

> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task   B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task   C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done

Such a dead lock can not happen because Task C's barr->done and Worker D's
barr->done can not be the same instance.

The reason of this false positive is we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class, therefore, impossible circles are reported.

To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct, as a result we won't build
a dependency between a wq_barrier with a unrelated work, and we can
differ wq barriers based on the related works, so the false positive
above is avoided.

Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 12:12:33 +02:00
Byungchul Park
ea3f2c0fdf locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
'complete' is an adjective and LOCKDEP_COMPLETE sounds like 'lockdep is complete',
so pick a better name that uses a noun.

Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@lge.com
Link: http://lkml.kernel.org/r/1502960261-16206-3-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 11:38:55 +02:00
Baoquan He
02e43c2dcd efi: Introduce efi_early_memdesc_ptr to get pointer to memmap descriptor
The existing map iteration helper for_each_efi_memory_desc_in_map can
only be used after the kernel initializes the EFI subsystem to set up
struct efi_memory_map.

Before that we also need iterate map descriptors which are stored in several
intermediate structures, like struct efi_boot_memmap for arch independent
usage and struct efi_info for x86 arch only.

Introduce efi_early_memdesc_ptr() to get pointer to a map descriptor, and
replace several places where that primitive is open coded.

Signed-off-by: Baoquan He <bhe@redhat.com>
[ Various improvements to the text. ]
Acked-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: ard.biesheuvel@linaro.org
Cc: fanc.fnst@cn.fujitsu.com
Cc: izumi.taku@jp.fujitsu.com
Cc: keescook@chromium.org
Cc: linux-efi@vger.kernel.org
Cc: n-horiguchi@ah.jp.nec.com
Cc: thgarnie@google.com
Link: http://lkml.kernel.org/r/20170816134651.GF21273@x1
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:50:57 +02:00
Kees Cook
7a46ec0e2f locking/refcounts, x86/asm: Implement fast refcount overflow protection
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.

This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:

  http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/

  http://cyseclabs.com/page?n=02012016

This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.

On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.

As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.

Example assembly comparison:

refcount_inc() before:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)

refcount_inc() after:

  .text:
  ffffffff81546149:       f0 ff 45 f4             lock incl -0xc(%rbp)
  ffffffff8154614d:       0f 88 80 d5 17 00       js     ffffffff816c36d3
  ...
  .text.unlikely:
  ffffffff816c36d3:       48 8d 4d f4             lea    -0xc(%rbp),%rcx
  ffffffff816c36d7:       0f ff                   (bad)

These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):

  2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
		    cycles		protections
  atomic_t           82249267387	none
  refcount_t-fast    82211446892	overflow, untested dec-to-zero
  refcount_t-full   144814735193	overflow, untested dec-to-zero, inc-from-zero

This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement for repurposing this
code to be a refcount-only protection.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:40:26 +02:00
Tony Luck
ce0fa3e56a x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages
Speculative processor accesses may reference any memory that has a
valid page table entry.  While a speculative access won't generate
a machine check, it will log the error in a machine check bank. That
could cause escalation of a subsequent error since the overflow bit
will be then set in the machine check bank status register.

Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
address of the page we want to map out otherwise we may trigger the
very problem we are trying to avoid.  We use a non-canonical address
that passes through the usual Linux table walking code to get to the
same "pte".

Thanks to Dave Hansen for reviewing several iterations of this.

Also see:

  http://marc.info/?l=linux-mm&m=149860136413338&w=2

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 10:30:49 +02:00
Ingo Molnar
927d2c21f2 Merge branch 'linus' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17 09:41:41 +02:00
Eric Dumazet
81fbfe8ada ptr_ring: use kmalloc_array()
As found by syzkaller, malicious users can set whatever tx_queue_len
on a tun device and eventually crash the kernel.

Lets remove the ALIGN(XXX, SMP_CACHE_BYTES) thing since a small
ring buffer is not fast anyway.

Fixes: 2e0ab8ca83 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:28:47 -07:00
stephen hemminger
5dd0fb9b9f vmbus: remove unused vmbus_sendpacket_ctl
The only usage of vmbus_sendpacket_ctl was by vmbus_sendpacket.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:27:45 -07:00
stephen hemminger
5a668d8cdd vmbus: remove unused vmubs_sendpacket_pagebuffer_ctl
The function vmbus_sendpacket_pagebuffer_ctl was never used directly.
Just have vmbus_send_pagebuffer

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:27:45 -07:00
stephen hemminger
9a603b8e11 vmbus: remove unused vmbus_sendpacket_multipagebuffer
This function is not used anywhere in current code.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 16:27:45 -07:00
John Fastabend
6bdc9c4c31 bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER
Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER

net/core/filter.c: In function ‘do_sk_redirect_map’:
net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration]
   sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
   ^
net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default]
   sk = __sock_map_lookup_elem(ri->map, ri->ifindex);

Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 15:34:13 -07:00
Dave Airlie
3154b13371 Merge tag 'drm-misc-next-2017-08-16' of git://anongit.freedesktop.org/git/drm-misc into drm-next
UAPI Changes:
- vc4: Allow userspace to dictate rendering order in submit_cl ioctl (Eric)

Cross-subsystem Changes:
- vboxvideo: One of Cihangir's patches applies to vboxvideo which is maintained
	     in staging

Core Changes:
- atomic_legacy_backoff is officially killed (Daniel)
- Extract drm_device.h (Daniel)
- Unregister drm device on unplug (Daniel)
- Rename deprecated drm_*_(un)?reference functions to drm_*_{get|put} (Cihangir)

Driver Changes:
- vc4: Error/destroy path cleanups, log level demotion, edid leak (Eric)
- various: Make various drm_*_funcs structs const (Bhumika)
- tinydrm: add support for LEGO MINDSTORMS EV3 LCD (David)
- various: Second half of .dumb_{map_offset|destroy} defaults set (Noralf)

Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Eric Anholt <eric@anholt.net>
Cc: Bhumika Goyal <bhumirks@gmail.com>
Cc: Cihangir Akturk <cakturk@gmail.com>
Cc: David Lechner <david@lechnology.com>
Cc: Noralf Trønnes <noralf@tronnes.org>

* tag 'drm-misc-next-2017-08-16' of git://anongit.freedesktop.org/git/drm-misc: (50 commits)
  drm/gem-cma-helper: Remove drm_gem_cma_dumb_map_offset()
  drm/virtio: Use the drm_driver.dumb_destroy default
  drm/bochs: Use the drm_driver.dumb_destroy default
  drm/mgag200: Use the drm_driver.dumb_destroy default
  drm/exynos: Use .dumb_map_offset and .dumb_destroy defaults
  drm/msm: Use the drm_driver.dumb_destroy default
  drm/ast: Use the drm_driver.dumb_destroy default
  drm/qxl: Use the drm_driver.dumb_destroy default
  drm/udl: Use the drm_driver.dumb_destroy default
  drm/cirrus: Use the drm_driver.dumb_destroy default
  drm/tegra: Use .dumb_map_offset and .dumb_destroy defaults
  drm/gma500: Use .dumb_map_offset and .dumb_destroy defaults
  drm/mxsfb: Use .dumb_map_offset and .dumb_destroy defaults
  drm/meson: Use .dumb_map_offset and .dumb_destroy defaults
  drm/kirin: Use .dumb_map_offset and .dumb_destroy defaults
  drm/vc4: Continue the switch to drm_*_put() helpers
  drm/vc4: Fix leak of HDMI EDID
  dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly v2
  dma-buf: add reservation_object_copy_fences (v2)
  drm/tinydrm: add support for LEGO MINDSTORMS EV3 LCD
  ...
2017-08-17 07:33:41 +10:00
Arnd Bergmann
a968bc52fe SoC updates for omaps for v4.14. Most of the chages are to add
support for new dra762 SoC. The other changes are are for legacy
 DMA code removal, and MMC quirk and iodelay config for dra7.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAlmTIokRHHRvbnlAYXRv
 bWlkZS5jb20ACgkQG9Q+yVyrpXPLXw//bdosInQAgxNoRHOeXt8B8U4kmHrWVHIj
 DDuuY2Vh5/DBUbXFizpWi2O7LrzXOFMw3zWLzRnZsA56/sNlojwW70NV4J/KGHVa
 tmVQ12iEV+blLdt9rypoUgHT4XFtYZ8WEfo4XPjDB23DhPcRYgRikn0O0LqvUV2r
 nPMemfhGc/E1SIN/hGFUG7AqL3N+TPlFhmzZRLYsrbWECnSr17dpIgYbSD7Dviiq
 PuSd8jbj88ugPjtUjb6CXyV04o2uRcxFqHceJOghc0jQtARxJBxCzInMLZ4JPPH5
 ZBRNKfervmpc4b24Vmlf+23t2iMieOHpqrSvfYy1ErBzyWFgOF32w2kkqbBWO+oB
 TLd02DdX5ks3bnG6C5fVDk4ztSB6vUO8K+MatdDGnqbAc8f+RRUmYVDE35TTfyHK
 QcjG8fkC3sze3iO+Jlg6UPO8uGYXYN7wVxm6oJqnQ5R1gVSXTbt2LEDNuKy3usxR
 IkvkdjkVOei4pYrewgc7bkNLOQ+XMY1Mxy0G/XwoEG9SlCVcEi4N0Vn9LptgJRph
 Mm3kpb3E93U8qUCR0NnXCpMqrek4foSoOKGcOFFukvSWF67xvnoUbooIIzEtXh/R
 9K5ftL1XlgPzYcQEWEp3MT7q4FkgKSpDup+8eHOUH+ozhWi3n4umlm3FzqiT3vXM
 ADACrqr8CgM=
 =FBGl
 -----END PGP SIGNATURE-----

Merge tag 'omap-for-v4.14/soc-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into next/soc

Pull "soc changes for omaps for v4.14" from Tony Lindgren:

SoC updates for omaps for v4.14. Most of the chages are to add
support for new dra762 SoC. The other changes are are for legacy
DMA code removal, and MMC quirk and iodelay config for dra7.

* tag 'omap-for-v4.14/soc-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP: dra7: powerdomain data: Register SoC specific powerdomains
  ARM: dra762: Enable SMP for dra762
  ARM: dra7: hwmod: Register dra76x specific hwmod
  ARM: dra762: Add support for device identification
  ARM: OMAP2+: board-generic: add support for dra762 family
  ARM: OMAP2+: Select PINCTRL_TI_IODELAY for SOC_DRA7XX
  ARM: OMAP2+: Add pdata-quirks for MMC/SD on DRA74x EVM
  ARM: OMAP2+: Remove unused legacy code for DMA
2017-08-16 22:34:15 +02:00
Arnd Bergmann
45848f1d8c Small fixes and enhancements for the TEE subsystem
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCgAGBQJZhDYVAAoJELWwuEGXj+zTeVUQANy2asqPLy7PynoZx11JL7br
 V3zKHlG2Wagk+4QLPmoLI7JoQmpnjosZ+Gb9San7FbMZdkijInLhqrGkl50OpWx5
 w9EQQ5EVbitlcia6vXM7Y9ovjXNbwWcFvGg0WWJl4bBwLSn6DFXWp0fOUpDWSSCM
 9M9/Kw0XppGGtkcVDMBV/Kp9ty0sO9WyqZ8qlAIueF9DM5T8BcgHqiDxx0H/vBvj
 BsYVzQEjkLwdghanPvSZv/aKqdP1BvpAbp0xhI6XXFdlg4cXP+deLC5jjSuAfhbD
 CpbBFiyG45abpjmWJy7+xZ/CzC3gnjX+VMNjxflhTiqo2jqvlXiDFN5LXaOmhBkw
 VN1V1p0fyZQhjG4bRxYVGkPb+d1kUEZxj6y0NyEU842JvXb1emnqbB25isdB4tGu
 9v0wAqWqWYjerQOvDVuUmyM/bIDbctFNfE5D9KZ1TusF6bjGbzwmjap/+57BTCtV
 bbDr2yPPVgtpEchNKBKyfEWO2KEsvbhBsskFSKQftGBbta0mvljnUuglSUyPH2+Y
 mpWkLTr0LYbldAbjieoa0Br/smPSLX2NDqykWZuG9eNOGSE7yfLXSrnG6pt1NjQI
 yLNGC9IBJy+pMNOqaHCY5odO3gYl8my4guIKa+r6ywEQqPwN+wTU9LFqeniOFz4w
 3iox3mHwBu7X1B8AmfMU
 =rEq4
 -----END PGP SIGNATURE-----

Merge tag 'tee-drv-for-4.14' of http://git.linaro.org/people/jens.wiklander/linux-tee into next/drivers

Pull "Small fixes and enhancements for the TEE subsystem" from Jens Wiklander:

* tag 'tee-drv-for-4.14' of http://git.linaro.org/people/jens.wiklander/linux-tee:
  tee: optee: sync with new naming of interrupts
  tee: indicate privileged dev in gen_caps
  tee: optee: interruptible RPC sleep
  tee: optee: add const to tee_driver_ops and tee_desc structures
  tee: tee_shm: Constify dma_buf_ops structures.
  tee: add forward declaration for struct device
  tee: optee: fix uninitialized symbol 'parg'
2017-08-16 21:41:21 +02:00
Trond Myklebust
729749bb8d SUNRPC: Don't hold the transport lock across socket copy operations
Instead add a mechanism to ensure that the request doesn't disappear
from underneath us while copying from the socket. We do this by
preventing xprt_release() from freeing the XDR buffers until the
flag RPC_TASK_MSG_RECV has been cleared from the request.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2017-08-16 15:10:15 -04:00
John Fastabend
174a79ff95 bpf: sockmap with sk redirect support
Recently we added a new map type called dev map used to forward XDP
packets between ports (6093ec2dc3). This patches introduces a
similar notion for sockets.

A sockmap allows users to add participating sockets to a map. When
sockets are added to the map enough context is stored with the
map entry to use the entry with a new helper

  bpf_sk_redirect_map(map, key, flags)

This helper (analogous to bpf_redirect_map in XDP) is given the map
and an entry in the map. When called from a sockmap program, discussed
below, the skb will be sent on the socket using skb_send_sock().

With the above we need a bpf program to call the helper from that will
then implement the send logic. The initial site implemented in this
series is the recv_sock hook. For this to work we implemented a map
attach command to add attributes to a map. In sockmap we add two
programs a parse program and a verdict program. The parse program
uses strparser to build messages and pass them to the verdict program.
The parse programs use the normal strparser semantics. The verdict
program is of type SK_SKB.

The verdict program returns a verdict SK_DROP, or  SK_REDIRECT for
now. Additional actions may be added later. When SK_REDIRECT is
returned, expected when bpf program uses bpf_sk_redirect_map(), the
sockmap logic will consult per cpu variables set by the helper routine
and pull the sock entry out of the sock map. This pattern follows the
existing redirect logic in cls and xdp programs.

This gives the flow,

 recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
                                                     \
                                                      -> kfree_skb

As an example use case a message based load balancer may use specific
logic in the verdict program to select the sock to send on.

Sample programs are provided in future patches that hopefully illustrate
the user interfaces. Also selftests are in follow-on patches.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 11:27:53 -07:00
John Fastabend
a6f6df69c4 bpf: export bpf_prog_inc_not_zero
bpf_prog_inc_not_zero will be used by upcoming sockmap patches this
patch simply exports it so we can pull it in.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 11:27:53 -07:00
John Fastabend
b005fd189c bpf: introduce new program type for skbs on sockets
A class of programs, run from strparser and soon from a new map type
called sock map, are used with skb as the context but on established
sockets. By creating a specific program type for these we can use
bpf helpers that expect full sockets and get the verifier to ensure
these helpers are not used out of context.

The new type is BPF_PROG_TYPE_SK_SKB. This patch introduces the
infrastructure and type.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-16 11:27:53 -07:00
Paul Burton
0d58e6c1b1 PCI: Add pci_irqd_intx_xlate()
Legacy PCI INTx interrupts are represented in the PCI_INTERRUPT_PIN
register using the range 1-4, which matches our enum pci_interrupt_pin.
This is however not ideal for an IRQ domain, where with 4 interrupts we
would ideally have a domain of size 4 & hwirq numbers in the range 0-3.

Different PCI host controller drivers have handled this in different ways.
Of those under drivers/pci/ which register an INTx IRQ domain, we have:

  - pcie-altera uses the range 1-4 in device trees and an IRQ domain of
    size 5 to cover that range, with entry 0 wasted.

  - pcie-xilinx & pcie-xilinx-nwl use the range 1-4 in device trees but
    register an IRQ domain of size 4, which doesn't cover the hwirq=4/INTD
    case leading to that interrupt being broken.

  - pci-ftpci100 & pci-aardvark use the range 0-3 in both device trees & as
    hwirq numbering in the driver & IRQ domain.

In order to introduce some level of consistency in at least the hwirq
numbering used by the drivers & IRQ domains, this patch introduces a new
pci_irqd_intx_xlate() helper function which drivers using the 1-4 range in
device trees can assign as the xlate callback for their INTx IRQ domain.
This translates the 1-4 range into a 0-3 range, allowing us to use an IRQ
domain of size 4 & avoid a wasted entry. Further patches will make use of
this in drivers to allow them to use an IRQ domain of size 4 for legacy
INTx interrupts without breaking INTD.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-16 11:38:32 -05:00
K. Y. Srinivasan
6f3d791f30 Drivers: hv: vmbus: Fix rescind handling issues
This patch handles the following issues that were observed when we are
handling racing channel offer message and rescind message for the same
offer:

1. Since the host does not respond to messages on a rescinded channel,
in the current code, we could be indefinitely blocked on the vmbus_open() call.

2. When a rescinded channel is being closed, if there is a pending interrupt on the
channel, we could end up freeing the channel that the interrupt handler would run on.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-16 09:16:29 -07:00
Bartosz Golaszewski
44e72c7ebf genirq/irq_sim: Add a devres variant of irq_sim_init()
Add a resource managed version of irq_sim_init(). This can be
conveniently used in device drivers.

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-gpio@vger.kernel.org
Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Link: http://lkml.kernel.org/r/20170814145318.6495-3-brgl@bgdev.pl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-16 16:40:02 +02:00
Bartosz Golaszewski
b19af510e6 genirq/irq_sim: Add a simple interrupt simulator framework
Implement a simple, irq_work-based framework for simulating
interrupts. Currently the API exposes routines for initializing and
deinitializing the simulator object, enqueueing the interrupts and
retrieving the allocated interrupt numbers based on the offset of the
dummy interrupt in the simulator struct.

Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-gpio@vger.kernel.org
Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Link: http://lkml.kernel.org/r/20170814145318.6495-2-brgl@bgdev.pl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-08-16 16:40:02 +02:00
Laurent Pinchart
739a6d5d64 drm: omapdrm: Remove omapdrm platform data
The omapdrm platform data are not used anymore, remove them.

Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
2017-08-16 15:38:52 +03:00
David S. Miller
463910e2df Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-15 20:23:23 -07:00
Linus Torvalds
510c8a899c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel
    Grumbach.

 2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From
    Vivien Didelot.

 3) Fix two regressions with bonding on wireless, from Andreas Born.

 4) Fix build when busypoll is disabled, from Daniel Borkmann.

 5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet.

 6) Set SKB cached route properly in inet_rtm_getroute(), from Florian
    Westphal.

 7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding
    Tianhong.

 8) Fix module refcnt leak in ULP code, from Sabrina Dubroca.

 9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric
    Dumazet.

10) Need to purge socket write queue in dccp_destroy_sock(), also from
    Eric Dumazet.

11) Make bpf_trace_printk() work properly on 32-bit architectures, from
    Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
  bpf: fix bpf_trace_printk on 32 bit archs
  PCI: fix oops when try to find Root Port for a PCI device
  sfc: don't try and read ef10 data on non-ef10 NIC
  net_sched: remove warning from qdisc_hash_add
  net_sched/sfq: update hierarchical backlog when drop packet
  net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
  ipv4: fix NULL dereference in free_fib_info_rcu()
  net: Fix a typo in comment about sock flags.
  ipv6: fix NULL dereference in ip6_route_dev_notify()
  tcp: fix possible deadlock in TCP stack vs BPF filter
  dccp: purge write queue in dccp_destroy_sock()
  udp: fix linear skb reception with PEEK_OFF
  ipv6: release rt6->rt6i_idev properly during ifdown
  af_key: do not use GFP_KERNEL in atomic contexts
  tcp: ulp: avoid module refcnt leak in tcp_set_ulp
  net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  PCI: Disable Relaxed Ordering Attributes for AMD A1100
  PCI: Disable Relaxed Ordering for some Intel processors
  PCI: Disable PCIe Relaxed Ordering if unsupported
  ...
2017-08-15 18:52:28 -07:00
Chanwoo Choi
ab8a8fbe93 extcon: Use tab instead of space for indentation
The extcon header file defines the functions which used the mismatched
indentation and used the space on some case. So, this patch clean-up
the indentation in order to improve the readbility.

And this patch changes the return value of extcon_get_extcon_dev()
because of maintaing the same value with extcon_get_edev_by_phandle().

Signed-off-by: Chanwoo Choi <cwchoi00@gmail.com>
2017-08-16 09:27:55 +09:00
Chanwoo Choi
6ab6094f37 extcon: Correct description to improve the readability
The extcon files explains the detailed operation for functions and
what is meaning of extcon structure. There are different explanation
even if the same argument.

So, it modifies the description for both functions and structures
in order to improve the readability and guide the role of functions
more well.

Also, this patch fixes the mismatching license info as a GPL v2
and removes the inactive author information.

Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
2017-08-16 09:27:55 +09:00
Chanwoo Choi
ae3614fda9 Merge branch 'ib-extcon-usb-phy-4.14' into extcon-next 2017-08-16 09:22:55 +09:00
Chanwoo Choi
808ae8f3c7 extcon: Remove deprecated extcon_set/get_cable_state_()
The commit 575c2b867e ("extcon: Rename the extcon_set/get_state()
to maintain the function naming pattern") renames the extcon function as
following: But, the extcon just keeps the old API to prevent the build error.
This patch removes the deprecatd extcon API.

- extcon_get_cable_state_() -> extcon_get_state()
- extcon_set_cable_state_() -> extcon_set_state_sync()

Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
2017-08-16 09:21:49 +09:00
Tonghao Zhang
b3dc8f772f net: Fix a typo in comment about sock flags.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 17:07:17 -07:00
Edward Cree
dc503a8ad9 bpf/verifier: track liveness for pruning
State of a register doesn't matter if it wasn't read in reaching an exit;
 a write screens off all reads downstream of it from all explored_states
 upstream of it.
This allows us to prune many more branches; here are some processed insn
 counts for some Cilium programs:
Program                  before  after
bpf_lb_opt_-DLB_L3.o       6515   3361
bpf_lb_opt_-DLB_L4.o       8976   5176
bpf_lb_opt_-DUNKNOWN.o     2960   1137
bpf_lxc_opt_-DDROP_ALL.o  95412  48537
bpf_lxc_opt_-DUNKNOWN.o  141706  78718
bpf_netdev.o              24251  17995
bpf_overlay.o             10999   9385

The runtime is also improved; here are 'time' results in ms:
Program                  before  after
bpf_lb_opt_-DLB_L3.o         24      6
bpf_lb_opt_-DLB_L4.o         26     11
bpf_lb_opt_-DUNKNOWN.o       11      2
bpf_lxc_opt_-DDROP_ALL.o   1288    139
bpf_lxc_opt_-DUNKNOWN.o    1768    234
bpf_netdev.o                 62     31
bpf_overlay.o                15     13

Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 16:32:33 -07:00
Paul Burton
b352baf15b PCI: Move enum pci_interrupt_pin to linux/pci.h
We currently have a definition of enum pci_interrupt_pin in a header
specific to PCI endpoints - linux/pci-epf.h. In order to allow for use of
this enum from PCI host code in a future commit, move its definition to
linux/pci.h & include that from linux/pci-epf.h.

Additionally we add a PCI_NUM_INTX macro which indicates the number of PCI
INTx interrupts, and will be used alongside enum pci_interrupt_pin in
further patches.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
[bhelgaas: move enum pci_interrupt_pin outside #ifdef CONFIG_PCI]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-15 15:53:50 -05:00
Catalin Marinas
df5b95bee1 Merge branch 'arm64/vmap-stack' of git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux into for-next/core
* 'arm64/vmap-stack' of git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux:
  arm64: add VMAP_STACK overflow detection
  arm64: add on_accessible_stack()
  arm64: add basic VMAP_STACK support
  arm64: use an irq stack pointer
  arm64: assembler: allow adr_this_cpu to use the stack pointer
  arm64: factor out entry stack manipulation
  efi/arm64: add EFI_KIMG_ALIGN
  arm64: move SEGMENT_ALIGN to <asm/memory.h>
  arm64: clean up irq stack definitions
  arm64: clean up THREAD_* definitions
  arm64: factor out PAGE_* and CONT_* definitions
  arm64: kernel: remove {THREAD,IRQ_STACK}_START_SP
  fork: allow arch-override of VMAP stack alignment
  arm64: remove __die()'s stack dump
2017-08-15 18:40:58 +01:00
Mark Rutland
48ac3c18cc fork: allow arch-override of VMAP stack alignment
In some cases, an architecture might wish its stacks to be aligned to a
boundary larger than THREAD_SIZE. For example, using an alignment of
double THREAD_SIZE can allow for stack overflows smaller than
THREAD_SIZE to be detected by checking a single bit of the stack
pointer.

This patch allows architectures to override the alignment of VMAP'd
stacks, by defining THREAD_ALIGN. Where not defined, this defaults to
THREAD_SIZE, as is the case today.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: linux-kernel@vger.kernel.org
2017-08-15 18:34:46 +01:00
Joerg Roedel
9a005a800a iommu/iova: Add flush timer
Add a timer to flush entries from the Flush-Queues every
10ms. This makes sure that no stale TLB entries remain for
too long after an IOVA has been unmapped.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:23:52 +02:00
Joerg Roedel
8109c2a2f8 iommu/iova: Add locking to Flush-Queues
The lock is taken from the same CPU most of the time. But
having it allows to flush the queue also from another CPU if
necessary.

This will be used by a timer to regularily flush any pending
IOVAs from the Flush-Queues.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:23:52 +02:00
Joerg Roedel
fb418dab8a iommu/iova: Add flush counters to Flush-Queue implementation
There are two counters:

	* fq_flush_start_cnt  - Increased when a TLB flush
	                        is started.

	* fq_flush_finish_cnt - Increased when a TLB flush
				is finished.

The fq_flush_start_cnt is assigned to every Flush-Queue
entry on its creation. When freeing entries from the
Flush-Queue, the value in the entry is compared to the
fq_flush_finish_cnt. The entry can only be freed when its
value is less than the value of fq_flush_finish_cnt.

The reason for these counters it to take advantage of IOMMU
TLB flushes that happened on other CPUs. These already
flushed the TLB for Flush-Queue entries on other CPUs so
that they can already be freed without flushing the TLB
again.

This makes it less likely that the Flush-Queue is full and
saves IOMMU TLB flushes.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:23:51 +02:00
Joerg Roedel
1928210107 iommu/iova: Implement Flush-Queue ring buffer
Add a function to add entries to the Flush-Queue ring
buffer. If the buffer is full, call the flush-callback and
free the entries.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:23:51 +02:00
Joerg Roedel
42f87e71c3 iommu/iova: Add flush-queue data structures
This patch adds the basic data-structures to implement
flush-queues in the generic IOVA code. It also adds the
initialization and destroy routines for these data
structures.

The initialization routine is designed so that the use of
this feature is optional for the users of IOVA code.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:23:50 +02:00
Baoquan He
e01d1913b0 iommu: Add is_attach_deferred call-back to iommu-ops
This new call-back will be used to check if the domain attach need be
deferred for now. If yes, the domain attach/detach will return directly.

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-08-15 18:14:39 +02:00
Nick Terrell
73f3d1b48f lib: Add zstd modules
Add zstd compression and decompression kernel modules.
zstd offers a wide varity of compression speed and quality trade-offs.
It can compress at speeds approaching lz4, and quality approaching lzma.
zstd decompressions at speeds more than twice as fast as zlib, and
decompression speed remains roughly the same across all compression levels.

The code was ported from the upstream zstd source repository. The
`linux/zstd.h` header was modified to match linux kernel style.
The cross-platform and allocation code was stripped out. Instead zstd
requires the caller to pass a preallocated workspace. The source files
were clang-formatted [1] to match the Linux Kernel style as much as
possible. Otherwise, the code was unmodified. We would like to avoid
as much further manual modification to the source code as possible, so it
will be easier to keep the kernel zstd up to date.

I benchmarked zstd compression as a special character device. I ran zstd
and zlib compression at several levels, as well as performing no
compression, which measure the time spent copying the data to kernel space.
Data is passed to the compresser 4096 B at a time. The benchmark file is
located in the upstream zstd source repository under
`contrib/linux-kernel/zstd_compress_test.c` [2].

I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

    sudo modprobe zstd_compress_test
    sudo mknod zstd_compress_test c 245 0
    sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

    1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

    1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B) | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|----------|----------|-------|---------|----------|----------|
| none     | 11988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  | 73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  | 66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  | 65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 | 60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 | 58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 | 54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  | 77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  | 72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  | 68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  | 67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

Tested in userland using the test-suite in the zstd repo under
`contrib/linux-kernel/test/UserlandTest.cpp` [5] by mocking the kernel
functions. Fuzz tested using libfuzzer [6] with the fuzz harnesses under
`contrib/linux-kernel/test/{RoundTripCrash.c,DecompressCrash.c}` [7] [8]
with ASAN, UBSAN, and MSAN. Additionaly, it was tested while testing the
BtrFS and SquashFS patches coming next.

[1] https://clang.llvm.org/docs/ClangFormat.html
[2] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/zstd_compress_test.c
[3] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
[4] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/zstd_decompress_test.c
[5] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/UserlandTest.cpp
[6] http://llvm.org/docs/LibFuzzer.html
[7] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/RoundTripCrash.c
[8] https://github.com/facebook/zstd/blob/dev/contrib/linux-kernel/test/DecompressCrash.c

zstd source repository: https://github.com/facebook/zstd

Signed-off-by: Nick Terrell <terrelln@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2017-08-15 09:02:08 -07:00