Commit graph

737480 commits

Author SHA1 Message Date
Anna-Maria Gleixner
ebba2c723f hrtimer: Make hrtimer_force_reprogramm() unconditionally available
hrtimer_force_reprogram() needs to be available unconditionally for softirq
based hrtimers. Move the function and all required struct members out of
the CONFIG_HIGH_RES_TIMERS #ifdef.

There is no functional change because hrtimer_force_reprogram() is only
invoked when hrtimer_cpu_base.hres_active is true and
CONFIG_HIGH_RES_TIMERS=y.

Making it unconditional increases the text size for the
CONFIG_HIGH_RES_TIMERS=n case slightly, but avoids replication of that code
for the upcoming softirq based hrtimers support. Most of the code gets
eliminated in the CONFIG_HIGH_RES_TIMERS=n case by the compiler.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-19-anna-maria@linutronix.de
[ Made it build on !CONFIG_HIGH_RES_TIMERS ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:53:28 +01:00
Roman Gushchin
45e5e1212a bpftool: recognize BPF_PROG_TYPE_CGROUP_DEVICE programs
Bpftool doesn't recognize BPF_PROG_TYPE_CGROUP_DEVICE programs,
so the prog show command prints the numeric type value:

$ bpftool prog show
1: type 15  name bpf_prog1  tag ac9f93dbfd6d9b74
	loaded_at Jan 15/07:58  uid 0
	xlated 96B  jited 105B  memlock 4096B

This patch defines the corresponding textual representation:

$ bpftool prog show
1: cgroup_device  name bpf_prog1  tag ac9f93dbfd6d9b74
	loaded_at Jan 15/07:58  uid 0
	xlated 96B  jited 105B  memlock 4096B

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Quentin Monnet <quentin.monnet@netronome.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-16 02:49:49 +01:00
Anna-Maria Gleixner
11a9fe069e hrtimer: Make hrtimer_reprogramm() unconditional
hrtimer_reprogram() needs to be available unconditionally for softirq based
hrtimers. Move the function and all required struct members out of the
CONFIG_HIGH_RES_TIMERS #ifdef.

There is no functional change because hrtimer_reprogram() is only invoked
when hrtimer_cpu_base.hres_active is true. Making it unconditional
increases the text size for the CONFIG_HIGH_RES_TIMERS=n case, but avoids
replication of that code for the upcoming softirq based hrtimers support.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-18-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
eb27926ba0 hrtimer: Make hrtimer_cpu_base.next_timer handling unconditional
hrtimer_cpu_base.next_timer stores the pointer to the next expiring timer
in a CPU base.

This pointer cannot be dereferenced and is solely used to check whether a
hrtimer which is removed is the hrtimer which is the first to expire in the
CPU base. If this is the case, then the timer hardware needs to be
reprogrammed to avoid an extra interrupt for nothing.

Again, this is conditional functionality, but there is no compelling reason
to make this conditional. As a preparation, hrtimer_cpu_base.next_timer
needs to be available unconditonally.

Aside of that the upcoming support for softirq based hrtimers requires access
to this pointer unconditionally as well, so our motivation is not entirely
simplicity based.

Make the update of hrtimer_cpu_base.next_timer unconditional and remove the
#ifdef cruft. The impact on CONFIG_HIGH_RES_TIMERS=n && CONFIG_NOHZ=n is
marginal as it's just a store on an already dirtied cacheline.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-17-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
07a9a7eae8 hrtimer: Make the remote enqueue check unconditional
hrtimer_cpu_base.expires_next is used to cache the next event armed in the
timer hardware. The value is used to check whether an hrtimer can be
enqueued remotely. If the new hrtimer is expiring before expires_next, then
remote enqueue is not possible as the remote hrtimer hardware cannot be
accessed for reprogramming to an earlier expiry time.

The remote enqueue check is currently conditional on
CONFIG_HIGH_RES_TIMERS=y and hrtimer_cpu_base.hres_active. There is no
compelling reason to make this conditional.

Move hrtimer_cpu_base.expires_next out of the CONFIG_HIGH_RES_TIMERS=y
guarded area and remove the conditionals in hrtimer_check_target().

The check is currently a NOOP for the CONFIG_HIGH_RES_TIMERS=n and the
!hrtimer_cpu_base.hres_active case because in these cases nothing updates
hrtimer_cpu_base.expires_next yet. This will be changed with later patches
which further reduce the #ifdef zoo in this code.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-16-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
851cff8caf hrtimer: Use accesor functions instead of direct access
__hrtimer_hres_active() is now available unconditionally, so replace open
coded direct accesses to hrtimer_cpu_base.hres_active.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-15-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
28bfd18bf3 hrtimer: Make the hrtimer_cpu_base::hres_active field unconditional, to simplify the code
The hrtimer_cpu_base::hres_active_member field depends on CONFIG_HIGH_RES_TIMERS=y
currently, and all related functions to this member are conditional as well.

To simplify the code make it unconditional and set it to zero during initialization.

(This will also help with the upcoming softirq based hrtimers code.)

The conditional code sections can be avoided by adding IS_ENABLED(HIGHRES)
conditionals into common functions, which ensures dead code elimination.

There is no functional change.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-14-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:47 +01:00
Anna-Maria Gleixner
da21c5a58a hrtimer: Make room in 'struct hrtimer_cpu_base'
The upcoming softirq based hrtimers support requires an additional field in
the hrtimer_cpu_base struct, which would grow the struct size beyond a
cache line.

The hrtimer_cpu_base::nr_retries and ::nr_hangs members are solely
used for diagnostic output and have no requirement to be 'unsigned int'.

Make them 'unsigned short' to create room for the new struct member.

No functional change.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-13-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
3f0b9e8eec hrtimer: Store running timer in hrtimer_clock_base
The pointer to the currently running timer is stored in hrtimer_cpu_base
before the base lock is dropped and the callback is invoked.

This results in two levels of indirections and the upcoming support for
softirq based hrtimer requires splitting the "running" storage into soft
and hard IRQ context expiry.

Storing both in the cpu base would require conditionals in all code paths
accessing that information.

It's possible to have a per clock base sequence count and running pointer
without changing the semantics of the related mechanisms because the timer
base pointer cannot be changed while a timer is running the callback.

Unfortunately this makes cpu_clock base larger than 32 bytes on 32-bit
kernels. Instead of having huge gaps due to alignment, remove the alignment
and let the compiler pack CPU base for 32-bit kernels. The resulting cache access
patterns are fortunately not really different from the current
behaviour. On 64-bit kernels the 64-byte alignment stays and the behaviour is
unchanged. This was determined by analyzing the resulting layout and
looking at the number of cache lines involved for the frequently used
clocks.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-12-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
c272ca58c3 hrtimer: Switch 'for' loop to _ffs() evaluation
Looping over all clock bases to find active bits is suboptimal if not all
bases are active.

Avoid this by converting it to a __ffs() evaluation. The functionallity is
outsourced into its own function and is called via a macro as suggested by
Peter Zijlstra.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-11-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
63e2ed3659 tracing/hrtimer: Print the hrtimer mode in the 'hrtimer_start' tracepoint
The 'hrtimer_start' tracepoint lacks the mode information. The mode is
important because consecutive starts can switch from ABS to REL or from
PINNED to non PINNED.

Append the mode field.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-10-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:46 +01:00
Anna-Maria Gleixner
91633eed73 tracing/hrtimer: Fix tracing bugs by taking all clock bases and modes into account
So far only CLOCK_MONOTONIC and CLOCK_REALTIME were taken into account as
well as HRTIMER_MODE_ABS/REL in the hrtimer_init tracepoint. The query for
detecting the ABS or REL timer modes is not valid anymore, it got broken
by the introduction of HRTIMER_MODE_PINNED.

HRTIMER_MODE_PINNED is not evaluated in the hrtimer_init() call, but for the
sake of completeness print all given modes.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-9-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
19b51cb5ff hrtimer: Clean up 'enum hrtimer_mode'
It's not obvious that the HRTIMER_MODE variants are bit combinations,
because all modes are hard coded constants currently.

Change it so the bit meanings are clear; and use the symbols for creating
modes which combine bits.

While at it get rid of the ugly tail comments as well.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-8-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
48d0c9becc hrtimer: Ensure POSIX compliance (relative CLOCK_REALTIME hrtimers)
The POSIX specification defines that relative CLOCK_REALTIME timers are not
affected by clock modifications. Those timers have to use CLOCK_MONOTONIC
to ensure POSIX compliance.

The introduction of the additional HRTIMER_MODE_PINNED mode broke this
requirement for pinned timers.

There is no user space visible impact because user space timers are not
using pinned mode, but for consistency reasons this needs to be fixed.

Check whether the mode has the HRTIMER_MODE_REL bit set instead of
comparing with HRTIMER_MODE_ABS.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Fixes: 597d027573 ("timers: Framework for identifying pinned timers")
Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
6de6250c75 hrtimer: Fix hrtimer_start[_range_ns]() function descriptions
The hrtimer_start[_range_ns]() functions start a timer reliably on this CPU only
when HRTIMER_MODE_PINNED is set.

Furthermore the HRTIMER_MODE_PINNED mode is not considered when a hrtimer is initialized.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-6-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:45 +01:00
Anna-Maria Gleixner
907777136f hrtimer: Clean up the 'int clock' parameter of schedule_hrtimeout_range_clock()
schedule_hrtimeout_range_clock() uses an 'int clock' parameter for the
clock ID, instead of the customary predefined "clockid_t" type.

In hrtimer coding style the canonical variable name for the clock ID is
'clock_id', therefore change the name of the parameter here as well
to make it all consistent.

While at it, clean up the description for the 'clock_id' and 'mode'
function parameters. The clock modes and the clock IDs are not
restricted as the comment suggests.

Fix the mode description as well for the callers of schedule_hrtimeout_range_clock().

No functional changes intended.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-5-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Anna-Maria Gleixner
1fbc78b3c9 hrtimer: Fix kerneldoc syntax for 'struct hrtimer_cpu_base'
The '/**' sequence marks the start of a structure description. Add the
missing second asterisk. While at it adapt the ordering of the struct
members to the struct definition and document the purpose of
expires_next more precisely.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-4-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Thomas Gleixner
d05ca13b8d hrtimer: Correct blatantly incorrect comment
The protection of a hrtimer which runs its callback against migration to a
different CPU has nothing to do with hard interrupt context.

The protection against migration of a hrtimer running the expiry callback
is the pointer in the cpu_base which holds a pointer to the currently
running timer. This pointer is evaluated in the code which potentially
switches the timer base and makes sure it's kept on the CPU on which the
callback is running.

Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/20171221104205.7269-3-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Thomas Gleixner
ae67badaa1 hrtimer: Optimize the hrtimer code by using static keys for migration_enable/nohz_active
The hrtimer_cpu_base::migration_enable and ::nohz_active fields
were originally introduced to avoid accessing global variables
for these decisions.

Still that results in a (cache hot) load and conditional branch,
which can be avoided by using static keys.

Implement it with static keys and optimize for the most critical
case of high performance networking which tends to disable the
timer migration functionality.

No change in functionality.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: keescook@chromium.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1801142327490.2371@nanos
Link: https://lkml.kernel.org/r/20171221104205.7269-2-anna-maria@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:35:44 +01:00
Ingo Molnar
57957fb519 Merge branch 'timers/urgent' into timers/core, to pick up dependent fix
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 02:33:42 +01:00
Paul Mackerras
5855564c8a KVM: PPC: Book3S HV: Enable migration of decrementer register
This adds a register identifier for use with the one_reg interface
to allow the decrementer expiry time to be read and written by
userspace.  The decrementer expiry time is in guest timebase units
and is equal to the sum of the decrementer and the guest timebase.
(The expiry time is used rather than the decrementer value itself
because the expiry time is not constantly changing, though the
decrementer value is, while the guest vcpu is not running.)

Without this, a guest vcpu migrated to a new host will see its
decrementer set to some random value.  On POWER8 and earlier, the
decrementer is 32 bits wide and counts down at 512MHz, so the
guest vcpu will potentially see no decrementer interrupts for up
to about 4 seconds, which will lead to a stall.  With POWER9, the
decrementer is now 56 bits side, so the stall can be much longer
(up to 2.23 years) and more noticeable.

To help work around the problem in cases where userspace has not been
updated to migrate the decrementer expiry time, we now set the
default decrementer expiry at vcpu creation time to the current time
rather than the maximum possible value.  This should mean an
immediate decrementer interrupt when a migrated vcpu starts
running.  In cases where the decrementer is 32 bits wide and more
than 4 seconds elapse between the creation of the vcpu and when it
first runs, the decrementer would have wrapped around to positive
values and there may still be a stall - but this is no worse than
the current situation.  In the large-decrementer case, we are sure
to get an immediate decrementer interrupt (assuming the time from
vcpu creation to first run is less than 2.23 years) and we thus
avoid a very long stall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-16 11:54:45 +11:00
Arnd Bergmann
41e4b39115 netfilter: nf_defrag: move NF_CONNTRACK bits into #ifdef
We cannot access the skb->_nfct field when CONFIG_NF_CONNTRACK is
disabled:

net/ipv4/netfilter/nf_defrag_ipv4.c: In function 'ipv4_conntrack_defrag':
net/ipv4/netfilter/nf_defrag_ipv4.c:83:9: error: 'struct sk_buff' has no member named '_nfct'
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c: In function 'ipv6_defrag':
net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68:9: error: 'struct sk_buff' has no member named '_nfct'

Both functions already have an #ifdef for this, so let's move the
check in there.

Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-16 01:52:08 +01:00
Arnd Bergmann
b069b37adb netfilter: nf_defrag: mark xt_table structures 'const' again
As a side-effect of adding the module option, we now get a section
mismatch warning:

WARNING: net/ipv4/netfilter/iptable_raw.o(.data+0x1c): Section mismatch in reference from the variable packet_raw to the function .init.text:iptable_raw_table_init()
The variable packet_raw references
the function __init iptable_raw_table_init()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console

Apparently it's ok to link to a __net_init function from .rodata but not
from .data. We can address this by rearranging the logic so that the
structure is read-only again. Instead of writing to the .priority field
later, we have an extra copies of the structure with that flag. An added
advantage is that that we don't have writable function pointers with this
approach.

Fixes: 902d6a4c2a ("netfilter: nf_defrag: Skip defrag if NOTRACK is set")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-16 01:52:07 +01:00
Subash Abhinov Kasiviswanathan
83f1999cae netfilter: ipv6: nf_defrag: Pass on packets to stack per RFC2460
ipv6_defrag pulls network headers before fragment header. In case of
an error, the netfilter layer is currently dropping these packets.
This results in failure of some IPv6 standards tests which passed on
older kernels due to the netfilter framework using cloning.

The test case run here is a check for ICMPv6 error message replies
when some invalid IPv6 fragments are sent. This specific test case is
listed in https://www.ipv6ready.org/docs/Core_Conformance_Latest.pdf
in the Extension Header Processing Order section.

A packet with unrecognized option Type 11 is sent and the test expects
an ICMP error in line with RFC2460 section 4.2 -

11 - discard the packet and, only if the packet's Destination
     Address was not a multicast address, send an ICMP Parameter
     Problem, Code 2, message to the packet's Source Address,
     pointing to the unrecognized Option Type.

Since netfilter layer now drops all invalid IPv6 frag packets, we no
longer see the ICMP error message and fail the test case.

To fix this, save the transport header. If defrag is unable to process
the packet due to RFC2460, restore the transport header and allow packet
to be processed by stack. There is no change for other packet
processing paths.

Tested by confirming that stack sends an ICMP error when it receives
these packets. Also tested that fragmented ICMP pings succeed.

v1->v2: Instead of cloning always, save the transport_header and
restore it in case of this specific error. Update the title and
commit message accordingly.

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-16 01:52:06 +01:00
Florian Westphal
e3eeacbac4 netfilter: x_tables: don't return garbage pointer on modprobe failure
request_module may return a positive error result from modprobe,
if we cast this to ERR_PTR this returns a garbage result (it passes
IS_ERR checks).

Fix it by ignoring modprobe return values entirely, just retry the
table lookup instead.

Reported-by: syzbot+980925dbfbc7f93bc2ef@syzkaller.appspotmail.com
Fixes: 03d13b6868 ("netfilter: xtables: add and use xt_request_find_table_lock")
Fixes: 20651cefd2 ("netfilter: x_tables: unbreak module auto loading")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-16 01:51:59 +01:00
Tom Lendacky
107cd25321 x86/mm: Encrypt the initrd earlier for BSP microcode update
Currently the BSP microcode update code examines the initrd very early
in the boot process.  If SME is active, the initrd is treated as being
encrypted but it has not been encrypted (in place) yet.  Update the
early boot code that encrypts the kernel to also encrypt the initrd so
that early BSP microcode updates work.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192634.6026.10452.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:50:59 +01:00
Tom Lendacky
cc5f01e28d x86/mm: Prepare sme_encrypt_kernel() for PAGE aligned encryption
In preparation for encrypting more than just the kernel, the encryption
support in sme_encrypt_kernel() needs to support 4KB page aligned
encryption instead of just 2MB large page aligned encryption.

Update the routines that populate the PGD to support non-2MB aligned
addresses.  This is done by creating PTE page tables for the start
and end portion of the address range that fall outside of the 2MB
alignment.  This results in, at most, two extra pages to hold the
PTE entries for each mapping of a range.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192626.6026.75387.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:50:58 +01:00
Tom Lendacky
2b5d00b6c2 x86/mm: Centralize PMD flags in sme_encrypt_kernel()
In preparation for encrypting more than just the kernel during early
boot processing, centralize the use of the PMD flag settings based
on the type of mapping desired.  When 4KB aligned encryption is added,
this will allow either PTE flags or large page PMD flags to be used
without requiring the caller to adjust.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192615.6026.14767.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:50:58 +01:00
Tom Lendacky
bacf6b499e x86/mm: Use a struct to reduce parameters for SME PGD mapping
In preparation for follow-on patches, combine the PGD mapping parameters
into a struct to reduce the number of function arguments and allow for
direct updating of the next pagetable mapping area pointer.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192605.6026.96206.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:50:58 +01:00
Tom Lendacky
1303880179 x86/mm: Clean up register saving in the __enc_copy() assembly code
Clean up the use of PUSH and POP and when registers are saved in the
__enc_copy() assembly function in order to improve the readability of the code.

Move parameter register saving into general purpose registers earlier
in the code and move all the pushes to the beginning of the function
with corresponding pops at the end.

We do this to prepare fixes.

Tested-by: Gabriel Craciunescu <nix.or.die@gmail.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180110192556.6026.74187.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:50:58 +01:00
Masamitsu Yamazaki
bd1c06a4f5 ipmi: Clear smi_info->thread to prevent use-after-free during module unload
During code inspection, I found an use-after-free possibility during unloading
ipmi_si in the polling mode.

If start_new_msg() is called after kthread_stop(), the function will try to
wake up non-existing kthread using the dangling pointer.

Possible scenario is when a new internal message is generated after
ipmi_unregister_smi()[*1] and remains after stop_timer_and_thread()
in clenaup_one_si() [*2].
Use-after-free could occur as follows depending on BMC replies.

  cleanup_one_si
    => ipmi_unregister_smi
       [*1]
    => stop_timer_and_thread
       => kthread_stop(smi_info->thread)
       [*2]
    => poll
       => smi_event_handler
          => start_new_msg
             => if (smi_info->thread)
                    wake_up_process(smi_info->thread) <== use-after-free!!

Although currently it seems no such message is generated in the polling mode,
some changes might introduce that in thefuture. For example in the interrupt
mode, disable_si_irq() does that at [*2].

So let's prevent such a critical issue possibility now.

Signed-off-by: Yamazaki Masamitsu <m-yamazaki@ah.jp.nec.com>
Signed-off-by: Corey Minyard <cminyard@mvista.com>
2018-01-15 18:34:34 -06:00
Josh Poimboeuf
385d11b152 objtool: Improve error message for bad file argument
If a nonexistent file is supplied to objtool, it complains with a
non-helpful error:

  open: No such file or directory

Improve it to:

  objtool: Can't open 'foo': No such file or directory

Reported-by: Markus <M4rkusXXL@web.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/406a3d00a21225eee2819844048e17f68523ccf6.1516025651.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:27:27 +01:00
Josh Poimboeuf
2a0098d706 objtool: Fix seg fault with gold linker
Objtool segfaults when the gold linker is used with
CONFIG_MODVERSIONS=y and CONFIG_UNWINDER_ORC=y.

With CONFIG_MODVERSIONS=y, the .o file gets passed to the linker before
being passed to objtool.  The gold linker seems to strip unused ELF
symbols by default, which confuses objtool and causes the seg fault when
it's trying to generate ORC metadata.

Objtool should really be running immediately after GCC anyway, without a
linker call in between.  Change the makefile ordering so that objtool is
called before the linker.

Reported-and-tested-by: Markus <M4rkusXXL@web.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: ee9f8fce99 ("x86/unwind: Add the ORC unwinder")
Link: http://lkml.kernel.org/r/355f04da33581f4a3bf82e5b512973624a1e23a2.1516025651.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-16 01:27:27 +01:00
Arnd Bergmann
9be9d04b28 netfilter: nf_tables: flow_offload depends on flow_table
Without CONFIG_NF_FLOW_TABLE, the new nft_flow_offload module produces
a link error:

net/netfilter/nft_flow_offload.o: In function `nft_flow_offload_iterate_cleanup':
nft_flow_offload.c:(.text+0xb0): undefined reference to `nf_flow_table_iterate'
net/netfilter/nft_flow_offload.o: In function `flow_offload_iterate_cleanup':
nft_flow_offload.c:(.text+0x160): undefined reference to `flow_offload_dead'
net/netfilter/nft_flow_offload.o: In function `nft_flow_offload_eval':
nft_flow_offload.c:(.text+0xc4c): undefined reference to `flow_offload_alloc'
nft_flow_offload.c:(.text+0xc64): undefined reference to `flow_offload_add'
nft_flow_offload.c:(.text+0xc94): undefined reference to `flow_offload_free'

This adds a Kconfig dependency for it.

Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-16 01:19:16 +01:00
Eric W. Biederman
eb5346c379 signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32
The new unified copy_siginfo_from_user32 takes care of this.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 18:01:19 -06:00
Eric W. Biederman
212a36a17e signal: Unify and correct copy_siginfo_from_user32
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.

Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.

In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.

When copying the embedded sigval union copy the si_int member.  That
ensures the 32bit values passes through the kernel unchanged.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:55:59 -06:00
Eric W. Biederman
56b81456f4 signal/blackfin: Remove pointless UID16_SIGINFO_COMPAT_NEEDED
Nothing tests this define so just remove it.

I suspect the intention was to make the uid field in siginfo 16bit
however I can't find any code that ever tested this defined, and
even if it did it the layout has been this way for 8 years so
changing it now would break the ABI with userspace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:36 -06:00
Eric W. Biederman
71ee78d538 signal/blackfin: Move the blackfin specific si_codes to asm-generic/siginfo.h
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later.  To avoid that merge the blackfin specific si_codes
into uapi/asm-generic/siginfo.h

Update copy_siginfo_to_user to copy with the absence of BUS_MCEERR_AR that blackfin
defines to be something else.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:35 -06:00
Eric W. Biederman
753e5a8543 signal/tile: Move the tile specific si_codes to asm-generic/siginfo.h
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later.  To avoid that merge the tile specific si_codes
into uapi/asm-generic/siginfo.h

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:35 -06:00
Eric W. Biederman
8bc9e33848 signal/frv: Move the frv specific si_codes to asm-generic/siginfo.h
Having si_codes in many different files simply encourages duplicate definitions
that can cause problems later.  To avoid that merce the frv specific si_codes
into uapi/asm-generic/siginfo.h

This allows the removal of arch/frv/uapi/include/asm/siginfo.h as the last
last meaningful definition it held was FPE_MDAOVF.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:34 -06:00
Eric W. Biederman
ac54058d77 signal/ia64: Move the ia64 specific si_codes to asm-generic/siginfo.h
Having si_codes in many different files simply encourages duplicate
definitions that can cause problems later.  To avoid that merge the
ia64 specific si_codes into uapi/asm-generic/siginfo.h

Update the sanity checks in arch/x86/kernel/signal_compat.c to expect
the now lager NSIGILL and NSIGFPE.  As nothing excpe the larger count
is exposed on x86 no additional code needs to be updated.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:33 -06:00
Eric W. Biederman
ad2b1ab57d signal/powerpc: Remove redefinition of NSIGTRAP on powerpc
NSIGTRAP is 4 in the generic siginfo and powerpc just undefines
NSGTRAP and redefine it as 4.  That accomplishes nothing so remove
the duplication.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:33 -06:00
Eric W. Biederman
b68a68d3dc signal: Move addr_lsb into the _sigfault union for clarity
The addr_lsb fields is only valid and available when the
signal is SIGBUS and the si_code is BUS_MCEERR_AR or BUS_MCEERR_AO.
Document this with a comment and place the field in the _sigfault union
to make this clear.

All of the fields stay in the same physical location so both the old
and new definitions of struct siginfo will continue to work.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:32 -06:00
Al Viro
b713da69e4 signal: unify compat_siginfo_t
--EWB Added #ifdef CONFIG_X86_X32_ABI to arch/x86/kernel/signal_compat.c
      Changed #ifdef CONFIG_X86_X32 to #ifdef CONFIG_X86_X32_ABI in
      linux/compat.h

      CONFIG_X86_X32 is set when the user requests X32 support.

      CONFIG_X86_X32_ABI is set when the user requests X32 support
      and the tool-chain has X32 allowing X32 support to be built.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-01-15 17:40:31 -06:00
Wolfram Sang
7d2c17f021 i2c: rcar: implement bus recovery
We can force levels of SCL and SDA, so we can use that for bus recovery.
Note that we cannot read SDA back, because we will only get the internal
state of the bus free detection.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:48 +01:00
Wolfram Sang
2806e6ad77 i2c: send STOP after successful bus recovery
If we managed to get a client release SDA again, send a STOP afterwards
to make sure we have a consistent state on the bus again.

Tested-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:42 +01:00
Wolfram Sang
72b08fcc15 i2c: ensure SDA is released in recovery if SDA is controllable
If we have a function to control SDA, we should ensure that SDA is not
held down by us. So, release the GPIO in this case.

Tested-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:30 +01:00
Wolfram Sang
8092178ffe i2c: add 'set_sda' to bus_recovery_info
This will be needed when we want to create STOP conditions, too, later.
Create the needed fields and populate them for the GPIO case if the GPIO
is set to output.

Tested-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:19 +01:00
Wolfram Sang
6c92204e44 i2c: add identifier in declarations for i2c_bus_recovery
No reason to have them undefined, so let's add them.

Tested-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:03 +01:00
Wolfram Sang
766a4f27f3 i2c: make kerneldoc about bus recovery more precise
"Used internally" is vague. What it actually means is that those fields
are populated by the core if valid GPIOs are provided. Change the
comments to reflect that.

Tested-by: Phil Reid <preid@electromag.com.au>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2018-01-16 00:04:02 +01:00