Sometimes the user wants to have device name of the match rather than
just checking if device present or not. To make life easier for such
users introduce acpi_dev_get_first_match_name() helper based on code
for acpi_dev_present().
For example, GPIO driver for Intel Merrifield needs to know the device
name of pin control to be able to apply GPIO mapping table to the proper
device.
To be more consistent with the purpose rename
struct acpi_dev_present_info -> struct acpi_dev_match_info
acpi_dev_present_cb() -> acpi_dev_match_cb()
in the utils.c file.
Tested-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Construct the init thread stack in the linker script rather than doing it
by means of a union so that ia64's init_task.c can be got rid of.
The following symbols are then made available from INIT_TASK_DATA() linker
script macro:
init_thread_union
init_stack
INIT_TASK_DATA() also expands the region to THREAD_SIZE to accommodate the
size of the init stack. init_thread_union is given its own section so that
it can be placed into the stack space in the right order. I'm assuming
that the ia64 ordering is correct and that the task_struct is first and the
thread_info second.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Make THREAD_SIZE available to vmlinux.lds on openrisc by including
asm/thread_info.h the linker script.
This allows init_stack to be allocated in the linker script in a subsequent
patch.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Stafford Horne <shorne@gmail.com>
cc: Jonas Bonn <jonas@southpole.se>
cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
cc: openrisc@lists.librecores.org
If a metadata IO error happens, we report the location of the failed IO
request in units of daddrs. However, the printk message misleads people
into thinking that the units are fs blocks, so fix the reported units.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Make read_c0_prid() use the new constant accessor macros so that it can
potentially be optimised or removed by the compiler. This is
particularly important under virtualisation, where even with hardware
assisted virtualisation (VZ), access to the PRid register may need to be
emulated by the hypervisor.
In particular this helps eliminate the read of the PRid register in the
rather frequently called add_interrupt_randomness() (which calls into
arch/mips/include/asm/timex.h) when the prid is unused but the read
can't be removed due to the inline asm being marked __volatile__.
Reported-by: Yann LeDu <Yann.LeDu@imgtec.com>
Signed-off-by: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Maciej W. Rozycki <macro@mips.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/17923/
Some Cop0 registers are constant and have no side effects when read.
There is no need for the inline asm to read these to be marked
__volatile__, and doing so prevents them from being removed by the
compiler.
Add a few new accessor macros to handle these registers more efficiently
(especially for the sake of running in a guest where redundant access to
the register may trap to the hypervisor):
__read_const_32bit_c0_register()
__read_const_64bit_c0_register()
__read_const_ulong_c0_register()
Signed-off-by: James Hogan <jhogan@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Maciej W. Rozycki <macro@mips.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/17922/
Add Jiaxun Yang as the MIPS/Loongson-2 maintainer and add Huacai Chen
as the MIPS/Loongson-3 maintainer.
[ralf@linux-mips.org: Don't put all of drivers/platform/mips/ into these
two entries but rather only the files required even though at this time
the Loongson platforms are the only users of drivers/platform/mips/.]
Signed-off-by: Huacai Chen <chenhc@lemote.com>
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: James Hogan <james.hogan@mips.com>
Cc: Rui Wang <wangr@lemote.com>
Cc: Binbin Zhou <zhoubb@lemote.com>
Cc: Ce Sun <sunc@lemote.com>
Cc: Yao Wang <wangyao@lemote.com>
Cc: Liangliang Huang <huangll@lemote.com>
Cc: Fuxin Zhang <zhangfx@lemote.com>
Cc: Zhangjin Wu <wuzhangjin@gmail.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: r@hev.cc
Cc: zhoubb.aaron@gmail.com
Cc: huanglllzu@163.com
Cc: 513434146@qq.com
Cc: 1393699660@qq.com
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Cc: Huacai Chen <chenhc@lemote.com>
Patchwork: https://patchwork.linux-mips.org/patch/17888/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: James Hogan <jhogan@kernel.org>
Make THREAD_SIZE available to vmlinux.lds on hexagon by including
asm/thread_info.h the linker script.
This allows init_stack to be allocated in the linker script in a subsequent
patch.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Richard Kuo <rkuo@codeaurora.org>
cc: linux-hexagon@vger.kernel.org
This is needed to ensure that we actually handle timeouts.
Without it, the queue_mode=1 path will never call blk_add_timer(),
and the queue_mode=2 path will continually just return
EH_RESET_TIMER and we never actually complete the offending request.
This was used to test the new timeout code, and the changes around
killing off REQ_ATOM_COMPLETE.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.
A quote from goolge project zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."
To make attacker job harder introduce BPF_JIT_ALWAYS_ON config
option that removes interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64
The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden
v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)
v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
It will be sent when the trees are merged back to net-next
Considered doing:
int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add an extra temporary register parameter to uaccess_ttbr0_enable which
is about to be required for arm64 PAN support.
This patch doesn't introduce any functional change but ensures that the
kernel compiles once the KVM/ARM tree is merged with the arm64 tree by
ensuring a trivially mergable conflict with commit
27a921e757
("arm64: mm: Fix and re-enable ARM64_SW_TTBR0_PAN").
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Several staging directories have TODO files that indicate a
subsystem will be removed in the future.
Using a status entry of "S: Obsolete" helps indicate the
subsystem files should not be modified unnecessarily.
checkpatch also tests this setting and emits a warning that
the matching subsystem files should not be modified.
This might help avoid receiving patches that will be dropped.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's not available if we don't have group io scheduling set, and
there's no need to call it.
Fixes: 0d52af5905 ("block, bfq: release oom-queue ref to root group on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"A set of fixes that should go into this release. This contains:
- An NVMe pull request from Christoph, with a few critical fixes for
NVMe.
- A block drain queue fix from Ming.
- The concurrent lo_open/release fix for loop"
* 'for-linus' of git://git.kernel.dk/linux-block:
loop: fix concurrent lo_open/lo_release
block: drain queue before waiting for q_usage_counter becoming zero
nvme-fcloop: avoid possible uninitialized variable warning
nvme-mpath: fix last path removal during traffic
nvme-rdma: fix concurrent reset and reconnect
nvme: fix sector units when going between formats
nvme-pci: move use_sgl initialization to nvme_init_iod()
Otherwise, architectures that do negated adds of atomics (e.g. s390)
to do atomic_sub fail in closure_set_stopped.
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We should not call edac_mc_del_mc() if a corresponding call to
edac_mc_add_mc() has not been performed yet.
So here, we should go to err instead of err2 to branch at the right
place of the error handling path.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20180107205400.14068-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Borislav Petkov <bp@suse.de>
The functions vfio_mdev_probe, vfio_mdev_remove and the structure
vfio_mdev_driver are only used in this file, so make them static.
Clean up sparse warnings:
drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype
for 'vfio_mdev_probe' [-Wmissing-prototypes]
drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype
for 'vfio_mdev_remove' [-Wmissing-prototypes]
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a malicious filesystem image contains a block+ format directory
wherein the directory inode's core.mode is set such that
S_ISDIR(core.mode) == 0, and if there are subdirectories of the
corrupted directory, an attempt to traverse up the directory tree will
crash the kernel in __xfs_dir3_data_check. Running the online scrub's
parent checks will tend to do this.
The crash occurs because the directory inode's d_ops get set to
xfs_dir[23]_nondir_ops (it's not a directory) but the parent pointer
scrubber's indiscriminate call to xfs_readdir proceeds past the ASSERT
if we have non fatal asserts configured.
Fix the null pointer dereference crash in __xfs_dir3_data_check by
looking for S_ISDIR or wrong d_ops; and teach the parent scrubber
to bail out if it is fed a non-directory "parent".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Instead of always trying to disable HDCP. Only run hdcp_disable when the
state is not UNDESIRED. This will catch cases where it's enabled and
also cases where enable failed and the state is left in DESIRED mode.
Note that things won't blow up if disable is attempted while already
disabled, it's just bad form.
Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Sean Paul <seanpaul@chromium.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20180109185330.16853-1-seanpaul@chromium.org
Useful to identify which network queue is associated with
which vmbus channel.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes to vmbus ABI document including:
- make it clear that relid is numeric value in sub directory
- clarify interrupt mask description
- spelling fixes
- document regions
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 8a99b6ad4d.
Geert doesn't want it going in through the USB tree, ok, whatever...
Cc: Chris Brandt <chris.brandt@renesas.com>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The UIO IRQ handler doesn't need to be called from a tasklet.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The generic UIO mmap should work for us.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The suggested method for configuration does not work with
current kernels. Paths and ids changed.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The trailing semicolon is an empty statement that does no operation.
Removing it since it doesn't do anything.
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The push-pull vs. open-drain are the only supported output modes. The
inputs are always unconditionally equipped with weak pull-downs. That's
the only mode, so there's probably no point in exporting that. I wonder
if it's worthwhile to provide a custom dbg_show method to indicate the
current status of the outputs, though.
This patch and [1] for i2c-gpio together make it possible to bit-bang an
I2C bus over GPIOs of an UART which is connected via SPI :). Yes, this
is crazy, but it's fast enough (while on a 26Mhz SPI HW bus with a
dual-core 1.6GHz CPU) to drive an I2C bus at 200kHz, according to my
scope.
[1] https://patchwork.ozlabs.org/patch/852591/
Signed-off-by: Jan Kundrát <jan.kundrat@cesnet.cz>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Casting a value returned by memory an allocation function is
not required and can be removed. Also add in a newline after before
the first statement. Code clean up as suggested by coccinelle.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 3a025e1d1c ("Add optional check for bad kernel-doc comments")
causes W=1 the kernel-doc script to be run and thereby causes several
new warnings to appear when building the kernel with W=1. Fix the
block layer kernel-doc headers such that the block layer again builds
cleanly with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change "nedeing" into "needing" and "caes" into "cases".
Fixes: commit f906a6a0f4 ("blk-mq: improve tag waiting setup for non-shared tags")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to commit b2157399cc ("bpf: prevent out-of-bounds
speculation") also change the layout of struct bpf_map such that
false sharing of fast-path members like max_entries is avoided
when the maps reference counter is altered. Therefore enforce
them to be placed into separate cachelines.
pahole dump after change:
struct bpf_map {
const struct bpf_map_ops * ops; /* 0 8 */
struct bpf_map * inner_map_meta; /* 8 8 */
void * security; /* 16 8 */
enum bpf_map_type map_type; /* 24 4 */
u32 key_size; /* 28 4 */
u32 value_size; /* 32 4 */
u32 max_entries; /* 36 4 */
u32 map_flags; /* 40 4 */
u32 pages; /* 44 4 */
u32 id; /* 48 4 */
int numa_node; /* 52 4 */
bool unpriv_array; /* 56 1 */
/* XXX 7 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct user_struct * user; /* 64 8 */
atomic_t refcnt; /* 72 4 */
atomic_t usercnt; /* 76 4 */
struct work_struct work; /* 80 32 */
char name[16]; /* 112 16 */
/* --- cacheline 2 boundary (128 bytes) --- */
/* size: 128, cachelines: 2, members: 17 */
/* sum members: 121, holes: 1, sum holes: 7 */
};
Now all entries in the first cacheline are read only throughout
the life time of the map, set up once during map creation. Overall
struct size and number of cachelines doesn't change from the
reordering. struct bpf_map is usually first member and embedded
in map structs in specific map implementations, so also avoid those
members to sit at the end where it could potentially share the
cacheline with first map values e.g. in the array since remote
CPUs could trigger map updates just as well for those (easily
dirtying members like max_entries intentionally as well) while
having subsequent values in cache.
Quoting from Google's Project Zero blog [1]:
Additionally, at least on the Intel machine on which this was
tested, bouncing modified cache lines between cores is slow,
apparently because the MESI protocol is used for cache coherence
[8]. Changing the reference counter of an eBPF array on one
physical CPU core causes the cache line containing the reference
counter to be bounced over to that CPU core, making reads of the
reference counter on all other CPU cores slow until the changed
reference counter has been written back to memory. Because the
length and the reference counter of an eBPF array are stored in
the same cache line, this also means that changing the reference
counter on one physical CPU core causes reads of the eBPF array's
length to be slow on other physical CPU cores (intentional false
sharing).
While this doesn't 'control' the out-of-bounds speculation through
masking the index as in commit b2157399cc, triggering a manipulation
of the map's reference counter is really trivial, so lets not allow
to easily affect max_entries from it.
Splitting to separate cachelines also generally makes sense from
a performance perspective anyway in that fast-path won't have a
cache miss if the map gets pinned, reused in other progs, etc out
of control path, thus also avoids unintentional false sharing.
[1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Heiner Kallweit says:
====================
r8169: improve runtime pm
On my system with two network ports I found that runtime PM didn't
suspend the unused port. Therefore I checked runtime pm in this driver
in somewhat more detail and this series improves runtime pm in general
and solves the mentioned issue.
Tested on a system with RTL8168evl (MAC version 34).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
So far rpm doesn't cover cases like unused ports which are never
brought up. If they are active at probe time they remain in this state.
Included in this patch:
- Let the idle notification check whether we can suspend and let it
schedule the suspend. This way we don't need to have calls to
pm_schedule_suspend in different places.
- At the end of rtl_open and rtl_init_one send an idle notification
to allow suspending if the link is down. If a cable is plugged in
aneg is finished before the suspend timer expires and the suspend
request is cancelled.
- Change rtl8169_runtime_suspend to power down the chip if the
interface is down.
Successfully tested on a RTL8168evl (mac version 34).
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch partially reverts commit e4fbce740f "r8169: Fix runtime
power management" from 2010. At that time the suspend delay was 100ms
and therefore suspending happened during initial aneg. Currently
suspend delay is 5s, so suspend starts after aneg and the issue
doesn't exist any longer. On my system aneg takes almost 3s, to be on
the safe side let's increase the suspend delay to 10s.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverts commit 2a15cd2ff4 "r8169: runtime resume before
shutdown" from 2012. Few months after this change the underlying issue
was solved in the PCI core with commit 3ff2de9ba1 "PCI/PM: Resume
device before shutdown".
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy says:
====================
tipc: improvements to group messaging
We make a number of simplifications and improvements to the group
messaging service. They aim at readability/maintainability of the code
as well as scalability.
The series is based on commit f9c935db80 ("tipc: fix problems with
multipoint-to-point flow control) which has been applied to 'net' but
not yet to 'net-next'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current criteria for returning POLLOUT from a group member socket is
too simplistic. It basically returns POLLOUT as soon as the group has
external destinations, something obviously leading to a lot of spinning
during destination congestion situations. At the same time, the internal
congestion handling is unnecessarily complex.
We now change this as follows.
- We introduce an 'open' flag in struct tipc_group. This flag is used
only to help poll() get the setting of POLLOUT right, and *not* for
congeston handling as such. This means that a user can choose to
ignore an EAGAIN for a destination and go on sending messages to
other destinations in the group if he wants to.
- The flag is set to false every time we return EAGAIN on a send call.
- The flag is set to true every time any member, i.e., not necessarily
the member that caused EAGAIN, is removed from the small_win list.
- We remove the group member 'usr_pending' flag. The size of the send
window and presence in the 'small_win' list is sufficient criteria
for recognizing congestion.
This solution seems to be a reasonable compromise between 'anycast',
which is normally not waiting for POLLOUT for a specific destination,
and the other three send modes, which are.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a member joins a group, it also indicates a binding scope. This
makes it possible to create both node local groups, invisible to other
nodes, as well as cluster global groups, visible everywhere.
In order to avoid that different members end up having permanently
differing views of group size and memberhip, we must inhibit locally
and globally bound members from joining the same group.
We do this by using the binding scope as an additional separator between
groups. I.e., a member must ignore all membership events from sockets
using a different scope than itself, and all lookups for message
destinations must require an exact match between the message's lookup
scope and the potential target's binding scope.
Apart from making it possible to create local groups using the same
identity on different nodes, a side effect of this is that it now also
becomes possible to create a cluster global group with the same identity
across the same nodes, without interfering with the local groups.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when a user is subscribing for binding table publications,
he will receive a PUBLISH event for all already existing matching items
in the binding table.
However, a group socket making a subscriptions doesn't need this initial
status update from the binding table, because it has already scanned it
during the join operation. Worse, the multiplicatory effect of issuing
mutual events for dozens or hundreds group members within a short time
frame put a heavy load on the topology server, with the end result that
scale out operations on a big group tend to take much longer than needed.
We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server
subscriptions, so that this initial avalanche of events is suppressed.
This change, along with the previous commit, significantly improves the
range and speed of group scale out operations.
We keep the new option internal for the tipc driver, at least for now.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a socket is joining a group, we look up in the binding table to
find if there are already other members of the group present. This is
used for being able to return EAGAIN instead of EHOSTUNREACH if the
user proceeds directly to a send attempt.
However, the information in the binding table can be used to directly
set the created member in state MBR_PUBLISHED and send a JOIN message
to the peer, instead of waiting for a topology PUBLISH event to do this.
When there are many members in a group, the propagation time for such
events can be significant, and we can save time during the join
operation if we use the initial lookup result fully.
In this commit, we eliminate the member state MBR_DISCOVERED which has
been the result of the initial lookup, and do instead go directly to
MBR_PUBLISHED, which initiates the setup.
After this change, the tipc_member FSM looks as follows:
+-----------+
---->| PUBLISHED |-----------------------------------------------+
PUB- +-----------+ LEAVE/WITHRAW |
LISH |JOIN |
| +-------------------------------------------+ |
| | LEAVE/WITHDRAW | |
| | +------------+ | |
| | +----------->| PENDING |---------+ | |
| | |msg/maxactv +-+---+------+ LEAVE/ | | |
| | | | | WITHDRAW | | |
| | | +----------+ | | | |
| | | |revert/maxactv| | | |
| | | V V V V V
| +----------+ msg +------------+ +-----------+
+-->| JOINED |------>| ACTIVE |------>| LEAVING |--->
| +----------+ +--- -+------+ LEAVE/+-----------+DOWN
| A A | WITHDRAW A A A EVT
| | | |RECLAIM | | |
| | |REMIT V | | |
| | |== adv +------------+ | | |
| | +---------| RECLAIMING |--------+ | |
| | +-----+------+ LEAVE/ | |
| | |REMIT WITHDRAW | |
| | |< adv | |
| |msg/ V LEAVE/ | |
| |adv==ADV_IDLE+------------+ WITHDRAW | |
| +-------------| REMITTED |------------+ |
| +------------+ |
|PUBLISH |
JOIN +-----------+ LEAVE/WITHDRAW |
---->| JOINING |-----------------------------------------------+
+-----------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the changes in the previous commit the group LEAVE sequence
can be simplified.
We now let the arrival of a LEAVE message unconditionally issue a group
DOWN event to the user. When a topology WITHDRAW event is received, the
member, if it still there, is set to state LEAVING, but we only issue a
group DOWN event when the link to the peer node is gone, so that no
LEAVE message is to be expected.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current implementation, a group socket receiving topology
events about other members just converts the topology event message
into a group event message and stores it until it reaches the right
state to issue it to the user. This complicates the code unnecessarily,
and becomes impractical when we in the coming commits will need to
create and issue membership events independently.
In this commit, we change this so that we just notice the type and
origin of the incoming topology event, and then drop the buffer. Only
when it is time to actually send a group event to the user do we
explicitly create a new message and send it upwards.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Analysis reveals that the member state MBR_QURANTINED in reality is
unnecessary, and can be replaced by the state MBR_JOINING at all
occurrencs.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We handle a corner case in the function tipc_group_update_rcv_win().
During extreme pessure it might happen that a message receiver has all
its active senders in RECLAIMING or REMITTED mode, meaning that there
is nobody to reclaim advertisements from if an additional sender tries
to go active.
Currently we just set the new sender to ACTIVE anyway, hence at least
theoretically opening up for a receiver queue overflow by exceeding the
MAX_ACTIVE limit. The correct solution to this is to instead add the
member to the pending queue, while letting the oldest member in that
queue revert to JOINED state.
In this commit we refactor the code for handling message arrival from
a JOINED member, both to make it more comprehensible and to cover the
case described above.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- We remove the 'reclaiming' member list in struct tipc_group, since
it doesn't serve any purpose.
- We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv().
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current code, when creating a new fib6 table, tb6_root.leaf gets
initialized to net->ipv6.ip6_null_entry.
If a default route is being added with rt->rt6i_metric = 0xffffffff,
fib6_add() will add this route after net->ipv6.ip6_null_entry. As
null_entry is shared, it could cause problem.
In order to fix it, set fn->leaf to NULL before calling
fib6_add_rt2node() when trying to add the first default route.
And reset fn->leaf to null_entry when adding fails or when deleting the
last default route.
syzkaller reported the following issue which is fixed by this commit:
WARNING: suspicious RCU usage
4.15.0-rc5+ #171 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1702 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
4 locks held by swapper/0/0:
#0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
#0: ((&net->ipv6.ip6_fib_timer)){+.-.}, at: [<00000000d43f631b>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1310
#1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
#1: (&(&net->ipv6.fib6_gc_lock)->rlock){+.-.}, at: [<000000002ff9d65c>] fib6_run_gc+0x9d/0x3c0 net/ipv6/ip6_fib.c:2007
#2: (rcu_read_lock){....}, at: [<0000000091db762d>] __fib6_clean_all+0x0/0x3a0 net/ipv6/ip6_fib.c:1560
#3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] spin_lock_bh include/linux/spinlock.h:315 [inline]
#3: (&(&tb->tb6_lock)->rlock){+.-.}, at: [<000000009e503581>] __fib6_clean_all+0x1d0/0x3a0 net/ipv6/ip6_fib.c:1948
stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc5+ #171
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4585
fib6_del+0xcaa/0x11b0 net/ipv6/ip6_fib.c:1701
fib6_clean_node+0x3aa/0x4f0 net/ipv6/ip6_fib.c:1892
fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1815
fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1863
fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1933
__fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1949
fib6_clean_all net/ipv6/ip6_fib.c:1960 [inline]
fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2016
fib6_gc_timer_cb+0x20/0x30 net/ipv6/ip6_fib.c:2033
call_timer_fn+0x228/0x820 kernel/time/timer.c:1320
expire_timers kernel/time/timer.c:1357 [inline]
__run_timers+0x7ee/0xb70 kernel/time/timer.c:1660
run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1686
__do_softirq+0x2d7/0xb85 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1cc/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:540 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:904
</IRQ>
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov says:
====================
Ether fixes for the SolutionEngine771x boards
Here's the series of 2 patches against Linus' repo. This series should
(hoplefully) fix the Ether support on the SolutionEngine771x boards...
[1/2] SolutionEngine771x: fix Ether platform data
[2/2] SolutionEngine771x: add Ether TSU resource
====================
Signed-off-by: David S. Miller <davem@davemloft.net>