linux-xiaomi-chiron

Author	SHA1	Message	Date
Thomas Gleixner	3d4775df0a	futex: Replace PF_EXITPIDONE with a state The futex exit handling relies on PF_ flags. That's suboptimal as it requires a smp_mb() and an ugly lock/unlock of the exiting tasks pi_lock in the middle of do_exit() to enforce the observability of PF_EXITING in the futex code. Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic over to the new state. The PF_EXITING dependency will be cleaned up in a later step. This prepares for handling various futex exit issues later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de	2019-11-20 09:40:07 +01:00
Thomas Gleixner	ba31c1a485	futex: Move futex exit handling into futex code The futex exit handling is #ifdeffed into mm_release() which is not pretty to begin with. But upcoming changes to address futex exit races need to add more functionality to this exit code. Split it out into a function, move it into futex code and make the various futex exit functions static. Preparatory only and no functional change. Folded build fix from Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de	2019-11-20 09:40:07 +01:00
Ingo Molnar	8e1d58ae0c	Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/kcsan Pull the KCSAN subsystem from Paul E. McKenney: "This pull request contains base kernel concurrency sanitizer (KCSAN) enablement for x86, courtesy of Marco Elver. KCSAN is a sampling watchpoint-based data-race detector, and is documented in Documentation/dev-tools/kcsan.rst. KCSAN was announced in September, and much feedback has since been incorporated: http://lkml.kernel.org/r/CANpmjNPJ_bHjfLZCAPV23AXFfiPiyXXqqu72n6TgWzb2Gnu1eA@mail.gmail.com The data races located thus far have resulted in a number of fixes: https://github.com/google/ktsan/wiki/KCSAN#upstream-fixes-of-data-races-found-by-kcsan Additional information may be found here: https://lore.kernel.org/lkml/20191114180303.66955-1-elver@google.com/ " Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-11-19 19:56:28 +01:00
Dan Williams	e755799aef	libnvdimm: Move nvdimm_bus_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nvdimm_bus_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com	2019-11-19 09:52:12 -08:00
Dan Williams	360eba7ebd	libnvdimm: Move nvdimm_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nvdimm_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309903201.1582359.10966209746585062329.stgit@dwillia2-desk3.amr.corp.intel.com	2019-11-19 09:52:12 -08:00
Dan Williams	4ce79fa97e	libnvdimm: Move nd_mapping_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_mapping_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309902686.1582359.6749533709859492704.stgit@dwillia2-desk3.amr.corp.intel.com	2019-11-19 09:52:12 -08:00
Dan Williams	7c4fc8cde1	libnvdimm: Move nd_region_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_region_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309902169.1582359.16828508538444551337.stgit@dwillia2-desk3.amr.corp.intel.com	2019-11-19 09:52:12 -08:00
Dan Williams	e2f6a0e348	libnvdimm: Move nd_numa_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_numa_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2019-11-19 09:51:54 -08:00
Daniel Campello	119a3cb6d6	platform/chrome: wilco_ec: Add keyboard backlight LED support The EC is in charge of controlling the keyboard backlight on the Wilco platform. We expose a standard LED class device named platform::kbd_backlight. Since the EC will never change the backlight level of its own accord, we don't need to implement a brightness_get() method. Signed-off-by: Nick Crews <ncrews@chromium.org> Signed-off-by: Daniel Campello <campello@chromium.org> Reviewed-by: Daniel Campello <campello@chromium.org> Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>	2019-11-19 18:12:19 +01:00
Nick Crews	3c4d77b689	platform/chrome: wilco_ec: Add charging config driver Add a device to control the charging algorithm used on Wilco devices, which will be picked up by the drivers/power/supply/wilco-charger.c driver. See Documentation/ABI/testing/sysfs-class-power-wilco for the userspace interface and other info. Signed-off-by: Nick Crews <ncrews@chromium.org> Signed-off-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>	2019-11-19 18:12:19 +01:00
Thomas Zimmermann	f597c2089d	fbdev: Unexport unlink_framebuffer() There are no external callers of unlink_framebuffer() left. Make the function an internal interface. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Noralf Trønnes <noralf@tronnes.org> Link: https://patchwork.freedesktop.org/patch/msgid/20191114125106.28347-4-tzimmermann@suse.de	2019-11-19 14:38:01 +01:00
Rafael J. Wysocki	cbda56d5fe	cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks Commit `99e98d3fb1` ("cpuidle: Consolidate disabled state checks") overlooked the fact that the imx6q and tegra20 cpuidle drivers use the "disabled" field in struct cpuidle_state for quirks which trigger after the initialization of cpuidle, so reading the initial value of that field is not sufficient for those drivers. In order to allow them to implement the quirks without using the "disabled" field in struct cpuidle_state, introduce a new helper function and modify them to use it. Fixes: `99e98d3fb1` ("cpuidle: Consolidate disabled state checks") Reported-by: Len Brown <lenb@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2019-11-19 10:35:13 +01:00
Russell King	298e54fa81	net: phy: add core phylib sfp support Add core phylib help for supporting SFP sockets on PHYs. This provides a mechanism to inform the SFP layer about PHY up/down events, and also unregister the SFP bus when the PHY is going away. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-18 16:56:13 -08:00
Jonathan Corbet	5ca470a0c3	docs: Add request_irq() documentation While checking the results of the :c:func: removal, I noticed that there was no documentation for request_irq(), and request_threaded_irq() was not mentioned at all. Add a kerneldoc comment for request_irq() and add request_threaded_irq() to the list of functions. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2019-11-18 12:40:59 -07:00
Luhua Xu	ae7c2d342a	spi: mediatek: add SPI_CS_HIGH support Change to use SPI_CS_HIGH to support spi CS polarity setting for chips support enhance_timing. Signed-off-by: Luhua Xu <luhua.xu@mediatek.com> Link: https://lore.kernel.org/r/1574053037-26721-2-git-send-email-luhua.xu@mediatek.com Signed-off-by: Mark Brown <broonie@kernel.org>	2019-11-18 17:47:00 +00:00
Steven Rostedt (VMware)	ea806eb3ea	ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization If a direct ftrace callback is at a location that does not have any other ftrace helpers attached to it, it is possible to simply just change the text to call the new caller (if the architecture supports it). But this requires special architecture code. Currently, modify_ftrace_direct() uses a trick to add a stub ftrace callback to the location forcing it to call the ftrace iterator. Then it can change the direct helper to call the new function in C, and then remove the stub. Removing the stub will have the location now call the new location that the direct helper is using. The new helper function does the registering the stub trick, but is a weak function, allowing an architecture to override it to do something a bit more direct. Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>	2019-11-18 11:42:09 -05:00
Tejun Heo	496074f94b	blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1 Currently, cgroup rstat is supported only on cgroup2 hierarchy and rstat functions shouldn't be called on cgroup1 cgroups. While converting blk-cgroup core statistics to rstat, `f733164829` ("blk-cgroup: reimplement basic IO stats using cgroup rstat") accidentally ended up calling cgroup_rstat_updated() on cgroup1 cgroups causing crashes. Longer term, we probably should add cgroup1 support to rstat but for now let's mask the call directly. Fixes: `f733164829` ("blk-cgroup: reimplement basic IO stats using cgroup rstat") Tested-by: Faiz Abbas <faiz_abbas@ti.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-18 08:40:41 -07:00
Ingo Molnar	b21feab0b8	Linux 5.4-rc8 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl3RzgkeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGN18H/0JZbfIpy8/4Irol 0va7Aj2fBi1a5oxfqYsMKN0u3GKbN3OV9tQ+7w1eBNGvL72TGadgVTzTY+Im7A9U UjboAc7jDPCG+YhIwXFufMiIAq5jDIj6h0LDas7ALsMfsnI/RhTwgNtLTAkyI3dH YV/6ljFULwueJHCxzmrYbd1x39PScj3kCNL2pOe6On7rXMKOemY/nbbYYISxY30E GMgKApSS+li7VuSqgrKoq5Qaox26LyR2wrXB1ij4pqEJ9xgbnKRLdHuvXZnE+/5p 46EMirt+yeSkltW3d2/9MoCHaA76ESzWMMDijLx7tPgoTc3RB3/3ZLsm3rYVH+cR cRlNNSk= =0+Cg -----END PGP SIGNATURE----- Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependencies Signed-off-by: Ingo Molnar <mingo@kernel.org>	2019-11-18 14:41:02 +01:00
Andrii Nakryiko	fc9702273e	bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY Add ability to memory-map contents of BPF array map. This is extremely useful for working with BPF global data from userspace programs. It allows to avoid typical bpf_map_{lookup,update}_elem operations, improving both performance and usability. There had to be special considerations for map freezing, to avoid having writable memory view into a frozen map. To solve this issue, map freezing and mmap-ing is happening under mutex now: - if map is already frozen, no writable mapping is allowed; - if map has writable memory mappings active (accounted in map->writecnt), map freezing will keep failing with -EBUSY; - once number of writable memory mappings drops to zero, map freezing can be performed again. Only non-per-CPU plain arrays are supported right now. Maps with spinlocks can't be memory mapped either. For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc() to be mmap()'able. We also need to make sure that array data memory is page-sized and page-aligned, so we over-allocate memory in such a way that struct bpf_array is at the end of a single page of memory with array->value being aligned with the start of the second page. On deallocation we need to accomodate this memory arrangement to free vmalloc()'ed memory correctly. One important consideration regarding how memory-mapping subsystem functions. Memory-mapping subsystem provides few optional callbacks, among them open() and close(). close() is called for each memory region that is unmapped, so that users can decrease their reference counters and free up resources, if necessary. open() is almost symmetrical: it's called for each memory region that is being mapped, except the very first one. So bpf_map_mmap does initial refcnt bump, while open() will do any extra ones after that. Thus number of close() calls is equal to number of open() calls plus one more. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com	2019-11-18 11:41:59 +01:00
Andrii Nakryiko	85192dbf4d	bpf: Convert bpf_prog refcnt to atomic64_t Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64 and remove artificial 32k limit. This allows to make bpf_prog's refcounting non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc. Validated compilation by running allyesconfig kernel build. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com	2019-11-18 11:41:59 +01:00
Andrii Nakryiko	1e0bd5a091	bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails `92117d8443` ("bpf: fix refcnt overflow") turned refcounting of bpf_map into potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit (32k). Due to using 32-bit counter, it's possible in practice to overflow refcounter and make it wrap around to 0, causing erroneous map free, while there are still references to it, causing use-after-free problems. But having a failing refcounting operations are problematic in some cases. One example is mmap() interface. After establishing initial memory-mapping, user is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily splitting it into multiple non-contiguous regions. All this happening without any control from the users of mmap subsystem. Rather mmap subsystem sends notifications to original creator of memory mapping through open/close callbacks, which are optionally specified during initial memory mapping creation. These callbacks are used to maintain accurate refcount for bpf_map (see next patch in this series). The problem is that open() callback is not supposed to fail, because memory-mapped resource is set up and properly referenced. This is posing a problem for using memory-mapping with BPF maps. One solution to this is to maintain separate refcount for just memory-mappings and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively. There are similar use cases in current work on tcp-bpf, necessitating extra counter as well. This seems like a rather unfortunate and ugly solution that doesn't scale well to various new use cases. Another approach to solve this is to use non-failing refcount_t type, which uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX, stays there. This utlimately causes memory leak, but prevents use after free. But given refcounting is not the most performance-critical operation with BPF maps (it's not used from running BPF program code), we can also just switch to 64-bit counter that can't overflow in practice, potentially disadvantaging 32-bit platforms a tiny bit. This simplifies semantics and allows above described scenarios to not worry about failing refcount increment operation. In terms of struct bpf_map size, we are still good and use the same amount of space: BEFORE (3 cache lines, 8 bytes of padding at the end): struct bpf_map { const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 / struct bpf_map inner_map_meta; /* 8 8 / void security; /* 16 8 / enum bpf_map_type map_type; / 24 4 / u32 key_size; / 28 4 / u32 value_size; / 32 4 / u32 max_entries; / 36 4 / u32 map_flags; / 40 4 / int spin_lock_off; / 44 4 / u32 id; / 48 4 / int numa_node; / 52 4 / u32 btf_key_type_id; / 56 4 / u32 btf_value_type_id; / 60 4 / / --- cacheline 1 boundary (64 bytes) --- / struct btf btf; /* 64 8 / struct bpf_map_memory memory; / 72 16 / bool unpriv_array; / 88 1 / bool frozen; / 89 1 / / XXX 38 bytes hole, try to pack / / --- cacheline 2 boundary (128 bytes) --- / atomic_t refcnt __attribute__((__aligned__(64))); / 128 4 / atomic_t usercnt; / 132 4 / struct work_struct work; / 136 32 / char name[16]; / 168 16 / / size: 192, cachelines: 3, members: 21 / / sum members: 146, holes: 1, sum holes: 38 / / padding: 8 / / forced alignments: 2, forced holes: 1, sum forced holes: 38 / } __attribute__((__aligned__(64))); AFTER (same 3 cache lines, no extra padding now): struct bpf_map { const struct bpf_map_ops ops __attribute__((__aligned__(64))); /* 0 8 / struct bpf_map inner_map_meta; /* 8 8 / void security; /* 16 8 / enum bpf_map_type map_type; / 24 4 / u32 key_size; / 28 4 / u32 value_size; / 32 4 / u32 max_entries; / 36 4 / u32 map_flags; / 40 4 / int spin_lock_off; / 44 4 / u32 id; / 48 4 / int numa_node; / 52 4 / u32 btf_key_type_id; / 56 4 / u32 btf_value_type_id; / 60 4 / / --- cacheline 1 boundary (64 bytes) --- / struct btf btf; /* 64 8 / struct bpf_map_memory memory; / 72 16 / bool unpriv_array; / 88 1 / bool frozen; / 89 1 / / XXX 38 bytes hole, try to pack / / --- cacheline 2 boundary (128 bytes) --- / atomic64_t refcnt __attribute__((__aligned__(64))); / 128 8 / atomic64_t usercnt; / 136 8 / struct work_struct work; / 144 32 / char name[16]; / 176 16 / / size: 192, cachelines: 3, members: 21 / / sum members: 154, holes: 1, sum holes: 38 / / forced alignments: 2, forced holes: 1, sum forced holes: 38 */ } __attribute__((__aligned__(64))); This patch, while modifying all users of bpf_map_inc, also cleans up its interface to match bpf_map_put with separate operations for bpf_map_inc and bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref, respectively). Also, given there are no users of bpf_map_inc_not_zero specifying uref=true, remove uref flag and default to uref=false internally. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com	2019-11-18 11:41:59 +01:00
Bradley Bolen	f3d7c2292d	mmc: core: Fix size overflow for mmc partitions With large eMMC cards, it is possible to create general purpose partitions that are bigger than 4GB. The size member of the mmc_part struct is only an unsigned int which overflows for gp partitions larger than 4GB. Change this to a u64 to handle the overflow. Signed-off-by: Bradley Bolen <bradleybolen@gmail.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>	2019-11-18 10:05:38 +01:00
Linus Torvalds	ec53851967	IOMMU Fixes for Linux v5.4-rc7 Including: - Fix for Intel IOMMU to correct invalidation commands when in SVA mode. - Update MAINTAINERS entry for Intel IOMMU -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl3RlmIACgkQK/BELZcB GuPHvQ//c2eWE1ZOLijbQdfihUCF3JJy+greQiofS8BhIZ2gmMEuZYxcs5U16Cpm LRjLPUP+/yTjZavV9dJtz4JSCuNpsn7v+kuhMiZJV1ol2HABbaGHf13203fbkg2f E9KQscWeya0/R7u3DkdNdF64xyys9HJebzqdhpPx8RddWNW08qr5OLENkrHOzMi+ aVvRmKNFsi671oXZ4iW8myBiu5r2pObDHoSuhPBQXQ+aYs/0TLqGx3LG6lLM+V20 sWd0qwo+bfFUpSMXWQxYcQ7aTWQxO9GJutJMIVpKqOhwzGRXp6+F6DJzVsILiYHO fkoQ1JUBadms5lwniHiVqfZPd1hR6RqHuA8NStpJQbOknFGm0+lJPPZeJLjZuaww YwCO4zCzgavRFQOXcLP4eKVV4wszpCgjlJuD8d0VC1lSYF0ZMEHYuvIciDYlJjAn 3FeqEqxIzF7mQKryST1nJGhsC/dKTOrxogxlSaTgK3aAw0Ekz+lpSTedD8hseY+D ShaUv6Ct2hWlRfqFfWJu0Y/utvirsi6mLxotfMqD+vdZDp7ap/Z+jxLkOVC+Dl36 /SCYb1u2aCsRyofD2mlxUE7mEFKdAcO4YCw3k+ezdJuWO1ikTO0z9RAdw14cEtjZ R+lWTPjb78Z9QyorCCfrYcMjkbq4PBjnx3vOFQNynf8RVM9BKo0= =5/bw -----END PGP SIGNATURE----- Merge tag 'iommu-fixes-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: - Fix for Intel IOMMU to correct invalidation commands when in SVA mode. - Update MAINTAINERS entry for Intel IOMMU * tag 'iommu-fixes-v5.4-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/vt-d: Fix QI_DEV_IOTLB_PFSID and QI_DEV_EIOTLB_PFSID macros MAINTAINERS: Update for INTEL IOMMU (VT-d) entry	2019-11-17 11:27:44 -08:00
Dan Williams	adbb68293f	libnvdimm: Move nd_device_attribute_group to device_type A 'struct device_type' instance can carry default attributes for the device. Use this facility to remove the export of nd_device_attribute_group and put the responsibility on the core rather than leaf implementations to define this attribute. For regions this creates a new nd_region_attribute_groups[] added to the per-region device-type instances. Cc: Ira Weiny <ira.weiny@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2019-11-17 09:17:39 -08:00
David S. Miller	19b7e21c55	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Lots of overlapping changes and parallel additions, stuff like that. Signed-off-by: David S. Miller <davem@davemloft.net>	2019-11-16 21:51:42 -08:00
Sebastian Andrzej Siewior	9e8d42a0f7	percpu-refcount: Use normal instead of RCU-sched" This is a revert of commit `a4244454df` ("percpu-refcount: use RCU-sched insted of normal RCU") which claims the only reason for using RCU-sched is "rcu_read_[un]lock() … are slightly more expensive than preempt_disable/enable()" and "As the RCU critical sections are extremely short, using sched-RCU shouldn't have any latency implications." The problem with using RCU-sched here is that it disables preemption and the release callback (called from percpu_ref_put_many()) must not acquire any sleeping locks like spinlock_t. This breaks PREEMPT_RT because some of the users acquire spinlock_t locks in their callbacks. Using rcu_read_lock() on PREEMPTION=n kernels is not any different compared to rcu_read_lock_sched(). On PREEMPTION=y kernels there are already performance issues due to additional preemption points. Looking at the code, the rcu_read_lock() is just an increment and unlock is almost just a decrement unless there is something special to do. Both are functions while disabling preemption is inlined. Doing a small benchmark, the minimal amount of time required was mostly the same. The average time required was higher due to the higher MAX value (which could be preemption). With DEBUG_PREEMPT=y it is rcu_read_lock_sched() that takes a little longer due to the additional debug code. Convert back to normal RCU. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Dennis Zhou <dennis@kernel.org>	2019-11-16 20:02:47 -08:00
Ard Biesheuvel	d63007eb95	crypto: ablkcipher - remove deprecated and unused ablkcipher support Now that all users of the deprecated ablkcipher interface have been moved to the skcipher interface, ablkcipher is no longer used and can be removed. Reviewed-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2019-11-17 09:02:49 +08:00
Linus Torvalds	8be636dd8a	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from David Miller: 1) Fix memory leak in xfrm_state code, from Steffen Klassert. 2) Fix races between devlink reload operations and device setup/cleanup, from Jiri Pirko. 3) Null deref in NFC code, from Stephan Gerhold. 4) Refcount fixes in SMC, from Ursula Braun. 5) Memory leak in slcan open error paths, from Jouni Hogander. 6) Fix ETS bandwidth validation in hns3, from Yonglong Liu. 7) Info leak on short USB request answers in ax88172a driver, from Oliver Neukum. 8) Release mem region properly in ep93xx_eth, from Chuhong Yuan. 9) PTP config timestamp flags validation, from Richard Cochran. 10) Dangling pointers after SKB data realloc in seg6, from Andrea Mayer. 11) Missing free_netdev() in gemini driver, from Chuhong Yuan. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (56 commits) ipmr: Fix skb headroom in ipmr_get_route(). net: hns3: cleanup of stray struct hns3_link_mode_mapping net/smc: fix fastopen for non-blocking connect() rds: ib: update WR sizes when bringing up connection net: gemini: add missed free_netdev net: dsa: tag_8021q: Fix dsa_8021q_restore_pvid for an absent pvid seg6: fix skb transport_header after decap_and_validate() seg6: fix srh pointer in get_srh() net: stmmac: Use the correct style for SPDX License Identifier octeontx2-af: Use the correct style for SPDX License Identifier ptp: Extend the test program to check the external time stamp flags. mlx5: Reject requests to enable time stamping on both edges. igb: Reject requests that fail to enable time stamping on both edges. dp83640: Reject requests to enable time stamping on both edges. mv88e6xxx: Reject requests to enable time stamping on both edges. ptp: Introduce strict checking of external time stamp options. renesas: reject unsupported external timestamp flags mlx5: reject unsupported external timestamp flags igb: reject unsupported external timestamp flags dp83640: reject unsupported external timestamp flags ...	2019-11-16 15:52:00 -08:00
Marco Elver	bf07132f96	seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrier This patch proposes to require marked atomic accesses surrounding raw_write_seqcount_barrier. We reason that otherwise there is no way to guarantee propagation nor atomicity of writes before/after the barrier [1]. For example, consider the compiler tears stores either before or after the barrier; in this case, readers may observe a partial value, and because readers are unaware that writes are going on (writes are not in a seq-writer critical section), will complete the seq-reader critical section while having observed some partial state. [1] https://lwn.net/Articles/793253/ This came up when designing and implementing KCSAN, because KCSAN would flag these accesses as data-races. After careful analysis, our reasoning as above led us to conclude that the best thing to do is to propose an amendment to the raw_seqcount_barrier usage. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-11-16 07:23:15 -08:00
Marco Elver	88ecd153be	seqlock, kcsan: Add annotations for KCSAN Since seqlocks in the Linux kernel do not require the use of marked atomic accesses in critical sections, we teach KCSAN to assume such accesses are atomic. KCSAN currently also pretends that writes to `sequence` are atomic, although currently plain writes are used (their corresponding reads are READ_ONCE). Further, to avoid false positives in the absence of clear ending of a seqlock reader critical section (only when using the raw interface), KCSAN assumes a fixed number of accesses after start of a seqlock critical section are atomic. === Commentary on design around absence of clear begin/end markings === Seqlock usage via seqlock_t follows a predictable usage pattern, where clear critical section begin/end is enforced. With subtle special cases for readers needing to be flat atomic regions, e.g. because usage such as in: - fs/namespace.c:__legitimize_mnt - unbalanced read_seqretry - fs/dcache.c:d_walk - unbalanced need_seqretry But, anything directly accessing seqcount_t seems to be unpredictable. Filtering for usage of read_seqcount_retry not following 'do { .. } while (read_seqcount_retry(..));': $ git grep 'read_seqcount_retry' \| grep -Ev 'while \(\|seqlock.h\|Doc\|\* ' => about 1/3 of the total read_seqcount_retry usage. Just looking at fs/namei.c, we conclude that it is non-trivial to prescribe and migrate to an interface that would force clear begin/end seqlock markings for critical sections. As such, we concluded that the best design currently, is to simply ensure that KCSAN works well with the existing code. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-11-16 07:23:15 -08:00
Marco Elver	c48981eeb0	include/linux/compiler.h: Introduce data_race(expr) macro This introduces the data_race(expr) macro, which can be used to annotate expressions for purposes of (1) documenting, and (2) giving tooling such as KCSAN information about which data races are deemed "safe". More context: http://lkml.kernel.org/r/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com Signed-off-by: Marco Elver <elver@google.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Eric Dumazet <edumazet@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-11-16 07:23:13 -08:00
Marco Elver	dfd402a4c4	kcsan: Add Kernel Concurrency Sanitizer infrastructure Kernel Concurrency Sanitizer (KCSAN) is a dynamic data-race detector for kernel space. KCSAN is a sampling watchpoint-based data-race detector. See the included Documentation/dev-tools/kcsan.rst for more details. This patch adds basic infrastructure, but does not yet enable KCSAN for any architecture. Signed-off-by: Marco Elver <elver@google.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2019-11-16 07:23:13 -08:00
Hans de Goede	a079973f46	usb: typec: tcpm: Remove tcpc_config configuration mechanism All configuration can and should be done through fwnodes instead of through the tcpc_config struct and there are no existing users left of struct tcpc_config, so lets remove it. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20191114111840.40876-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-11-16 14:45:31 +01:00
Lina Iyer	e35a6ae0eb	pinctrl/msm: Setup GPIO chip in hierarchy Some GPIOs are marked as wakeup capable and are routed to another interrupt controller that is an always-domain and can detect interrupts even when most of the SoC is powered off. The wakeup interrupt controller wakes up the GIC and replays the interrupt at the GIC. Setup the TLMM irqchip in hierarchy with the wakeup interrupt controller and ensure the wakeup GPIOs are handled correctly. Co-developed-by: Maulik Shah <mkshah@codeaurora.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-9-git-send-email-ilina@codeaurora.org ---- Changes in v2: - Address review comments - Fix Co-developed-by tag Changes in v1: - Address minor review comments - Remove redundant call to set irq handler - Move irq_domain_qcom_handle_wakeup() to this patch Changes in RFC v2: - Rebase on top of GPIO hierarchy support in linux-next - Set the chained irq handler for summary line	2019-11-16 10:23:15 +00:00
Lina Iyer	81ef8bf880	irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOs Introduce a new domain for wakeup capable GPIOs. The domain can be requested using the bus token DOMAIN_BUS_WAKEUP. In the following patches, we will specify PDC as the wakeup-parent for the TLMM GPIO irqchip. Requesting a wakeup GPIO will setup the GPIO and the corresponding PDC interrupt as its parent. Co-developed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-5-git-send-email-ilina@codeaurora.org	2019-11-16 10:21:15 +00:00
Maulik Shah	4a169a95d8	genirq: Introduce irq_chip_get/set_parent_state calls On certain QTI chipsets some GPIOs are direct-connect interrupts to the GIC to be used as regular interrupt lines. When the GPIOs are not used for interrupt generation the interrupt line is disabled. But disabling the interrupt at GIC does not prevent the interrupt to be reported as pending at GIC_ISPEND. Later, when drivers call enable_irq() on the interrupt, an unwanted interrupt occurs. Introduce get and set methods for irqchip's parent to clear it's pending irq state. This then can be invoked by the GPIO interrupt controller on the parents in it hierarchy to clear the interrupt before enabling the interrupt. Signed-off-by: Maulik Shah <mkshah@codeaurora.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-7-git-send-email-ilina@codeaurora.org [updated commit text and minor code fixes]	2019-11-16 10:20:02 +00:00
Lina Iyer	d46bca2b5d	irqdomain: Add bus token DOMAIN_BUS_WAKEUP A single controller can handle normal interrupts and wake-up interrupts independently, with a different numbering space. It is thus crucial to allow the driver for such a controller discriminate between the two. A simple way to do so is to tag the wake-up irqdomain with a "bus token" that indicates the wake-up domain. This slightly abuses the notion of bus, but also radically simplifies the design of such a driver. Between two evils, we choose the least damaging. Suggested-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Lina Iyer <ilina@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Link: https://lore.kernel.org/r/1573855915-9841-2-git-send-email-ilina@codeaurora.org	2019-11-16 10:18:52 +00:00
David Hildenbrand	2c91f8fc6c	mm/memory_hotplug: fix try_offline_node() try_offline_node() is pretty much broken right now: - The node span is updated when onlining memory, not when adding it. We ignore memory that was mever onlined. Bad. - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily trigger a kernel panic. Bad for memory that is offline but also bad for subsection hotadd with ZONE_DEVICE, whereby the memmap of the first PFN of a section might contain garbage. - Sections belonging to mixed nodes are not properly considered. As memory blocks might belong to multiple nodes, we would have to walk all pageblocks (or at least subsections) within present sections. However, we don't have a way to identify whether a memmap that is not online was initialized (relevant for ZONE_DEVICE). This makes things more complicated. Luckily, we can piggy pack on the node span and the nid stored in memory blocks. Currently, the node span is grown when calling move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when removing memory, before calling try_offline_node(). Sysfs links are created via link_mem_sections(), e.g., during boot or when adding memory. If the node still spans memory or if any memory block belongs to the nid, we don't set the node offline. As memory blocks that span multiple nodes cannot get offlined, the nid stored in memory blocks is reliable enough (for such online memory blocks, the node still spans the memory). Introduce for_each_memory_block() to efficiently walk all memory blocks. Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span when removing ZONE_DEVICE memory to fix similar issues (access of garbage memmaps) - until we have a reliable way to identify whether these memmaps were properly initialized. This implies later, that once a node had ZONE_DEVICE memory, we won't be able to set a node offline - which should be acceptable. Since commit `f1dd2cd13c` ("mm, memory_hotplug: do not associate hotadded memory to zones until online") memory that is added is not assoziated with a zone/node (memmap not initialized). The introducing commit `60a5a19e74` ("memory-hotplug: remove sysfs file of node") already missed that we could have multiple nodes for a section and that the zone/node span is updated when onlining pages, not when adding them. I tested this by hotplugging two DIMMs to a memory-less and cpu-less NUMA node. The node is properly onlined when adding the DIMMs. When removing the DIMMs, the node is properly offlined. Masayoshi Mizuma reported: : Without this patch, memory hotplug fails as panic: : : BUG: kernel NULL pointer dereference, address: 0000000000000000 : ... : Call Trace: : remove_memory_block_devices+0x81/0xc0 : try_remove_memory+0xb4/0x130 : __remove_memory+0xa/0x20 : acpi_memory_device_remove+0x84/0x100 : acpi_bus_trim+0x57/0x90 : acpi_bus_trim+0x2e/0x90 : acpi_device_hotplug+0x2b2/0x4d0 : acpi_hotplug_work_fn+0x1a/0x30 : process_one_work+0x171/0x380 : worker_thread+0x49/0x3f0 : kthread+0xf8/0x130 : ret_from_fork+0x35/0x40 [david@redhat.com: v3] Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com Fixes: `60a5a19e74` ("memory-hotplug: remove sysfs file of node") Fixes: `f1dd2cd13c` ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visiable after `d0dc12e86b` Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Nayna Jain <nayna@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-15 18:34:00 -08:00
Adrian Reber	49cb2fc42c	fork: extend clone3() to support setting a PID The main motivation to add set_tid to clone3() is CRIU. To restore a process with the same PID/TID CRIU currently uses /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to ns_last_pid and then (quickly) does a clone(). This works most of the time, but it is racy. It is also slow as it requires multiple syscalls. Extending clone3() to support set_tid makes it possible restore a process using CRIU without accessing /proc/sys/kernel/ns_last_pid and race free (as long as the desired PID/TID is available). This clone3() extension places the same restrictions (CAP_SYS_ADMIN) on clone3() with set_tid as they are currently in place for ns_last_pid. The original version of this change was using a single value for set_tid. At the 2019 LPC, after presenting set_tid, it was, however, decided to change set_tid to an array to enable setting the PID of a process in multiple PID namespaces at the same time. If a process is created in a PID namespace it is possible to influence the PID inside and outside of the PID namespace. Details also in the corresponding selftest. To create a process with the following PIDs: PID NS level Requested PID 0 (host) 31496 1 42 2 1 For that example the two newly introduced parameters to struct clone_args (set_tid and set_tid_size) would need to be: set_tid[0] = 1; set_tid[1] = 42; set_tid[2] = 31496; set_tid_size = 3; If only the PIDs of the two innermost nested PID namespaces should be defined it would look like this: set_tid[0] = 1; set_tid[1] = 42; set_tid_size = 2; The PID of the newly created process would then be the next available free PID in the PID namespace level 0 (host) and 42 in the PID namespace at level 1 and the PID of the process in the innermost PID namespace would be 1. The set_tid array is used to specify the PID of a process starting from the innermost nested PID namespaces up to set_tid_size PID namespaces. set_tid_size cannot be larger then the current PID namespace level. Signed-off-by: Adrian Reber <areber@redhat.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Andrei Vagin <avagin@gmail.com> Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2019-11-15 23:49:22 +01:00
Alexei Starovoitov	5b92a28aae	bpf: Support attaching tracing BPF program to other BPF programs Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type including their subprograms. This feature allows snooping on input and output packets in XDP, TC programs including their return values. In order to do that the verifier needs to track types not only of vmlinux, but types of other BPF programs as well. The verifier also needs to translate uapi/linux/bpf.h types used by networking programs into kernel internal BTF types used by FENTRY/FEXIT BPF programs. In some cases LLVM optimizations can remove arguments from BPF subprograms without adjusting BTF info that LLVM backend knows. When BTF info disagrees with actual types that the verifiers sees the BPF trampoline has to fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT program can still attach to such subprograms, but it won't be able to recognize pointer types like 'struct sk_buff *' and it won't be able to pass them to bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program would need to use bpf_probe_read_kernel() instead. The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd points to previously loaded BPF program the attach_btf_id is BTF type id of main function or one of its subprograms. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org	2019-11-15 23:45:24 +01:00
Alexei Starovoitov	8c1b6e69dc	bpf: Compare BTF types of functions arguments with actual types Make the verifier check that BTF types of function arguments match actual types passed into top-level BPF program and into BPF-to-BPF calls. If types match such BPF programs and sub-programs will have full support of BPF trampoline. If types mismatch the trampoline has to be conservative. It has to save/restore five program arguments and assume 64-bit scalars. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org	2019-11-15 23:45:02 +01:00
Alexei Starovoitov	91cc1a9974	bpf: Annotate context types Annotate BPF program context types with program-side type and kernel-side type. This type information is used by the verifier. btf_get_prog_ctx_type() is used in the later patches to verify that BTF type of ctx in BPF program matches to kernel expected ctx type. For example, the XDP program type is: BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) That means that XDP program should be written as: int xdp_prog(struct xdp_md *ctx) { ... } Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org	2019-11-15 23:44:48 +01:00
Alexei Starovoitov	9cc31b3a09	bpf: Fix race in btf_resolve_helper_id() btf_resolve_helper_id() caching logic is a bit racy, since under root the verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE. Fix the type as well, since error is also recorded. Fixes: `a7658e1a41` ("bpf: Check types of arguments passed into helpers") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org	2019-11-15 23:44:20 +01:00
Alexei Starovoitov	fec56f5890	bpf: Introduce BPF trampoline Introduce BPF trampoline concept to allow kernel code to call into BPF programs with practically zero overhead. The trampoline generation logic is architecture dependent. It's converting native calling convention into BPF calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The registers R1 to R5 are used to pass arguments into BPF functions. The main BPF program accepts only single argument "ctx" in R1. Whereas CPU native calling convention is different. x86-64 is passing first 6 arguments in registers and the rest on the stack. x86-32 is passing first 3 arguments in registers. sparc64 is passing first 6 in registers. And so on. The trampolines between BPF and kernel already exist. BPF_CALL_x macros in include/linux/filter.h statically compile trampolines from BPF into kernel helpers. They convert up to five u64 arguments into kernel C pointers and integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On 32-bit architecture they're meaningful. The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert kernel function arguments into array of u64s that BPF program consumes via R1=ctx pointer. This patch set is doing the same job as __bpf_trace_##call() static trampolines, but dynamically for any kernel function. There are ~22k global kernel functions that are attachable via nop at function entry. The function arguments and types are described in BTF. The job of btf_distill_func_proto() function is to extract useful information from BTF into "function model" that architecture dependent trampoline generators will use to generate assembly code to cast kernel function arguments into array of u64s. For example the kernel function eth_type_trans has two pointers. They will be casted to u64 and stored into stack of generated trampoline. The pointer to that stack space will be passed into BPF program in R1. On x86-64 such generated trampoline will consume 16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will make sure that only two u64 are accessed read-only by BPF program. The verifier will also recognize the precise type of the pointers being accessed and will not allow typecasting of the pointer to a different type within BPF program. The tracing use case in the datacenter demonstrated that certain key kernel functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always active. Other functions have both kprobe and kretprobe. So it is essential to keep both kernel code and BPF programs executing at maximum speed. Hence generated BPF trampoline is re-generated every time new program is attached or detached to maintain maximum performance. To avoid the high cost of retpoline the attached BPF programs are called directly. __bpf_prog_enter/exit() are used to support per-program execution stats. In the future this logic will be optimized further by adding support for bpf_stats_enabled_key inside generated assembly code. Introduction of preemptible and sleepable BPF programs will completely remove the need to call to __bpf_prog_enter/exit(). Detach of a BPF program from the trampoline should not fail. To avoid memory allocation in detach path the half of the page is used as a reserve and flipped after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly which is enough for BPF tracing use cases. This limit can be increased in the future. BPF_TRACE_FENTRY programs have access to raw kernel function arguments while BPF_TRACE_FEXIT programs have access to kernel return value as well. Often kprobe BPF program remembers function arguments in a map while kretprobe fetches arguments from a map and analyzes them together with return value. BPF_TRACE_FEXIT accelerates this typical use case. Recursion prevention for kprobe BPF programs is done via per-cpu bpf_prog_active counter. In practice that turned out to be a mistake. It caused programs to randomly skip execution. The tracing tools missed results they were looking for. Hence BPF trampoline doesn't provide builtin recursion prevention. It's a job of BPF program itself and will be addressed in the follow up patches. BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases in the future. For example to remove retpoline cost from XDP programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org	2019-11-15 23:41:51 +01:00
Alexei Starovoitov	5964b2000f	bpf: Add bpf_arch_text_poke() helper Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch nops/calls in kernel text into calls into BPF trampoline and to patch calls/nops inside BPF programs too. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org	2019-11-15 23:41:28 +01:00
Ilya Leoshkevich	b7b3fc8dd9	bpf: Support doubleword alignment in bpf_jit_binary_alloc Currently passing alignment greater than 4 to bpf_jit_binary_alloc does not work: in such cases it silently aligns only to 4 bytes. On s390, in order to load a constant from memory in a large (>512k) BPF program, one must use lgrl instruction, whose memory operand must be aligned on an 8-byte boundary. This patch makes it possible to request 8-byte alignment from bpf_jit_binary_alloc, and also makes it issue a warning when an unsupported alignment is requested. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com	2019-11-15 22:25:00 +01:00
Wolfram Sang	9af433840b	i2c: remove helpers for ref-counting clients There are no in-tree users of these helpers anymore, and there shouldn't. Most use cases went away once the driver model started to refcount for us. There have been users like the media subsystem, but they all switched to better refcounting methods meanwhile. Media did this in 2008. Last user (IPMI) left 2018. Remove this cruft. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Jean Delvare <jdelvare@suse.de> Tested-by: Luca Ceresoli <luca@lucaceresoli.net> Reviewed-by: Luca Ceresoli <luca@lucaceresoli.net> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>	2019-11-15 22:04:48 +01:00
Al Viro	6c2d4798a8	new helper: lookup_positive_unlocked() Most of the callers of lookup_one_len_unlocked() treat negatives are ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - it's ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-11-15 13:49:04 -05:00
Al Viro	d41efb522e	fs/namei.c: pull positivity check into follow_managed() There are 4 callers; two proceed to check if result is positive and fail with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield positive and one (in lookup_fast()) is _preceded_ by positivity check. However, follow_managed() on a negative dentry is a (fairly cheap) no-op on anything other than autofs. And negative autofs dentries are never hashed, so lookup_fast() is not going to run into one of those. Moreover, successful follow_managed() on a _positive_ dentry never yields a negative one (and we significantly rely upon that in callers of lookup_fast()). In other words, we can easily transpose the positivity check and the call of follow_managed() in lookup_fast(). And that allows to fold the positivity check into follow_managed(), simplifying life for the code downstream of its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-11-15 13:49:04 -05:00
David Howells	6718b6f855	pipe: Allow pipes to have kernel-reserved slots Split pipe->ring_size into two numbers: (1) pipe->ring_size - indicates the hard size of the pipe ring. (2) pipe->max_usage - indicates the maximum number of pipe ring slots that userspace orchestrated events can fill. This allows for a pipe that is both writable by the general kernel notification facility and by userspace, allowing plenty of ring space for notifications to be added whilst preventing userspace from being able to pin too much unswappable kernel space. Signed-off-by: David Howells <dhowells@redhat.com>	2019-11-15 16:22:54 +00:00

... 115 116 117 118 119 ...

75139 commits