Commit graph

753663 commits

Author SHA1 Message Date
Jakub Kicinski
bfee64deaa nfp: bpf: add map deletes from the datapath
Support calling map_delete_elem() FW helper from the datapath
programs.  For JIT checks and code are basically equivalent
to map lookups.  Similarly to other map helper key must be on
the stack.  Different pointer types are left for future extension.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:13 -07:00
Jakub Kicinski
44d65a47ae nfp: bpf: add map updates from the datapath
Support calling map_update_elem() from the datapath programs
by calling into FW-provided helper.  Value pointer is passed
in LM pointer #2.  Keeping track of old state for arg3 is not
necessary, since LM pointer #2 will be always loaded in this
case, the trivial optimization for value at the bottom of the
stack can't be done here.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:13 -07:00
Jakub Kicinski
289c5b7630 nfp: bpf: add helper for basic map call checks
Add a verifier helper for performing the basic state checks
before a call to a map helper.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:13 -07:00
Jakub Kicinski
2f46e0c127 nfp: bpf: add helper for validating stack pointers
Our implementation has restriction on stack pointers for function
calls.  Move the common checks into a helper for reuse.  The state
has to be encapsulated into a structure to support parameters
other than BPF_REG_2.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:13 -07:00
Jakub Kicinski
fc4484970e nfp: bpf: rename map_lookup_stack() to map_call_stack_common()
We will reuse most of map call code gen for other map calls.
Rename the lookup gen function and use meta->func_id instead
of hard-coding lookup.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:13 -07:00
Jiong Wang
87b10ecdce nfp: bpf: detect packet reads could be cached, enable the optimisation
This patch is the front end of this optimisation, it detects and marks
those packet reads that could be cached. Then the optimisation "backend"
will be activated automatically.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:12 -07:00
Jiong Wang
91ff69e840 nfp: bpf: support unaligned read offset
This patch add the support for unaligned read offset, i.e. the read offset
to the start of packet cache area is not aligned to REG_WIDTH. In this
case, the read area might across maximum three transfer-in registers.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:12 -07:00
Jiong Wang
be75923786 nfp: bpf: read from packet data cache for PTR_TO_PACKET
This patch assumes there is a packet data cache, and would try to read
packet data from the cache instead of from memory.

This patch only implements the optimisation "backend", it doesn't build
the packet data cache, so this optimisation is not enabled.

This patch has only enabled aligned packet data read, i.e. when the read
offset to the start of cache is REG_WIDTH aligned.

Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-03-28 19:36:12 -07:00
Colin Ian King
f62fd7a777 ecryptfs: fix spelling mistake: "cadidate" -> "candidate"
Trivial fix to spelling mistake in debug message text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2018-03-29 01:33:30 +00:00
Guenter Roeck
ab13a9218c ecryptfs: lookup: Don't check if mount_crypt_stat is NULL
mount_crypt_stat is assigned to
&ecryptfs_superblock_to_private(ecryptfs_dentry->d_sb)->mount_crypt_stat,
and mount_crypt_stat is not the first object in struct ecryptfs_sb_info.
mount_crypt_stat is therefore never NULL. At the same time, no crash
in ecryptfs_lookup() has been reported, and the lookup functions in
other file systems don't check if d_sb is NULL either.
Given that, remove the NULL check.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2018-03-29 01:33:29 +00:00
Linus Torvalds
0b412605ef tegra + amdkfd final fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJavDfaAAoJEAx081l5xIa+PJkQAJJTlmLeXSFhzMyxmBwOEiKl
 jAawsi7tTiJTjKeMwDmBZDbaU6TdlmEiwpwR0GgBvn+Bp7yM7MulSEByQF8lVv19
 fFnakWFyYiCvQyeuPIxXud1z7CihCFYDFhxLdkcB/fSbo9fiCwn07G895HKCY//P
 V6+O+5ghYckVVOYJzXhhPmPQnRAPPAL2vb924lvqQnnfdyufQTpZLsTFdZWIIDSO
 USiD8aLZ/fkyLvgTsOL/Fw2lsX/ToQkgnzciO3h/xMpGy2VY+C1gDe0Fp2KOJZrv
 dXdTGvoUi76z98l5hz+R+aCVBTKFEEtRIYKvpeWmHpZPU1bIoyCCZChQfEw3w13M
 +Vs4trHEixnOkPnHUorfUW+dPZRBvKqKXUDEBFdI3zUqaU7oWo2uWpKFqwvlu8GJ
 /MdMBDPFTy3RVCbecpEprTTtIXMJiNkNSod3rsGjEHxIZPSGUUYNGwqUh+9ZeGSf
 3GR3uIACKyizrLNRQfR8168XMGDwOYg8CNIu1gn1wzJeu6BSJR9OAH/UPpqybXUh
 zICNJp6JrHe2A49JorfO+UjI3vVk+4DhjrFXmRjrykMiftz0xF2TrGmYe3/6QsQ+
 WjUqFgSZmNP6JIeblfNpGA1J/FztJHDtfZfY52qPGweOx0nJMb4NLJCMIr3ulrQC
 5d7Cgz7KyOPD7JqnJCTL
 =oXU3
 -----END PGP SIGNATURE-----

Merge tag 'drm-fixes-for-v4.16-rc8' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "Nothing serious, two amdkfd and two tegra fixes"

* tag 'drm-fixes-for-v4.16-rc8' of git://people.freedesktop.org/~airlied/linux:
  drm/tegra: dc: Using NULL instead of plain integer
  drm/amdkfd: Deallocate SDMA queues correctly
  drm/amdkfd: Fix scratch memory with HWS enabled
  drm/tegra: dc: Use correct format array for Tegra124
2018-03-28 15:07:23 -10:00
Masahiro Yamada
28913ee819 netfilter: nf_nat_snmp_basic: add correct dependency to Makefile
nf_nat_snmp_basic_main.c includes a generated header, but the
necessary dependency is missing in Makefile. This could cause
build error in parallel building.

Remove a weird line, and add a correct one.

Fixes: cc2d58634e ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2018-03-29 09:42:32 +09:00
Linus Torvalds
68b8dffce6 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "8 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: demote ARM port to "odd fixes"
  MAINTAINERS: correct rmk's email address
  mm/kmemleak.c: wait for scan completion before disabling free
  mm/memcontrol.c: fix parameter description mismatch
  mm/vmstat.c: fix vmstat_update() preemption BUG
  mm/page_owner: fix recursion bug after changing skip entries
  ipc/shm.c: add split function to shm_vm_ops
  mm, slab: memcg_link the SLAB's kmem_cache
2018-03-28 14:34:55 -10:00
Dave Airlie
ef55d1538d drm/tegra: Fixes for v4.16
This contains two small fixes, one which fixes a typo that causes a
 crash with the new framebuffer modifier query support and another that
 fixes a build warning.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEiOrDCAFJzPfAjcif3SOs138+s6EFAlq7sZkTHHRyZWRpbmdA
 bnZpZGlhLmNvbQAKCRDdI6zXfz6zofC6D/9hGJTvHS4ILI8pf1uoVFxFKkviIsXg
 803H9VYavBm5PVBDIbEa1jw/TPjMBnaqzIH3MAjwxF57gaiFQzlnkaXEd9XZZAXf
 AywOU3ONbVXkDunYr1zU2SyhluCnWYroOAJMzo0VVhOhymoC0q6QH+CI88+EtI53
 MnELeFieRspFdzB1dX4Hxn0HoMZ8jXdnn4Jr2/LD+Da8hLfXGvEZPwBYj5WPgGRi
 vKT8KMRngtbka44ymQfD6oxEpgvaEJSZ1wnfXl4guW9SnRTUq33KTgW4xlMPw/2+
 //H7clEvmE61FsiNbnBqXfyRSsnUq90yI+QBrmvNFY1H8iqvfRleYE/kY0R8o5eg
 lb9822mPCu9B6DDukP4hz5pPS7iX9m4V+9CF9pPkG3PX+ogZAOfZ05odIGuoxICU
 EvVY5uQuksleEu6YbgfyZStGutLPxqoJ4ocMkAQRWRLeNIUf2jHssdfQqwUvD1gr
 9RcoZtBO5xtN1ro+rvAx3bvetuN+rFae7TYe6oG9X/aOVCVmsrRwORa9143HiKfK
 5/62rugC5gGJz4dUZKMcMjfswTLjzaHg6Te27yFhm945hobuSkaiB3MUjgdXmXNZ
 G8enRnZrL2GhAsqRvCkwISHHMNX8j7QQ94bkzc4Tm/vlnxBm3o8REW9xRk+Mjm0o
 hOHROiAaJ2nDhw==
 =vROp
 -----END PGP SIGNATURE-----

Merge tag 'drm/tegra/for-4.16-fixes' of git://anongit.freedesktop.org/tegra/linux into drm-fixes

drm/tegra: Fixes for v4.16

This contains two small fixes, one which fixes a typo that causes a
crash with the new framebuffer modifier query support and another that
fixes a build warning.

* tag 'drm/tegra/for-4.16-fixes' of git://anongit.freedesktop.org/tegra/linux:
  drm/tegra: dc: Using NULL instead of plain integer
  drm/tegra: dc: Use correct format array for Tegra124
2018-03-29 09:57:09 +10:00
Linus Torvalds
a2601d78b7 powerpc fixes for 4.16 #6
These are actually all fixes for pre-4.16 code, or new hardware workarounds.
 
 Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new firmware interface
 for describing CPU features.
 
 Fix lost pending interrupts due to a race in our interrupt soft-masking code.
 
 A workaround for a nest MMU bug with TLB invalidations on Power9.
 
 A workaround for broadcast TLB invalidations on Power9.
 
 Fix a bug in our instruction SLB miss handler, when handling bad addresses
 (eg. >= TASK_SIZE), which could corrupt non-volatile user GPRs.
 
 Thanks to:
   Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJau3wfExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCz
 dA/+JnB5iKCXCCebnqoaX4AFTqMfxT3nr/+JkfchovZLV0PBVzKME5JtL61udmDe
 j1JZU8UASLqN/8/j652s87XuuRi6xPjSPjMNXmU1LFQ7DjS9yA6FOAsbE4c1Xg4D
 jSded2BSnMRtA/yw8AupvdYr4w72zKMQYzo8/Or3eUQAAge+oX3d1SQiRkD3DOUg
 EdpHnOScSwz6GL9amfaQBhXwvik+4crTQ/wZ/SsTpQrfJkVzHXLn/DnHEP1qO+ky
 v/Y0ix5TxpH132XsVM7UaUvy1ZcZSyEmT2qGOisGm0fj4jesVn9dQMzP+97W4QeW
 ghfHj2fvzx6IsPM3PhNKITknQi/GTrukjSuzYNuj7MyvKY15HUP1MPXNeJUl5thw
 kI5uYWuTvyI3daQKFXRQa7V6H0auuYeEV6/RvIlJ2YtUfqmvyECviNM/+mDC0+Jk
 bgqz47qqeEz2cwIUu/vQm2phVpq+15cLPwmdA37IdyT6GvYgGmsW4HWVIsyxLR2z
 fo9ghX+1oMhmMNhgVYtL2P9BfCzQenK2R+uAmUOHdNyc0LBlGKN+RPAQqQkBhKGp
 BB1L2F13kpeNBNTOsPU4yH3DpPaJFtfnaeL7jd5SanwsxNnoKApFglf0nE73bvbw
 AwRF/vWokbd3WzuPmOtldtluWUHQhaLECU24odVGB/r3XCI=
 =qP8V
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 4.16. Apologies if this is a bit big at
  rc7, but they're all reasonably important fixes. None are actually for
  new code, so they aren't indicative of 4.16 being in bad shape from
  our point of view.

   - Fix missing AT_BASE_PLATFORM (in auxv) when we're using a new
     firmware interface for describing CPU features.

   - Fix lost pending interrupts due to a race in our interrupt
     soft-masking code.

   - A workaround for a nest MMU bug with TLB invalidations on Power9.

   - A workaround for broadcast TLB invalidations on Power9.

   - Fix a bug in our instruction SLB miss handler, when handling bad
     addresses (eg. >= TASK_SIZE), which could corrupt non-volatile user
     GPRs.

  Thanks to: Aneesh Kumar K.V, Balbir Singh, Benjamin Herrenschmidt,
  Nicholas Piggin"

* tag 'powerpc-4.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs
  powerpc/mm: Fixup tlbie vs store ordering issue on POWER9
  powerpc/mm/radix: Move the functions that does the actual tlbie closer
  powerpc/mm/radix: Remove unused code
  powerpc/mm: Workaround Nest MMU bug with TLB invalidations
  powerpc/mm: Add tracking of the number of coprocessors using a context
  powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
  powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features
2018-03-28 13:54:03 -10:00
Linus Torvalds
616d8cf0fa ARM: SoC fixes for 4.16
Here are are a couple of last-minute fixes for 4.16, mostly for
 regressions. As usual, the majory are device tree changes:
 
 - USB 3 support on rk3399 didn't work and is being reverted for now
 
 - One fix for an old suspend/resume bug on rk3399
 
 - A few regulator related fixes on Banana Pi M2, and on imx7d-sdb
 
 - A boot regression fix for all Aspeed SoCs failing to find
   their memory
 
 - One more dtc warning fix
 
 The other changes are:
 
 - A few updates to the MAINTAINERS file
 
 - A revert for an incorrect orion5x cleanup
 
 - Two power management fixes for OMAP
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJau+0YAAoJEGCrR//JCVInPu4P/2cGvKWc7SIARWTFpfadIkbM
 X4O+emNsPhYW3Nr2XRIq8JHxHVRzxVtNWeXEpbpX2uOpv/w9AUt3amyrIpdZ3oHh
 IWdDhX9b4gEU85mWAVCstWmsH4gioBC+LW3cn+GSSFrvQBXJUWHMkDqnLa2GSq32
 NSwvhYQLEpQeJPYQUtZKCt2L73UV1JWhHspMnuAEANZ+D2MbQ0iFKVM+mctkpxKE
 m8pFcGP7yBFm/5SADVo9MKnfqEa2IL5wCUbVz54xC6P+3v/DzgxgQG2dUXVVucBV
 arl+VECHh7IVDX9lxNzMkBUvfRd45dXWuHnf+lx9FE5nVs6OpypuSIrE5xunNeD7
 o0APtfjYbqZA62ZFRKP//3A1/CuyxQxK7PSzMXFO0G8QNleobJBcxsCEhOBLSGc5
 DrGzxtEGKUolY3l+d5VYA9EXlbmc1BWK5zGGWIJ7Id1v/KU54Kj+kIGxs7QDnIKC
 bPy4dw1bV8RzGIEJJcOPuGtdxtWBsHXTcvgXXrMMqPbYi6H3Bh2H+ezpYs9aLUF0
 8ejbPF1ekjN5prsxpWIGxUAd5BluIk5mpvFcYqm2oOkYfolo2yM5oLv91xrjuY68
 xQKr86oJU9Mcyc7IVNbN4L3iUu3MjDxB1p4zGao7ofD52lcsqvxx/P2Nnk0rifg9
 ClZFDtGkPp/76v5bXHb3
 =qZnT
 -----END PGP SIGNATURE-----

Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:
 "Here are are a couple of last-minute fixes for 4.16, mostly for
  regressions. As usual, the majory are device tree changes:

   - USB 3 support on rk3399 didn't work and is being reverted for now

   - One fix for an old suspend/resume bug on rk3399

   - A few regulator related fixes on Banana Pi M2, and on imx7d-sdb

   - A boot regression fix for all Aspeed SoCs failing to find their
     memory

   - One more dtc warning fix

  The other changes are:

   - A few updates to the MAINTAINERS file

   - A revert for an incorrect orion5x cleanup

   - Two power management fixes for OMAP"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: OMAP: Fix SRAM W+X mapping
  ARM: dts: aspeed: Add default memory node
  mailmap: Update email address for Gregory CLEMENT
  ARM: davinci: fix the GPIO lookup for omapl138-hawk
  MAINTAINERS: Update Tegra IOMMU maintainer
  ARM: dts: imx7d-sdb: Fix regulator-usb-otg2-vbus node name
  ARM: ux500: Fix PMU IRQ regression
  ARM: dts: rockchip: Add missing #sound-dai-cells on rk3288
  Revert "arm64: dts: rockchip: add usb3-phy otg-port support for rk3399"
  arm64: dts: rockchip: Fix rk3399-gru-* s2r (pinctrl hogs, wifi reset)
  ARM: OMAP: Fix dmtimer init for omap1
  MAINTAINERS: update email address for Maxime Ripard
  ARM: dts: sun6i: a31s: bpi-m2: add missing regulators
  ARM: dts: sun6i: a31s: bpi-m2: improve pmic properties
2018-03-28 13:52:13 -10:00
Russell King
18bd49043c MAINTAINERS: demote ARM port to "odd fixes"
As of the start of 2018, I am no longer paid to support the core 32-bit
ARM architecture code.  This means that this code is no longer
commercially supported, and is now only supported through voluntary
effort.

I will continue to merge patches as and when able, but this will be at a
lower priority than before (which means a longer latency.) I have also
be scaled back the amount of time spent reading email, so email that is
intended for my attention needs to make itself plainly obvious, or I
will miss it.

In an attempt to reduce the amount of email Cc'd to me, exclude
arch/arm/boot/dts from the maintainers patterns, but add entries for the
SolidRun platforms I look after.

Link: http://lkml.kernel.org/r/E1ezkgn-0002fO-52@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:06 -10:00
Russell King
5b634e8e38 MAINTAINERS: correct rmk's email address
Correct my email address in the MAINTAINTERS file.

Link: http://lkml.kernel.org/r/E1ezkgi-0002fH-01@rmk-PC.armlinux.org.uk
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Vinayak Menon
914b6dfff7 mm/kmemleak.c: wait for scan completion before disabling free
A crash is observed when kmemleak_scan accesses the object->pointer,
likely due to the following race.

  TASK A             TASK B                     TASK C
  kmemleak_write
   (with "scan" and
   NOT "scan=on")
  kmemleak_scan()
                     create_object
                     kmem_cache_alloc fails
                     kmemleak_disable
                     kmemleak_do_cleanup
                     kmemleak_free_enabled = 0
                                                kfree
                                                kmemleak_free bails out
                                                 (kmemleak_free_enabled is 0)
                                                slub frees object->pointer
  update_checksum
  crash - object->pointer
   freed (DEBUG_PAGEALLOC)

kmemleak_do_cleanup waits for the scan thread to complete, but not for
direct call to kmemleak_scan via kmemleak_write.  So add a wait for
kmemleak_scan completion before disabling kmemleak_free, and while at it
fix the comment on stop_scan_thread.

[vinmenon@codeaurora.org: fix stop_scan_thread comment]
  Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Honglei Wang
b213b54fbf mm/memcontrol.c: fix parameter description mismatch
There are a couple of places where parameter description and function
name do not match the actual code.  Fix it.

Link: http://lkml.kernel.org/r/1520843448-17347-1-git-send-email-honglei.wang@oracle.com
Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Steven J. Hill
c7f26ccfb2 mm/vmstat.c: fix vmstat_update() preemption BUG
Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
cause vmstat_update() to report a BUG due to preemption not being
disabled around smp_processor_id().

Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

  BUG: using smp_processor_id() in preemptible [00000000] code:
  kworker/1:1/269
  caller is vmstat_update+0x50/0xa0
  CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
  4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
  Workqueue: mm_percpu_wq vmstat_update
  Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
Signed-off-by: Steven J. Hill <steven.hill@cavium.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Maninder Singh
299815a4fb mm/page_owner: fix recursion bug after changing skip entries
This patch fixes commit 5f48f0bd4e ("mm, page_owner: skip unnecessary
stack_trace entries").

Because if we skip first two entries then logic of checking count value
as 2 for recursion is broken and code will go in one depth recursion.

so we need to check only one call of _RET_IP(__set_page_owner) while
checking for recursion.

Current Backtrace while checking for recursion:-

  (save_stack)             from (__set_page_owner)  // (But recursion returns true here)
  (__set_page_owner)       from (get_page_from_freelist)
  (get_page_from_freelist) from (__alloc_pages_nodemask)
  (__alloc_pages_nodemask) from (depot_save_stack)
  (depot_save_stack)       from (save_stack)       // recursion should return true here
  (save_stack)             from (__set_page_owner)
  (__set_page_owner)       from (get_page_from_freelist)
  (get_page_from_freelist) from (__alloc_pages_nodemask+)
  (__alloc_pages_nodemask) from (depot_save_stack)
  (depot_save_stack)       from (save_stack)
  (save_stack)             from (__set_page_owner)
  (__set_page_owner)       from (get_page_from_freelist)

Correct Backtrace with fix:

  (save_stack)             from (__set_page_owner) // recursion returned true here
  (__set_page_owner)       from (get_page_from_freelist)
  (get_page_from_freelist) from (__alloc_pages_nodemask+)
  (__alloc_pages_nodemask) from (depot_save_stack)
  (depot_save_stack)       from (save_stack)
  (save_stack)             from (__set_page_owner)
  (__set_page_owner)       from (get_page_from_freelist)

Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
Fixes: 5f48f0bd4e ("mm, page_owner: skip unnecessary stack_trace entries")
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ayush Mittal <ayush.m@samsung.com>
Cc: Prakash Gupta <guptap@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Vasyl Gomonovych <gomonovych@gmail.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Cc: <pankaj.m@samsung.com>
Cc: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Mike Kravetz
3d942ee079 ipc/shm.c: add split function to shm_vm_ops
If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and split
the underlying vma such that it is not huge page aligned.  This will
untimately result in the following BUG:

  kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
  CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
  NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
  REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
  MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
  CFAR: c00000000036ee44 SOFTE: 1
  NIP __unmap_hugepage_range+0xa4/0x760
  LR __unmap_hugepage_range_final+0x28/0x50
  Call Trace:
    0x7115e4e00000 (unreliable)
    __unmap_hugepage_range_final+0x28/0x50
    unmap_single_vma+0x11c/0x190
    unmap_vmas+0x94/0x140
    exit_mmap+0x9c/0x1d0
    mmput+0xa8/0x1d0
    do_exit+0x360/0xc80
    do_group_exit+0x60/0x100
    SyS_exit_group+0x24/0x30
    system_call+0x58/0x6c
  ---[ end trace ee88f958a1c62605 ]---

This bug was introduced by commit 31383c6865 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct").  A split function was
added to vm_operations_struct to determine if a mapping can be split.
This was mostly for device-dax and hugetlbfs mappings which have
specific alignment constraints.

Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
original vm_ops if needed.  Add such a split function to shm_vm_ops.

Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c6865 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Shakeel Butt
880cd276df mm, slab: memcg_link the SLAB's kmem_cache
All the root caches are linked into slab_root_caches which was
introduced by the commit 510ded33e0 ("slab: implement slab_root_caches
list") but it missed to add the SLAB's kmem_cache.

While experimenting with opt-in/opt-out kmem accounting, I noticed
system crashes due to NULL dereference inside cache_from_memcg_idx()
while deferencing kmem_cache.memcg_params.memcg_caches.  The upstream
clean kernel will not see these crashes but SLAB should be consistent
with SLUB which does linked its boot caches (kmem_cache_node and
kmem_cache) into slab_root_caches.

Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
Fixes: 510ded33e0 ("slab: implement slab_root_caches list")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28 13:42:05 -10:00
Dave Airlie
694f54f680 Merge branch 'drm-misc-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
- Mask mode type garbage from userspace (Ville)

Something went wrong on the misc tree side, but I'll pull the patch directly.

* 'drm-misc-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc:
  drm: Fix uabi regression by allowing garbage mode->type from userspace
2018-03-29 09:25:13 +10:00
Steve Wise
1b90d3002e RDMA/CMA: remove RDMA_PS_SDP
This is no longer supported, so remove it.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-28 16:23:01 -06:00
Roland Dreier
84652aefb3 RDMA/ucma: Introduce safer rdma_addr_size() variants
There are several places in the ucma ABI where userspace can pass in a
sockaddr but set the address family to AF_IB.  When that happens,
rdma_addr_size() will return a size bigger than sizeof struct sockaddr_in6,
and the ucma kernel code might end up copying past the end of a buffer
not sized for a struct sockaddr_ib.

Fix this by introducing new variants

    int rdma_addr_size_in6(struct sockaddr_in6 *addr);
    int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr);

that are type-safe for the types used in the ucma ABI and return 0 if the
size computed is bigger than the size of the type passed in.  We can use
these new variants to check what size userspace has passed in before
copying any addresses.

Reported-by: <syzbot+6800425d54ed3ed8135d@syzkaller.appspotmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-28 16:13:36 -06:00
Eric W. Biederman
2236d4d390 ipc/shm: Fix pid freeing.
The 0day kernel test build report reported an oops:
>
>  IP: put_pid+0x22/0x5c
>  PGD 19efa067 P4D 19efa067 PUD 0
>  Oops: 0000 [#1]
>  CPU: 0 PID: 727 Comm: trinity Not tainted 4.16.0-rc2-00010-g98f929b #1
>  RIP: 0010:put_pid+0x22/0x5c
>  RSP: 0018:ffff986719f73e48 EFLAGS: 00010202
>  RAX: 00000006d765f710 RBX: ffff98671a4fa4d0 RCX: ffff986719f73d40
>  RDX: 000000006f6e6125 RSI: 0000000000000000 RDI: ffffffffa01e6d21
>  RBP: ffffffffa0955fe0 R08: 0000000000000020 R09: 0000000000000000
>  R10: 0000000000000078 R11: ffff986719f73e76 R12: 0000000000001000
>  R13: 00000000ffffffea R14: 0000000054000fb0 R15: 0000000000000000
>  FS:  00000000028c2880(0000) GS:ffffffffa06ad000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000677846439 CR3: 0000000019fc1005 CR4: 00000000000606b0
>  Call Trace:
>   ? ipc_update_pid+0x36/0x3e
>   ? newseg+0x34c/0x3a6
>   ? ipcget+0x5d/0x528
>   ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
>   ? SyS_shmget+0x5a/0x84
>   ? do_syscall_64+0x194/0x1b3
>   ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
>  Code: ff 05 e7 20 9b 03 58 c9 c3 48 ff 05 85 21 9b 03 48 85 ff 74 4f 8b 47 04 8b 17 48 ff 05 7c 21 9b 03 48 83 c0 03 48 c1 e0 04 ff ca <48> 8b 44 07 08 74 1f 48 ff 05 6c 21 9b 03 ff 0f 0f 94 c2 48 ff
>  RIP: put_pid+0x22/0x5c RSP: ffff986719f73e48
>  CR2: 0000000677846439
>  ---[ end trace ab8c5cb4389d37c5 ]---
>  Kernel panic - not syncing: Fatal exception

In newseg when changing shm_cprid and shm_lprid from pid_t to struct
pid* I misread the kvmalloc as kvzalloc and thought shp was
initialized to 0.  As that is not the case it is not safe to for the
error handling to address shm_cprid and shm_lprid before they are
initialized.

Therefore move the cleanup of shm_cprid and shm_lprid from the no_file
error cleanup path to the no_id error cleanup path.  Ensuring that an
early error exit won't cause the oops above.

Reported-by: kernel test robot <fengguang.wu@intel.com>
Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-03-28 16:40:08 -05:00
Peter Zijlstra
c28d62cf52 locking/rtmutex: Handle non enqueued waiters gracefully in remove_waiter()
In -RT task_blocks_on_rt_mutex() may return with -EAGAIN due to
(->pi_blocked_on == PI_WAKEUP_INPROGRESS) before it added itself as a
waiter. In such a case remove_waiter() must not be called because without a
waiter it will trigger the BUG_ON() statement.

This was initially reported by Yimin Deng. Thomas Gleixner fixed it then
with an explicit check for waiters before calling remove_waiter().

Instead of an explicit NULL check before calling rt_mutex_top_waiter() make
the function return NULL if there are no waiters. With that fixed the now
pointless NULL check is removed from rt_mutex_slowlock().

Reported-and-debugged-by: Yimin Deng <yimin11.deng@gmail.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/CAAh1qt=DCL9aUXNxanP5BKtiPp3m+qj4yB+gDohhXPVFCxWwzg@mail.gmail.com
Link: https://lkml.kernel.org/r/20180327121438.sss7hxg3crqy4ecd@linutronix.de
2018-03-28 23:01:30 +02:00
Daniel Borkmann
f6ef565893 Merge branch 'bpf-raw-tracepoints'
Alexei Starovoitov says:

====================
v7->v8:
- moved 'u32 num_args' from 'struct tracepoint' into 'struct bpf_raw_event_map'
  that increases memory overhead, but can be optimized/compressed later.
  Now it's zero changes in tracepoint.[ch]

v6->v7:
- adopted Steven's bpf_raw_tp_map section approach to find tracepoint
  and corresponding bpf probe function instead of kallsyms approach.
  dropped kernel_tracepoint_find_by_name() patch

v5->v6:
- avoid changing semantics of for_each_kernel_tracepoint() function, instead
  introduce kernel_tracepoint_find_by_name() helper

v4->v5:
- adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 6

v3->v4:
- adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or small
  struct to u64. That nicely reduced the size of patch 1

v2->v3:
- with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE macros
  (or rather moved them from apparmor)
  that cleaned up patch 6
- added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17 args to 4
  Now any tracepoint with >12 args will have build error

v1->v2:
- simplified api by combing bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
  bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
  That simplifies bpf_detach as well which is now simple close() of fd.
- fixed memory leak in error path which was spotted by Daniel.
- fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
- added more tests
- fixed allyesconfig build caught by buildbot

v1:
This patch set is a different way to address the pressing need to access
task_struct pointers in sched tracepoints from bpf programs.

The first approach simply added these pointers to sched tracepoints:
https://lkml.org/lkml/2017/12/14/753
which Peter nacked.
Few options were discussed and eventually the discussion converged on
doing bpf specific tracepoint_probe_register() probe functions.
Details here:
https://lkml.org/lkml/2017/12/20/929

Patch 1 is kernel wide cleanup of pass-struct-by-value into
pass-struct-by-reference into tracepoints.

Patches 2 and 3 are minor cleanups to address allyesconfig build

Patch 4 refactor trace_iwlwifi_dev_ucode_error from 17 to 4 args

Patch 5 introduces COUNT_ARGS macro

Patch 6 introduces BPF_RAW_TRACEPOINT api.
the auto-cleanup and multiple concurrent users are must have
features of tracing api. For bpf raw tracepoints it looks like:
  // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
  prog_fd = bpf_prog_load(...);

  // receive anon_inode fd for given bpf_raw_tracepoint
  // and attach bpf program to it
  raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool will automatically
detach bpf program, unload it and unregister tracepoint probe.
More details in patch 6.

Patch 7 - trivial support in libbpf
Patches 8, 9 - user space tests

samples/bpf/test_overhead performance on 1 cpu:

tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
task_rename   1.1M   769K        947K            1.0M
urandom_read  789K   697K        750K            755K
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:21 +02:00
Alexei Starovoitov
3bbe086988 selftests/bpf: test for bpf_get_stackid() from raw tracepoints
similar to traditional traceopint test add bpf_get_stackid() test
from raw tracepoints
and reduce verbosity of existing stackmap test

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
4662a4e538 samples/bpf: raw tracepoint test
add empty raw_tracepoint bpf program to test overhead similar
to kprobe and traditional tracepoint tests

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
a0fe3e574b libbpf: add bpf_raw_tracepoint_open helper
add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
c4f6699dfc bpf: introduce BPF_RAW_TRACEPOINT
Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
kernel internal arguments of the tracepoints in their raw form.

>From bpf program point of view the access to the arguments look like:
struct bpf_raw_tracepoint_args {
       __u64 args[0];
};

int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
{
  // program can read args[N] where N depends on tracepoint
  // and statically verified at program load+attach time
}

kprobe+bpf infrastructure allows programs access function arguments.
This feature allows programs access raw tracepoint arguments.

Similar to proposed 'dynamic ftrace events' there are no abi guarantees
to what the tracepoints arguments are and what their meaning is.
The program needs to type cast args properly and use bpf_probe_read()
helper to access struct fields when argument is a pointer.

For every tracepoint __bpf_trace_##call function is prepared.
In assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:     mov    %ecx,%ecx
   0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
All of ~500 of __bpf_trace_*() functions are only 5-10 byte long
and in total this approach adds 7k bytes to .text.

This approach gives the lowest possible overhead
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since tracepoint+bpf are used at speeds of 1M+ events per second
this is valuable optimization.

The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
that returns anon_inode FD of 'bpf-raw-tracepoint' object.

The user space looks like:
// load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
prog_fd = bpf_prog_load(...);
// receive anon_inode fd for given bpf_raw_tracepoint with prog attached
raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);

Ctrl-C of tracing daemon or cmdline tool that uses this feature
will automatically detach bpf program, unload it and
unregister tracepoint probe.

On the kernel side the __bpf_raw_tp_map section of pointers to
tracepoint definition and to __bpf_trace_*() probe function is used
to find a tracepoint with "xdp_exception" name and
corresponding __bpf_trace_xdp_exception() probe function
which are passed to tracepoint_probe_register() to connect probe
with tracepoint.

Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
tracepoint mechanisms. perf_event_open() can be used in parallel
on the same tracepoint.
Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
Each with its own bpf program. The kernel will execute
all tracepoint probes and all attached bpf programs.

In the future bpf_raw_tracepoints can be extended with
query/introspection logic.

__bpf_raw_tp_map section logic was contributed by Steven Rostedt

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
cf14f27f82 macro: introduce COUNT_ARGS() macro
move COUNT_ARGS() macro from apparmor to generic header and extend it
to count till twelve.

COUNT() was an alternative name for this logic, but it's used for
different purpose in many other places.

Similarly for CONCATENATE() macro.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
4fe43c2c00 net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint
fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
instead of all 17 arguments by value.
dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
defined with very similar yet subtly different fields and offsets.
tracepoint is still common and using definition of 'struct iwl_error_event_table'
from dvm/commands.h while copying fields.
Long term this tracepoint probably should be split into two.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:19 +02:00
Alexei Starovoitov
14624a9387 net/mac802154: disambiguate mac80215 vs mac802154 trace events
two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:18 +02:00
Alexei Starovoitov
d992ee6c1e net/mediatek: disambiguate mt76 vs mt7601u trace events
two trace events defined with the same name and both unused.
They conflict in allyesconfig build. Rename one of them.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:18 +02:00
Alexei Starovoitov
c105547501 treewide: remove large struct-pass-by-value from tracepoint arguments
- fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by value
- convert 'type array[]' tracepoint arguments into 'type *array',
  since compiler will warn that sizeof('type array[]') == sizeof('type *array')
  and later should be used instead

The CAST_TO_U64 macro in the later patch will enforce that tracepoint
arguments can only be integers, pointers, or less than 8 byte structures.
Larger structures should be passed by reference.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-28 22:55:18 +02:00
Liran Alon
f497b6c25d KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
When vCPU runs L2 and there is a pending event that requires to exit
from L2 to L1 and nested_run_pending=1, vcpu_enter_guest() will request
an immediate-exit from L2 (See req_immediate_exit).

Since now handling of req_immediate_exit also makes sure to set
KVM_REQ_EVENT, there is no need to also set it on vmx_vcpu_run() when
nested_run_pending=1.

This optimizes cases where VMRESUME was executed by L1 to enter L2 and
there is no pending events that require exit from L2 to L1. Previously,
this would have set KVM_REQ_EVENT unnecessarly.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon
1a680e355c KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
IDT-vectoring-info which vmx_complete_interrupts() makes sure to
reinject before next resume of L2.

While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
to the L1 vCPU which currently runs L2 and exited to L0.

When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
it will note that a previous event was re-injected to L2 (by
IDT-vectoring-info) and therefore won't check if there are pending L1
events which require exit from L2 to L1. Thus, L0 enters L2 without
immediate VMExit even though there are pending L1 events!

This commit fixes the issue by making sure to check for L1 pending
events even if a previous event was reinjected to L2 and bailing out
from inject_pending_event() before evaluating a new pending event in
case an event was already reinjected.

The bug was observed by the following setup:
* L0 is a 64CPU machine which runs KVM.
* L1 is a 16CPU machine which runs KVM.
* L0 & L1 runs with APICv disabled.
(Also reproduced with APICv enabled but easier to analyze below info
with APICv disabled)
* L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely and analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
that he has sent for another L1 vCPU. And all other L1 vCPUs are
currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel-preemption disabled).

Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15

Which can be analyzed as follows:
(1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
a pending-interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
(3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns. Without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without immediate-exit even though there is a pending
L1 event (The IPI pending in LAPIC's IRR).

We have already reached the buggy scenario but events could be
furthered analyzed:
(5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon
a042c26fd8 KVM: x86: Fix misleading comments on handling pending exceptions
The reason that exception.pending should block re-injection of
NMI/interrupt is not described correctly in comment in code.
Instead, it describes why a pending exception should be injected
before a pending NMI/interrupt.

Therefore, move currently present comment to code-block evaluating
a new pending event which explains why exception.pending is evaluated
first.
In addition, create a new comment describing that exception.pending
blocks re-injection of NMI/interrupt because the exception was
queued by handling vmexit which was due to NMI/interrupt delivery.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
[Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon
04140b4144 KVM: x86: Rename interrupt.pending to interrupt.injected
For exceptions & NMIs events, KVM code use the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but it's side-effects have not yet occurred.
*) "injected" represents an event that it's side-effects have already
occurred.

However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when it's side-effects
have already taken place (For example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.

This change follows logic of previous commit 664f8e26b0 ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.

It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still incorrect.
In this case, interrrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also suppose to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
is misusing (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leads to nVMX/nSVM not be able to distinguish if it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.

This patch introduce no semantics change.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Liran Alon
7c5a6a5970 KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
kvm_inject_realmode_interrupt() is called from one of the injection
functions which writes event-injection to VMCS: vmx_queue_exception(),
vmx_inject_irq() and vmx_inject_nmi().

All these functions are called just to cause an event-injection to
guest. They are not responsible of manipulating the event-pending
flag. The only purpose of kvm_inject_realmode_interrupt() should be
to emulate real-mode interrupt-injection.

This was also incorrect when called from vmx_queue_exception().

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov
773e8a0425 x86/kvm: use Enlightened VMCS when running on Hyper-V
Enlightened VMCS is just a structure in memory, the main benefit
besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
mask: we tell the underlying hypervisor which fields were modified
since VMEXIT so there's no need to inspect them all.

Tight CPUID loop test shows significant speedup:
Before: 18890 cycles
After: 8304 cycles

Static key is being used to avoid performance penalty for non-Hyper-V
deployments.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov
5431390b30 x86/hyper-v: detect nested features
TLFS 5.0 says: "Support for an enlightened VMCS interface is reported with
CPUID leaf 0x40000004. If an enlightened VMCS interface is supported,
 additional nested enlightenments may be discovered by reading the CPUID
leaf 0x4000000A (see 2.4.11)."

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov
68d1eb72ee x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
The definitions are according to the Hyper-V TLFS v5.0. KVM on Hyper-V will
use these.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov
a46d15cc1a x86/hyper-v: allocate and use Virtual Processor Assist Pages
Virtual Processor Assist Pages usage allows us to do optimized EOI
processing for APIC, enable Enlightened VMCS support in KVM and more.
struct hv_vp_assist_page is defined according to the Hyper-V TLFS v5.0b.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Ladi Prosek
d4abc577bb x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGE
The assist page has been used only for the paravirtual EOI so far, hence
the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's
called "Virtual VP Assist MSR".

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00
Vitaly Kuznetsov
415bd1cd3a x86/hyper-v: move definitions from TLFS to hyperv-tlfs.h
mshyperv.h now only contains fucntions/variables we define in kernel, all
definitions from TLFS should go to hyperv-tlfs.h.

'enum hv_cpuid_function' is removed as we already have this info in
hyperv-tlfs.h, code in mshyperv.c is adjusted accordingly.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-28 22:47:06 +02:00