Commit graph

781771 commits

Author SHA1 Message Date
Geert Uytterhoeven
45c9a603a4 dmaengine: rcar-dmac: Disable interrupts while stopping channels
During system reboot or halt, with lockdep enabled:

    ================================
    WARNING: inconsistent lock state
    4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41 Tainted: G        W
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    reboot/2779 [HC0[0]:SC0[0]:HE1:SE1] takes:
    0000000098ae4ad3 (&(&rchan->lock)->rlock){?.-.}, at: rcar_dmac_shutdown+0x58/0x6c
    {IN-HARDIRQ-W} state was registered at:
      lock_acquire+0x208/0x238
      _raw_spin_lock+0x40/0x54
      rcar_dmac_isr_channel+0x28/0x200
      __handle_irq_event_percpu+0x1c0/0x3c8
      handle_irq_event_percpu+0x34/0x88
      handle_irq_event+0x48/0x78
      handle_fasteoi_irq+0xc4/0x12c
      generic_handle_irq+0x18/0x2c
      __handle_domain_irq+0xa8/0xac
      gic_handle_irq+0x78/0xbc
      el1_irq+0xec/0x1c0
      arch_cpu_idle+0xe8/0x1bc
      default_idle_call+0x2c/0x30
      do_idle+0x144/0x234
      cpu_startup_entry+0x20/0x24
      rest_init+0x27c/0x290
      start_kernel+0x430/0x45c
    irq event stamp: 12177
    hardirqs last  enabled at (12177): [<ffffff800881d804>] _raw_spin_unlock_irq+0x2c/0x4c
    hardirqs last disabled at (12176): [<ffffff800881d638>] _raw_spin_lock_irq+0x1c/0x60
    softirqs last  enabled at (11948): [<ffffff8008081da8>] __do_softirq+0x160/0x4ec
    softirqs last disabled at (11935): [<ffffff80080ec948>] irq_exit+0xa0/0xfc

    other info that might help us debug this:
     Possible unsafe locking scenario:

	   CPU0
	   ----
      lock(&(&rchan->lock)->rlock);
      <Interrupt>
	lock(&(&rchan->lock)->rlock);

     *** DEADLOCK ***

    3 locks held by reboot/2779:
     #0: 00000000bfabfa74 (reboot_mutex){+.+.}, at: sys_reboot+0xdc/0x208
     #1: 00000000c75d8c3a (&dev->mutex){....}, at: device_shutdown+0xc8/0x1c4
     #2: 00000000ebec58ec (&dev->mutex){....}, at: device_shutdown+0xd8/0x1c4

    stack backtrace:
    CPU: 6 PID: 2779 Comm: reboot Tainted: G        W         4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41
    Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
    Call trace:
     dump_backtrace+0x0/0x148
     show_stack+0x14/0x1c
     dump_stack+0xb0/0xf0
     print_usage_bug.part.26+0x1c4/0x27c
     mark_lock+0x38c/0x610
     __lock_acquire+0x3fc/0x14d4
     lock_acquire+0x208/0x238
     _raw_spin_lock+0x40/0x54
     rcar_dmac_shutdown+0x58/0x6c
     platform_drv_shutdown+0x20/0x2c
     device_shutdown+0x160/0x1c4
     kernel_restart_prepare+0x34/0x3c
     kernel_restart+0x14/0x5c
     sys_reboot+0x160/0x208
     el0_svc_naked+0x30/0x34

rcar_dmac_stop_all_chan() takes the channel lock while stopping a
channel, but does not disable interrupts, leading to a deadlock when a
DMAC interrupt comes in.  Before, the same code block was called from an
interrupt handler, hence taking the spinlock was sufficient.

Fix this by disabling local interrupts while taking the spinlock.

Fixes: 9203dbec90 ("dmaengine: rcar-dmac: don't use DMAC error interrupt")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
2018-07-09 22:20:53 +05:30
Krzysztof Kozlowski
8ab11f8068 ARM: tegra: Work safely with 256 MB Colibri-T20 modules
Colibri-T20 can come in 256 MB RAM (with 512 MB NAND) or 512 MB RAM
(with 1024 MB NAND) flavors.  Both of them will use the same DTSI
expecting the bootloader to do the fixup of /memory node.  However in
case it does not happen, let's stay on safe side by limiting the memory
to 256 MB for both versions of Colibri-T20.

Rename to remove the unnecessary memory size from the device tree file
name.  While at it, also follow the typical Toradex SoC, module, carrier
board hierarchy.

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Tested-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2018-07-09 18:50:53 +02:00
Krzysztof Kozlowski
35a21229f8 ARM: tegra: Fix unit_address_vs_reg and avoid_unnecessary_addr_size DTC warnings
Remove unneeded address/size cells properties and unit addresses to fix
DTC warnings like:

    arch/arm/boot/dts/tegra30-apalis-eval.dtb: Warning (unit_address_vs_reg):
        /i2c@7000d000/stmpe811@41/stmpe_touchscreen@0: node has a unit name, but no reg property
    arch/arm/boot/dts/tegra30-apalis-eval.dtb: Warning (avoid_unnecessary_addr_size):
        /i2c@7000d000/stmpe811@41: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2018-07-09 18:50:33 +02:00
Krzysztof Kozlowski
482997699e ARM: tegra: Fix unit_address_vs_reg DTC warnings for /memory
Add a generic /memory node in each Tegra DTSI (with empty reg property,
to be overidden by each DTS) and set proper unit address for /memory
nodes to fix the DTC warnings:

    arch/arm/boot/dts/tegra20-harmony.dtb: Warning (unit_address_vs_reg):
        /memory: node has a reg or ranges property, but no unit name

The DTB after the change is the same as before except adding
unit-address to /memory node.

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2018-07-09 18:50:10 +02:00
Krzysztof Kozlowski
f48ba1ae6a ARM: tegra: Remove usage of deprecated skeleton.dtsi
Remove the usage of skeleton.dtsi because it was deprecated since commit
9c0da3cc61 ("ARM: dts: explicitly mark skeleton.dtsi as deprecated").
It also allows later to fix DTC warnings for missing unit name in
/memory nodes.

Compiled DTBs are the same as before this commit.

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
2018-07-09 18:49:44 +02:00
Stephen Boyd
166f3a8ad6 First round of updates for meson clocks targeted at v4.19
* Remove legacy register access (finish moving to syscon)
 * Clean up configuration flags
 * Add axg PCIe clocks
 * Add GEN CLK on gxbb, gxl and axg
 * Remove clk_audio_divider driver
 * Add axg audio clock controller
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE9OFZrhjz9W1fG7cb5vwPHDfy2oUFAltDV88ACgkQ5vwPHDfy
 2oWk1w//bKOHNvjBEqkA5kZcFd6hQ81KP0poG3QoZqd9Mkvj7sT1GOUVm8QS9oll
 neVXUT6JIavwOobwnD6f7WCNSHJ6buWTlnxJc37JJU81PBhCV2t2owhLkItSf/ni
 GDK/BhGf/vGCIAvjlfU2Bh2crxG7hXGVMeg8OFG99my9WryXPL1HaGJfO6ge1T26
 N7Gx1bFAdf2wAK3B/EpCPNBYsnS+ohJz0Cq1S8BbJXOpMLjgAsR7UB25F+DCCjp2
 pPSHq7gNKUVcgZKveiy2xRqj8F/j3eQyF9tr6K3mpx0obFzXXxFT8meMR54a6y61
 swJ6E0PI9+u9kOOo7KOAEuLBIpamgIP19BJNLmXxb+ibAjKjp/mUc7i5qBoqvCOW
 8Rab8Yc9/PP03VPY+Pm66OrryrsLciVX9vcSCA5TPXTzp5chyM9Yi9ZsP0QQEC2f
 NY7A55aL7WfHJsG28NxGEsbAf4F9ozvIFgg4FFPBq/kFlf/wZeWGtz3JPyhK84I6
 yxsQbCFDqyFCNJvwPnMVKmGuui26E7w2Ctf8pdWt+/Amnd2kHekJRDi5dx/d5RK6
 UhRrrckBBqV3C8UweQZZPPz7aBEA4SYnxwUc9w1jU9l2osbn0WAA/wuOuVEUVRAg
 5675tDnmZwVpVmxnMH8aZH0ykaNPX7jk/XLHTdDHqk8NRDUKHRk=
 =w4lD
 -----END PGP SIGNATURE-----

Merge tag 'meson-clk-4.19-1' of https://github.com/BayLibre/clk-meson into clk-meson

Pull first round of updates for meson clocks from Jerome Brunet:

 - Remove legacy register access (finish moving to syscon)
 - Clean up configuration flags
 - Add axg PCIe clocks
 - Add GEN CLK on gxbb, gxl and axg
 - Remove clk_audio_divider driver
 - Add axg audio clock controller

* tag 'meson-clk-4.19-1' of https://github.com/BayLibre/clk-meson:
  clk: meson: add gen_clk
  clk: meson: gxbb: remove HHI_GEN_CLK_CTNL duplicate definition
  clk: meson-axg: add clocks required by pcie driver
  clk: meson: remove unused clk-audio-divider driver
  clk: meson: stop rate propagation for audio clocks
  clk: meson: axg: add the audio clock controller driver
  clk: meson: add axg audio sclk divider driver
  clk: meson: add triple phase clock driver
  clk: meson: add clk-phase clock driver
  clk: meson: clean-up meson clock configuration
  clk: meson: remove obsolete register access
  clk: meson: expose GEN_CLK clkid
  clk: meson-axg: add pcie and mipi clock bindings
  dt-bindings: clock: add meson axg audio clock controller bindings
  clk: meson: audio-divider is one based
  clk: add duty cycle support
  clk: meson-gxbb: set fclk_div2 as CLK_IS_CRITICAL
2018-07-09 09:47:55 -07:00
Arnd Bergmann
5ba57babcb drm: vkms: select DRM_KMS_HELPER
Without this, we get link errors during randconfig build:

drivers/gpu/drm/vkms/vkms_drv.o:(.rodata+0xa0): undefined reference to `drm_atomic_helper_check'
drivers/gpu/drm/vkms/vkms_drv.o:(.rodata+0xa8): undefined reference to `drm_atomic_helper_commit'
drivers/gpu/drm/vkms/vkms_plane.o:(.rodata+0x0): undefined reference to `drm_atomic_helper_update_plane'
drivers/gpu/drm/vkms/vkms_plane.o:(.rodata+0x8): undefined reference to `drm_atomic_helper_disable_plane'
drivers/gpu/drm/vkms/vkms_plane.o:(.rodata+0x18): undefined reference to `drm_atomic_helper_plane_reset'
drivers/gpu/drm/vkms/vkms_plane.o:(.rodata+0x28): undefined reference to `drm_atomic_helper_plane_duplicate_state'
drivers/gpu/drm/vkms/vkms_plane.o:(.rodata+0x30): undefined reference to `drm_atomic_helper_plane_destroy_state'
drivers/gpu/drm/vkms/vkms_output.o:(.rodata+0x1c0): undefined reference to `drm_helper_probe_single_connector_modes'
drivers/gpu/drm/vkms/vkms_crtc.o:(.rodata+0x40): undefined reference to `drm_atomic_helper_crtc_reset'
drivers/gpu/drm/vkms/vkms_crtc.o:(.rodata+0x70): undefined reference to `drm_atomic_helper_set_config'
drivers/gpu/drm/vkms/vkms_crtc.o:(.rodata+0x78): undefined reference to `drm_atomic_helper_page_flip'
drivers/gpu/drm/vkms/vkms_crtc.o:(.rodata+0x90): undefined reference to `drm_atomic_helper_crtc_duplicate_state'
drivers/gpu/drm/vkms/vkms_crtc.o:(.rodata+0x98): undefined reference to `drm_atomic_helper_crtc_destroy_state'

Fixes: 854502fa0a ("drm/vkms: Add basic CRTC initialization")
Fixes: 1c7c5fd916 ("drm/vkms: Introduce basic VKMS driver")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20180709154901.1989316-1-arnd@arndb.de
2018-07-09 18:46:05 +02:00
Gregory CLEMENT
66c7bb7c41 clk: mvebu: armada-37xx-periph: switch to SPDX license identifier
Adopt the SPDX license identifier headers to ease license compliance
management.

Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2018-07-09 09:44:49 -07:00
Gregory CLEMENT
61c40f35f5 clk: mvebu: armada-37xx-periph: Fix switching CPU rate from 300Mhz to 1.2GHz
Switching the CPU from the L2 or L3 frequencies (300 and 200 Mhz
respectively) to L0 frequency (1.2 Ghz) requires a significant amount
of time to let VDD stabilize to the appropriate voltage. This amount of
time is large enough that it cannot be covered by the hardware
countdown register. Due to this, the CPU might start operating at L0
before the voltage is stabilized, leading to CPU stalls.

To work around this problem, we prevent switching directly from the
L2/L3 frequencies to the L0 frequency, and instead switch to the L1
frequency in-between. The sequence therefore becomes:

1. First switch from L2/L3(200/300MHz) to L1(600MHZ)
2. Sleep 20ms for stabling VDD voltage
3. Then switch from L1(600MHZ) to L0(1200Mhz).

It is based on the work done by Ken Ma <make@marvell.com>

Cc: stable@vger.kernel.org
Fixes: 2089dc33ea ("clk: mvebu: armada-37xx-periph: add DVFS support for cpu clocks")
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2018-07-09 09:44:06 -07:00
Florian Westphal
84379c9afe netfilter: ipv6: nf_defrag: drop skb dst before queueing
Eric Dumazet reports:
 Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
 [..]
 ./b78305423 enable_conntrack
 Then :
 sleep 60
 dmesg | tail -10
 [  171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [  181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [  191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [  201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [  211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [  221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2

Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
skb gets queued until frag timer expiry -- 1 minute.

Normally nf_conntrack_reasm gets called during prerouting, so skb has
no dst yet which might explain why this wasn't spotted earlier.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-09 18:04:12 +02:00
Andrey Ryabinin
2045cdfa1b netfilter: nf_conntrack: Fix possible possible crash on module loading.
Loading the nf_conntrack module with doubled hashsize parameter, i.e.
	  modprobe nf_conntrack hashsize=12345 hashsize=12345
causes NULL-ptr deref.

If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
will be called also twice.
The first nf_conntrack_set_hashsize() call will set the
'nf_conntrack_htable_size' variable:

	nf_conntrack_set_hashsize()
		...
		/* On boot, we can set this without any fancy locking. */
		if (!nf_conntrack_htable_size)
			return param_set_uint(val, kp);

But on the second invocation, the nf_conntrack_htable_size is already set,
so the nf_conntrack_set_hashsize() will take a different path and call
the nf_conntrack_hash_resize() function. Which will crash on the attempt
to dereference 'nf_conntrack_hash' pointer:

	BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
	RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
	Call Trace:
	 nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
	 parse_args+0x1f9/0x5a0
	 load_module+0x1281/0x1a50
	 __se_sys_finit_module+0xbe/0xf0
	 do_syscall_64+0x7c/0x390
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix this, by checking !nf_conntrack_hash instead of
!nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
after the module loaded, so the second invocation of the
nf_conntrack_set_hashsize() won't crash, it will just reinitialize
nf_conntrack_htable_size again.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-07-09 18:04:11 +02:00
Florian Fainelli
3be77fe8c3 Merge tag 'bcm2835-dt-next-2018-07-03' into devicetree/next
This pull request brings in a board DT for the Raspberry Pi Compute
Module and its I/O board, the Pi3's PMU node, and the display's
transposer block.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:15 -07:00
Vivek Unune
2bebdfcdcd ARM: dts: BCM5301X: Add support for Linksys EA9500
Hardware Info
-------------

Processor	- Broadcom BCM4709C0KFEBG dual-core @ 1.4 GHz
Switch		- BCM53012 in BCM4709C0KFEBG & external BCM53125
DDR3 RAM	- 256 MB
Flash		- 128 MB (Toshiba TC58BVG0S3HTA00)
2.4GHz		- BCM4366 4×4 2.4/5G single chip 802.11ac SoC
Power Amp	- Skyworks SE2623L 2.4 GHz power amp (x4)
5GHz x 2	- BCM4366 4×4 2.4/5G single chip 802.11ac SoC
Power Amp	- PLX Technology PEX8603 3-lane, 3-port PCIe switch
Ports		- 8 Ports, 1 WAN Ports
Antennas	- 8 Antennas
Serial Port	- @J6 [GND,TX,RX] (VCC NC)    115200 8n1

Tested with OpenWrt built with DSA driver and Kernel v4.14

Signed-off-by: Vivek Unune <npcomplete13@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:13 -07:00
Rafał Miłecki
a21e754843 ARM: dts: BCM53573: Add architected timer
It's a standard ARM architected timer that was simply missed when
initially adding this .dtsi file.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:12 -07:00
Vivek Unune
37f6130ec3 ARM: dts: BCM5301X: Make USB 3.0 PHY use MDIO PHY driver
Currently, the USB 3.0 PHY in bcm5301x.dtsi uses platform driver which
requires register range "ccb-mii" <0x18003000 0x1000>. This range
overlaps with MDIO cmd and param registers (<0x18003000 0x8>).
Essentially, the platform driver partly acts like a MDIO bus driver,
hence to use of this register range.

In some Northstar devices like Linksys EA9500, secondary switch is
connected via external MDIO. The only way to access and configure the
external switch is via MDIO bus. When we enable the MDIO bus in it's
current state, the MDIO bus and any child buses fail to register because
of the register range overlap.

On Northstar, the USB 3.0 PHY is connected at address 0x10 on the
internal MDIO bus. This change moves the usb3_phy node and makes it a
child node of internal MDIO bus.

Thanks to Rafał Miłecki's commit af850e14a7 ("phy: bcm-ns-usb3: add
MDIO driver using proper bus layer") the same USB 3.0 platform driver
can now act as USB 3.0 PHY MDIO driver.

Tested on Linksys Panamera (EA9500)

Signed-off-by: Vivek Unune <npcomplete13@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:11 -07:00
Mohamed Ismail Abdul Packir Mohamed
a08e950de6 ARM: dts: cygnus: enable iproc-hwrng
Enable the HW rng driver "iproc-rng200" for all cygnus platforms.

Signed-off-by: Mohamed Ismail Abdul Packir Mohamed <mohamed-ismail.abdul@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Tested-by: Clément Péron <peron.clem@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:10 -07:00
Clément Péron
00d1ae3840 ARM: dts: cygnus: add ethernet0 alias
In order to avoid Linux generating a random mac address on every boot,
add an ethernet0 alias that will allow u-boot to patch the dtb with
the MAC address.

Signed-off-by: Clément Péron <peron.clem@gmail.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
2018-07-09 08:12:01 -07:00
Boris Brezillon
b7dd29b401 ARM: dts: bcm283x: Add Transposer block
The transposer block is allowing one to write the result of the VC4
composition back to memory instead of displaying it on a screen.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
2018-07-09 08:10:20 -07:00
Eric Anholt
b8ccf02a50 ARM: dts: bcm283x: Add the PMU to the devicetree.
This only probes on arm64 so far, but hopefully that driver will be
generalized soon.

Signed-off-by: Eric Anholt <eric@anholt.net>
Acked-by: Stefan Wahren <stefan.wahren@i2se.com>
2018-07-09 08:10:08 -07:00
Randy Dunlap
3993e501bf block/DAC960.c: fix defined but not used build warnings
Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.

../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:55 -06:00
Matias Bjørling
ca4b2a0119 null_blk: add zone support
Adds support for exposing a null_blk device through the zone device
interface.

The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.

Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:55 -06:00
Matias Bjørling
6dad38d38f null_blk: move shared definitions to header file
Split the null_blk device driver, such that it can prepare for
zoned block interface support.

Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Geert Uytterhoeven
e9a8385330 block: Add default switch case to blk_pm_allow_request() to kill warning
With gcc 4.9.0 and 7.3.0:

    block/blk-core.c: In function 'blk_pm_allow_request':
    block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
      switch (rq->q->rpm_status) {
      ^

Convert the return statement below the switch() block into a default
case to fix this.

Fixes: e4f36b249b ("block: fix peeking requests during PM")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Mikulas Patocka
b88aef36b8 block: fix infinite loop if the device loses discard capability
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.

This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Shakeel Butt
c137969bd4 block, mm: remove unnecessary __GFP_HIGH flag
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Liu Bo
00a8cdb84f null_blk: remove NULLB_DEV_FL_CONFIGURED on turning off nullb device
Currently mbps knob could only be set once before switching power knob to
on, after power knob has been set at least once, there is no way to set
mbps knob again due to -EBUSY.

As nullb is mainly used for testing, in order to make it flexible, this
removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
when power knob is off, e.g.

echo 0 > /config/nullb/a/power
echo 40 > /config/nullb/a/mbps
echo 1 > /config/nullb/a/power

So does other knobs under /config/nullb/a.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
ca47e8c72a mm: skip readahead if the cgroup is congested
We noticed in testing we'd get pretty bad latency stalls under heavy
pressure because read ahead would try to do its thing while the cgroup
was under severe pressure.  If we're under this much pressure we want to
do as little IO as possible so we can still make progress on real work
if we're a throttled cgroup, so just skip readahead if our group is
under pressure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
b351f0c76c Documentation: add a doc for blk-iolatency
A basic documentation to describe the interface, statistics, and
behavior of io.latency.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
d706751215 block: introduce blk-iolatency io controller
Current IO controllers for the block layer are less than ideal for our
use case.  The io.max controller is great at hard limiting, but it is
not work conserving.  This patch introduces io.latency.  You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets.  This makes use of
the rq-qos infrastructure and works much like the wbt stuff.  There are
a few differences from wbt

 - It's bio based, so the latency covers the whole block layer in addition to
   the actual io.
 - We will throttle all IO types that comes in here if we need to.
 - We use the mean latency over the 100ms window.  This is because writes can
   be particularly fast, which could give us a false sense of the impact of
   other workloads on our protected workload.
 - By default there's no throttling, we set the queue_depth to INT_MAX so that
   we can have as many outstanding bio's as we're allowed to.  Only at
   throttle time do we pay attention to the actual queue depth.
 - We backcharge cgroups for root cg issued IO and induce artificial
   delays in order to deal with cases like metadata only or swap heavy
   workloads.

In testing this has worked out relatively well.  Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).

Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group.  We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.

Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all.  With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
67b42d0bf7 rq-qos: introduce dio_bio callback
wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
c1c80384c8 block: remove external dependency on wbt_flags
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
a79050434b blk-rq-qos: refactor out common elements of blk-wbt
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis.  Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
2ecbf45635 blk-stat: export helpers for modifying blk_rq_stat
We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo
2cf855837b memcontrol: schedule throttling if we are congested
Memory allocations can induce swapping via kswapd or direct reclaim.  If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling.  So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
d09d8df3a2 blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies.  The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources.  Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.

This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Tejun Heo
0d3bd88d54 swap,blkcg: issue swap io with the appropriate context
For backcharging we need to know who the page belongs to when swapping
it out.  We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
0d1e0c7cd5 blk: introduce REQ_SWAP
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups.  Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
903d23f0a3 blk-cgroup: allow controllers to output their own stats
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.

Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:54 -06:00
Josef Bacik
c7c98fd376 block: introduce bio_issue_as_root_blkg
Instead of forcing all file systems to get the right context on their
bio's, simply check for REQ_META to see if we need to issue as the root
blkg.  We don't want to force all bio's to have the root blkg associated
with them if REQ_META is set, as some controllers (blk-iolatency) need
to know who the originating cgroup is so it can backcharge them for the
work they are doing.  This helper will make sure that the controllers do
the proper thing wrt the IO priority and backcharging.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Josef Bacik
08e18eab0c block: add bi_blkg to the bio for cgroups
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well.  Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.

Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio.  This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.

So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
6e76871730 blk-mq: dequeue request one by one from sw queue if hctx is busy
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.

This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Gustavo A. R. Silva
d893ff8603 block/loop: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Gustavo A. R. Silva
d769a99296 drbd: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Warning level 2 was used in this case: -Wimplicit-fallthrough=2

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
b04f50ab8a blk-mq: only attempt to merge bio if there is rq in sw queue
Only attempt to merge bio iff the ctx->rq_list isn't empty, because:

1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock

2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Ming Lei
3f0cedc7e9 blk-mq: use list_splice_tail_init() to insert requests
list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Minwoo Im
c018c84fdb blk-mq: fix typo in a function comment
Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Minwoo Im
0da73d00ca blk-mq: code clean-up by adding an API to clear set->mq_map
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c.  It's also cleared before
updating a queue map in blk_mq_update_queue_map().

This patch provides an API to clear set->mq_map to make it clear.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Colin Ian King
5efac89c84 paride: remove redundant variable n
Variable n is being assigned but is never used hence it is redundant
and can be removed. Also put spacing between variables in declaration
to clean up checkpatch warnings.

Cleans up clang warning:
warning: variable 'n' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Colin Ian King
e84422cdf3 partitions/ldm: remove redundant pointer dgrp
Pointer dgrp is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
warning: variable 'dgrp' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00
Colin Ian King
f4354a94e2 loop: remove redundant pointer inode
Pointer inode is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
warning: variable 'inode' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09 09:07:53 -06:00