During system reboot or halt, with lockdep enabled:
================================
WARNING: inconsistent lock state
4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41 Tainted: G W
--------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
reboot/2779 [HC0[0]:SC0[0]:HE1:SE1] takes:
0000000098ae4ad3 (&(&rchan->lock)->rlock){?.-.}, at: rcar_dmac_shutdown+0x58/0x6c
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0x208/0x238
_raw_spin_lock+0x40/0x54
rcar_dmac_isr_channel+0x28/0x200
__handle_irq_event_percpu+0x1c0/0x3c8
handle_irq_event_percpu+0x34/0x88
handle_irq_event+0x48/0x78
handle_fasteoi_irq+0xc4/0x12c
generic_handle_irq+0x18/0x2c
__handle_domain_irq+0xa8/0xac
gic_handle_irq+0x78/0xbc
el1_irq+0xec/0x1c0
arch_cpu_idle+0xe8/0x1bc
default_idle_call+0x2c/0x30
do_idle+0x144/0x234
cpu_startup_entry+0x20/0x24
rest_init+0x27c/0x290
start_kernel+0x430/0x45c
irq event stamp: 12177
hardirqs last enabled at (12177): [<ffffff800881d804>] _raw_spin_unlock_irq+0x2c/0x4c
hardirqs last disabled at (12176): [<ffffff800881d638>] _raw_spin_lock_irq+0x1c/0x60
softirqs last enabled at (11948): [<ffffff8008081da8>] __do_softirq+0x160/0x4ec
softirqs last disabled at (11935): [<ffffff80080ec948>] irq_exit+0xa0/0xfc
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&rchan->lock)->rlock);
<Interrupt>
lock(&(&rchan->lock)->rlock);
*** DEADLOCK ***
3 locks held by reboot/2779:
#0: 00000000bfabfa74 (reboot_mutex){+.+.}, at: sys_reboot+0xdc/0x208
#1: 00000000c75d8c3a (&dev->mutex){....}, at: device_shutdown+0xc8/0x1c4
#2: 00000000ebec58ec (&dev->mutex){....}, at: device_shutdown+0xd8/0x1c4
stack backtrace:
CPU: 6 PID: 2779 Comm: reboot Tainted: G W 4.18.0-rc1-salvator-x-00002-g9203dbec90a68103 #41
Hardware name: Renesas Salvator-X 2nd version board based on r8a7795 ES2.0+ (DT)
Call trace:
dump_backtrace+0x0/0x148
show_stack+0x14/0x1c
dump_stack+0xb0/0xf0
print_usage_bug.part.26+0x1c4/0x27c
mark_lock+0x38c/0x610
__lock_acquire+0x3fc/0x14d4
lock_acquire+0x208/0x238
_raw_spin_lock+0x40/0x54
rcar_dmac_shutdown+0x58/0x6c
platform_drv_shutdown+0x20/0x2c
device_shutdown+0x160/0x1c4
kernel_restart_prepare+0x34/0x3c
kernel_restart+0x14/0x5c
sys_reboot+0x160/0x208
el0_svc_naked+0x30/0x34
rcar_dmac_stop_all_chan() takes the channel lock while stopping a
channel, but does not disable interrupts, leading to a deadlock when a
DMAC interrupt comes in. Before, the same code block was called from an
interrupt handler, hence taking the spinlock was sufficient.
Fix this by disabling local interrupts while taking the spinlock.
Fixes: 9203dbec90 ("dmaengine: rcar-dmac: don't use DMAC error interrupt")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Colibri-T20 can come in 256 MB RAM (with 512 MB NAND) or 512 MB RAM
(with 1024 MB NAND) flavors. Both of them will use the same DTSI
expecting the bootloader to do the fixup of /memory node. However in
case it does not happen, let's stay on safe side by limiting the memory
to 256 MB for both versions of Colibri-T20.
Rename to remove the unnecessary memory size from the device tree file
name. While at it, also follow the typical Toradex SoC, module, carrier
board hierarchy.
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Tested-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Remove unneeded address/size cells properties and unit addresses to fix
DTC warnings like:
arch/arm/boot/dts/tegra30-apalis-eval.dtb: Warning (unit_address_vs_reg):
/i2c@7000d000/stmpe811@41/stmpe_touchscreen@0: node has a unit name, but no reg property
arch/arm/boot/dts/tegra30-apalis-eval.dtb: Warning (avoid_unnecessary_addr_size):
/i2c@7000d000/stmpe811@41: unnecessary #address-cells/#size-cells without "ranges" or child "reg" property
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add a generic /memory node in each Tegra DTSI (with empty reg property,
to be overidden by each DTS) and set proper unit address for /memory
nodes to fix the DTC warnings:
arch/arm/boot/dts/tegra20-harmony.dtb: Warning (unit_address_vs_reg):
/memory: node has a reg or ranges property, but no unit name
The DTB after the change is the same as before except adding
unit-address to /memory node.
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Remove the usage of skeleton.dtsi because it was deprecated since commit
9c0da3cc61 ("ARM: dts: explicitly mark skeleton.dtsi as deprecated").
It also allows later to fix DTC warnings for missing unit name in
/memory nodes.
Compiled DTBs are the same as before this commit.
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Switching the CPU from the L2 or L3 frequencies (300 and 200 Mhz
respectively) to L0 frequency (1.2 Ghz) requires a significant amount
of time to let VDD stabilize to the appropriate voltage. This amount of
time is large enough that it cannot be covered by the hardware
countdown register. Due to this, the CPU might start operating at L0
before the voltage is stabilized, leading to CPU stalls.
To work around this problem, we prevent switching directly from the
L2/L3 frequencies to the L0 frequency, and instead switch to the L1
frequency in-between. The sequence therefore becomes:
1. First switch from L2/L3(200/300MHz) to L1(600MHZ)
2. Sleep 20ms for stabling VDD voltage
3. Then switch from L1(600MHZ) to L0(1200Mhz).
It is based on the work done by Ken Ma <make@marvell.com>
Cc: stable@vger.kernel.org
Fixes: 2089dc33ea ("clk: mvebu: armada-37xx-periph: add DVFS support for cpu clocks")
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Eric Dumazet reports:
Here is a reproducer of an annoying bug detected by syzkaller on our production kernel
[..]
./b78305423 enable_conntrack
Then :
sleep 60
dmesg | tail -10
[ 171.599093] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 181.631024] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 191.687076] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 201.703037] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 211.711072] unregister_netdevice: waiting for lo to become free. Usage count = 2
[ 221.959070] unregister_netdevice: waiting for lo to become free. Usage count = 2
Reproducer sends ipv6 fragment that hits nfct defrag via LOCAL_OUT hook.
skb gets queued until frag timer expiry -- 1 minute.
Normally nf_conntrack_reasm gets called during prerouting, so skb has
no dst yet which might explain why this wasn't spotted earlier.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Loading the nf_conntrack module with doubled hashsize parameter, i.e.
modprobe nf_conntrack hashsize=12345 hashsize=12345
causes NULL-ptr deref.
If 'hashsize' specified twice, the nf_conntrack_set_hashsize() function
will be called also twice.
The first nf_conntrack_set_hashsize() call will set the
'nf_conntrack_htable_size' variable:
nf_conntrack_set_hashsize()
...
/* On boot, we can set this without any fancy locking. */
if (!nf_conntrack_htable_size)
return param_set_uint(val, kp);
But on the second invocation, the nf_conntrack_htable_size is already set,
so the nf_conntrack_set_hashsize() will take a different path and call
the nf_conntrack_hash_resize() function. Which will crash on the attempt
to dereference 'nf_conntrack_hash' pointer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: 0010:nf_conntrack_hash_resize+0x255/0x490 [nf_conntrack]
Call Trace:
nf_conntrack_set_hashsize+0xcd/0x100 [nf_conntrack]
parse_args+0x1f9/0x5a0
load_module+0x1281/0x1a50
__se_sys_finit_module+0xbe/0xf0
do_syscall_64+0x7c/0x390
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix this, by checking !nf_conntrack_hash instead of
!nf_conntrack_htable_size. nf_conntrack_hash will be initialized only
after the module loaded, so the second invocation of the
nf_conntrack_set_hashsize() won't crash, it will just reinitialize
nf_conntrack_htable_size again.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This pull request brings in a board DT for the Raspberry Pi Compute
Module and its I/O board, the Pi3's PMU node, and the display's
transposer block.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
It's a standard ARM architected timer that was simply missed when
initially adding this .dtsi file.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Currently, the USB 3.0 PHY in bcm5301x.dtsi uses platform driver which
requires register range "ccb-mii" <0x18003000 0x1000>. This range
overlaps with MDIO cmd and param registers (<0x18003000 0x8>).
Essentially, the platform driver partly acts like a MDIO bus driver,
hence to use of this register range.
In some Northstar devices like Linksys EA9500, secondary switch is
connected via external MDIO. The only way to access and configure the
external switch is via MDIO bus. When we enable the MDIO bus in it's
current state, the MDIO bus and any child buses fail to register because
of the register range overlap.
On Northstar, the USB 3.0 PHY is connected at address 0x10 on the
internal MDIO bus. This change moves the usb3_phy node and makes it a
child node of internal MDIO bus.
Thanks to Rafał Miłecki's commit af850e14a7 ("phy: bcm-ns-usb3: add
MDIO driver using proper bus layer") the same USB 3.0 platform driver
can now act as USB 3.0 PHY MDIO driver.
Tested on Linksys Panamera (EA9500)
Signed-off-by: Vivek Unune <npcomplete13@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
In order to avoid Linux generating a random mac address on every boot,
add an ethernet0 alias that will allow u-boot to patch the dtb with
the MAC address.
Signed-off-by: Clément Péron <peron.clem@gmail.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
The transposer block is allowing one to write the result of the VC4
composition back to memory instead of displaying it on a screen.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
This only probes on arm64 so far, but hopefully that driver will be
generalized soon.
Signed-off-by: Eric Anholt <eric@anholt.net>
Acked-by: Stefan Wahren <stefan.wahren@i2se.com>
Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.
../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Adds support for exposing a null_blk device through the zone device
interface.
The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split the null_blk device driver, such that it can prepare for
zoned block interface support.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With gcc 4.9.0 and 7.3.0:
block/blk-core.c: In function 'blk_pm_allow_request':
block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
switch (rq->q->rpm_status) {
^
Convert the return statement below the switch() block into a default
case to fix this.
Fixes: e4f36b249b ("block: fix peeking requests during PM")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.
This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Zdenek Kabelac <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to
explicitly or __GFP_HIGH again. So, just remove unnecessary __GFP_HIGH.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently mbps knob could only be set once before switching power knob to
on, after power knob has been set at least once, there is no way to set
mbps knob again due to -EBUSY.
As nullb is mainly used for testing, in order to make it flexible, this
removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
when power knob is off, e.g.
echo 0 > /config/nullb/a/power
echo 40 > /config/nullb/a/mbps
echo 1 > /config/nullb/a/power
So does other knobs under /config/nullb/a.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We noticed in testing we'd get pretty bad latency stalls under heavy
pressure because read ahead would try to do its thing while the cgroup
was under severe pressure. If we're under this much pressure we want to
do as little IO as possible so we can still make progress on real work
if we're a throttled cgroup, so just skip readahead if our group is
under pressure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A basic documentation to describe the interface, statistics, and
behavior of io.latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt
- It's bio based, so the latency covers the whole block layer in addition to
the actual io.
- We will throttle all IO types that comes in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
be particularly fast, which could give us a false sense of the impact of
other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
we can have as many outstanding bio's as we're allowed to. Only at
throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
delays in order to deal with cases like metadata only or swap heavy
workloads.
In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).
Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.
Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to use blk_rq_stat in the blkcg qos stuff, so export some of
these helpers so they can be used by other things.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Memory allocations can induce swapping via kswapd or direct reclaim. If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling. So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies. The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources. Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.
This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP. We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For backcharging we need to know who the page belongs to when swapping
it out. We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.
Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of forcing all file systems to get the right context on their
bio's, simply check for REQ_META to see if we need to issue as the root
blkg. We don't want to force all bio's to have the root blkg associated
with them if REQ_META is set, as some controllers (blk-iolatency) need
to know who the originating cgroup is so it can backcharge them for the
work they are doing. This helper will make sure that the controllers do
the proper thing wrt the IO priority and backcharging.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well. Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.
Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio. This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.
So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.
This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.
Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Warning level 2 was used in this case: -Wimplicit-fallthrough=2
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only attempt to merge bio iff the ctx->rq_list isn't empty, because:
1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock
2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fix typo in a function blk_mq_alloc_tag_set() comment.
if if it too large -> if it's too large.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c. It's also cleared before
updating a queue map in blk_mq_update_queue_map().
This patch provides an API to clear set->mq_map to make it clear.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Variable n is being assigned but is never used hence it is redundant
and can be removed. Also put spacing between variables in declaration
to clean up checkpatch warnings.
Cleans up clang warning:
warning: variable 'n' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pointer dgrp is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'dgrp' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pointer inode is being assigned but is never used hence it is redundant
and can be removed.
Cleans up clang warning:
warning: variable 'inode' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>