add support for new CSR atlas7 SoC. atlas7 exists V1 and V2 IP.
atlas7 DMAv1 is basically moved from marco, which has never been
delivered to customers and renamed in this patch.
atlas7 DMAv2 supports chain DMA by a chain table, this
patch also adds chain DMA support for atlas7.
atlas7 DMAv1 and DMAv2 co-exist in the same chip. there are some HW
configuration differences(register offset etc.) with old prima2 chips,
so we use compatible string to differentiate old prima2 and new atlas7,
then results in different set in HW for them.
Signed-off-by: Hao Liu <Hao.Liu@csr.com>
Signed-off-by: Yanchang Li <Yanchang.Li@csr.com>
Signed-off-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
acpi_video_unregister() not only unregisters the acpi-video backlight
interface but also unregisters the acpi video bus event listener, causing
e.g. brightness hotkey presses to no longer generate keypress events.
The unregistering of the acpi video bus event listener usually is
undesirable, which by itself is a good reason to switch to
acpi_video_unregister_backlight().
Another problem with using acpi_video_unregister() rather then using
acpi_video_unregister_backlight() is that on systems with an intel video
opregion (most systems) and a broken_acpi_video quirk, whether or not
the acpi video bus event listener actually gets unregistered depends on
module load ordering:
Scenario a:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) intel.ko gets loaded, calls acpi_video_register() which registers both
the listener and the acpi backlight interface
3) samsung-laptop.ko gets loaded, calls acpi_video_unregister() causing
both the listener and the acpi backlight interface to unregister
Scenario b:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) samsung-laptop.ko gets loaded, calls acpi_video_dmi_promote_vendor(),
calls acpi_video_unregister(), which is a nop since acpi_video_register
has not yet been called
2) intel.ko gets loaded, calls acpi_video_register() which registers
the listener, but does not register the acpi backlight interface due to
the call to the preciding call to acpi_video_dmi_promote_vendor()
*) acpi/video.ko always loads first as both other modules depend on it.
So we end up with or without an acpi video bus event listener depending
on module load ordering, not good.
Switching to using acpi_video_unregister_backlight() means that independ
of ordering we will always have an acpi video bus event listener fixing
this.
Note that this commit means that systems without an intel video opregion,
and systems which were hitting scenario a wrt module load ordering, are
now getting an acpi video bus event listener while before they were not!
On some systems this may cause the brightness hotkeys to start generating
keypresses while before they were not (good), while on other systems this
may cause the brightness hotkeys to generate multiple keypress events for
a single press (not so good). Since on most systems the acpi video bus is
the canonical source for brightness events I believe that the latter case
will needs to be handled on a case by case basis by filtering out the
duplicate keypresses at the other source for them.
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Corentin Chary <corentin.chary@gmail.com)
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
acpi_video_unregister() not only unregisters the acpi-video backlight
interface but also unregisters the acpi video bus event listener, causing
e.g. brightness hotkey presses to no longer generate keypress events.
The unregistering of the acpi video bus event listener usually is
undesirable, which by itself is a good reason to switch to
acpi_video_unregister_backlight().
Another problem with using acpi_video_unregister() rather then using
acpi_video_unregister_backlight() is that on systems with an intel video
opregion (most systems) and a wmi_backlight_power quirk, whether or not
the acpi video bus event listener actually gets unregistered depends on
module load ordering:
Scenario a:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) intel.ko gets loaded, calls acpi_video_register() which registers both
the listener and the acpi backlight interface
3) asus-wmi.ko gets loaded, calls acpi_video_unregister() causing both
the listener and the acpi backlight interface to unregister
Scenario b:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) asus-wmi.ko gets loaded, calls acpi_video_dmi_promote_vendor(),
calls acpi_video_unregister(), which is a nop since acpi_video_register
has not yet been called
2) intel.ko gets loaded, calls acpi_video_register() which registers
the listener, but does not register the acpi backlight interface due to
the call to the preciding call to acpi_video_dmi_promote_vendor()
*) acpi/video.ko always loads first as both other modules depend on it.
So we end up with or without an acpi video bus event listener depending
on module load ordering, not good.
Switching to using acpi_video_unregister_backlight() means that independ
of ordering we will always have an acpi video bus event listener fixing
this.
Note that this commit means that systems without an intel video opregion,
and systems which were hitting scenario a wrt module load ordering, are
now getting an acpi video bus event listener while before they were not!
On some systems this may cause the brightness hotkeys to start generating
keypresses while before they were not (good), while on other systems this
may cause the brightness hotkeys to generate multiple keypress events for
a single press (not so good). Since on most systems the acpi video bus is
the canonical source for brightness events I believe that the latter case
will needs to be handled on a case by case basis by filtering out the
duplicate keypresses at the other source for them.
Cc: acpi4asus-user@lists.sourceforge.net
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Acked-by: Corentin Chary <corentin.chary@gmail.com)
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
acpi_video_unregister() not only unregisters the acpi-video backlight
interface but also unregisters the acpi video bus event listener, causing
e.g. brightness hotkey presses to no longer generate keypress events.
The unregistering of the acpi video bus event listener usually is
undesirable, which by itself is a good reason to switch to
acpi_video_unregister_backlight().
Another problem with using acpi_video_unregister() rather then using
acpi_video_unregister_backlight() is that on systems with an intel video
opregion (most systems) whether or not the acpi video bus event listener
actually gets unregistered depends on module load ordering:
Scenario a:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) intel.ko gets loaded, calls acpi_video_register() which registers both
the listener and the acpi backlight interface
3) apple-gmux.ko gets loaded, calls acpi_video_unregister() causing both
the listener and the acpi backlight interface to unregister
Scenario b:
1) acpi/video.ko gets loaded (*), does not do acpi_video_register as there
is an intel opregion.
2) apple-gmux.ko gets loaded, calls acpi_video_dmi_promote_vendor(),
calls acpi_video_unregister(), which is a nop since acpi_video_register
has not yet been called
2) intel.ko gets loaded, calls acpi_video_register() which registers
the listener, but does not register the acpi backlight interface due to
the call to the preciding call to acpi_video_dmi_promote_vendor()
*) acpi/video.ko always loads first as both other modules depend on it.
So we end up with or without an acpi video bus event listener depending
on module load ordering, not good.
Switching to using acpi_video_unregister_backlight() means that independ
of ordering we will always have an acpi video bus event listener fixing
this.
Note that this commit means that systems without an intel video opregion,
and systems which were hitting scenario a wrt module load ordering, are
now getting an acpi video bus event listener while before they were not!
On some systems this may cause the brightness hotkeys to start generating
keypresses while before they were not (good), while on other systems this
may cause the brightness hotkeys to generate multiple keypress events for
a single press (not so good). Since on most systems the acpi video bus is
the canonical source for brightness events I believe that the latter case
will needs to be handled on a case by case basis by filtering out the
duplicate keypresses at the other source for them.
Cc: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
pvpanic was not properly detected when _STA was missing.
ACPI 6.0 April 2015, 6.3.7 _STA (Status)
If a device object (including the processor object) does not have an
_STA object, then OSPM assumes that all of the above bits are set
(i.e., the device is present, enabled, shown in the UI, and
functioning).
Not adhering to the specification made pvpanic dormant under QEMU 2.3.
The original patch used acpi_bus_get_status_handle, which was not
being exported, so module build blew up; switch to acpi_bus_get_status
and use the status it populates.
Populated status is a bitfield so we can make the code self-documenting.
We do not check 'present' because 'enabled' has to be false in that case
by specification. Older QEMUs set 0xff to status and newer ones do 0xb.
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
[dvhart@linux.intel.com: Merge acpi_bug_get_status fix to avoid bisect breakage]
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Fix static checker warnings in the flow of system guid query.
Fixes: 707c4602cd ('net/mlx5_core: Add new query HCA vport commands')
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the driver gets unregistered a call to netif_napi_del() was
missing, this all was also missing in the error paths of
b44_init_one().
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to disable softirqs because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. The spin locks in br_fdb_update were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Using local_bh_disable/enable around br_fdb_update() allows us to keep
using the spin_lock/unlock in br_fdb_update for the fast-path.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
In big endian cases, macro cpu_to_le16 unfolds to __swab16 which
provides special case for constants. In little endian cases,
__constant_cpu_to_le16 and cpu_to_le16 expand directly to the
same expression. So, replace __constant_cpu_to_le16 with
cpu_to_le16 with the goal of getting rid of the definition of
__constant_cpu_to_le16 completely.
The semantic patch that performs this transformation is as follows:
@@expression x;@@
- __constant_cpu_to_le16(x)
+ cpu_to_le16(x)
Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove unneeded semicolon.
The semantic patch that makes this change is available
in scripts/coccinelle/misc/semicolon.cocci.
More information about semantic patching is available at
http://coccinelle.lip6.fr/
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Peter Chen <peter.chen@freescale.com>
The mpls device is used in an RCU read context without a lock being
held. As the memory is freed without waiting for the RCU grace period
to elapse, the freed memory could still be in use.
Address this by using kfree_rcu to free the memory for the mpls device
after the RCU grace period has elapsed.
Fixes: 03c57747a7 ("mpls: Per-device MPLS state")
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
use time_is_before_eq_jiffies macro for time comparison
Signed-off-by: Antonio Murdaca <antonio.murdaca@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull MIPS updates from Ralf Baechle:
"Eight fixes across arch/mips. Nothing stands particuarly out nor is
complicated but fixes keep coming in at a higher than comfortable
rate"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: KVM: Do not sign extend on unsigned MMIO load
MIPS: BPF: Fix stack pointer allocation
MIPS: Loongson-3: Fix a cpu-hotplug issue in loongson3_ipi_interrupt()
MIPS: Fix enabling of DEBUG_STACKOVERFLOW
MIPS: c-r4k: Fix typo in probe_scache()
MIPS: Avoid an FPE exception in FCSR mask probing
MIPS: ath79: Add a missing new line in log message
MIPS: ralink: Fix clearing the illegal access interrupt
There are several places in the driver (all in control paths) where
coherent dma memory is being allocated using either dma_alloc_coherent()
or the deprecated pci_alloc_consistent(). All these calls should be
changed to use dma_zalloc_coherent() to avoid uninitialized fields in
data structures backed by this memory.
Reported-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@avagotech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current driver adjust freq formula is:
fe * diff = ppb * pc
Note:
fe: ENET ref clock frequency in Hz
diff = inc_corr - inc: difference between default increment and correction increment
ppb: parts per billion adjustment from base
pc: correction period (in number of fe clock cycles)
The correction increment will be used after N cycles of regular increments,
not every N cycles (with N being the correction period). For example, set ENET_ATCOR=4,
INC=8, INC_CORR=9, there will be 4 increments of 8 (ENET_ATINC[INC]) , followed by 1
increment of 9 (ENET_ATINC[INC_CORR]).
So, the correct formula is:
fe * diff = ppb * (pc + 1)
For ENET_ATCOR, a value 0 disables the correction counter and no corrections occur.
So base on the origin formula, set pc = pc > 1 ? pc - 1 : pc.
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Frank Li <Frank.Li@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
br_fdb_update() can be called in process context in the following way:
br_fdb_add() -> __br_fdb_add() -> br_fdb_update() (if NTF_USE flag is set)
so we need to use spin_lock_bh because there are softirq users of the
hash_lock. One easy way to reproduce this is to modify the bridge utility
to set NTF_USE, enable stp and then set maxageing to a low value so
br_fdb_cleanup() is called frequently and then just add new entries in
a loop. This happens because br_fdb_cleanup() is called from timer/softirq
context. These locks were _bh before commit f8ae737dee
("[BRIDGE]: forwarding remove unneeded preempt and bh diasables")
and at the time that commit was correct because br_fdb_update() couldn't be
called from process context, but that changed after commit:
292d139898 ("bridge: add NTF_USE support")
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 292d139898 ("bridge: add NTF_USE support")
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove sparse warnings:
drivers/net/ethernet/xilinx/ll_temac_main.c:65:16: warning: cast removes
address space of expression
drivers/net/ethernet/xilinx/ll_temac_main.c:70:9: warning: cast removes
address space of expression
drivers/net/ethernet/xilinx/ll_temac_main.c:127:16: warning: cast
removes address space of expression
drivers/net/ethernet/xilinx/ll_temac_main.c:137:9: warning: cast removes
address space of expression
drivers/net/ethernet/xilinx/ll_temac_main.c:409:3: warning: symbol
'temac_options' was not declared. Should it be static?
drivers/net/ethernet/xilinx/ll_temac_main.c:590:6: warning: symbol
'temac_adjust_link' was not declared. Should it be static?
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv4 and IPv6 share same implementation of get_cookie_sock(),
and there is no point inlining it.
We add tcp_ prefix to the common helper name and export it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dump useful ring statistics along with interrupt status, software
maintained pointers and hardware registers to help troubleshoot TX queue
stalls.
When a timeout occurs, disable TX NAPI for the rings, dump their states
while interrupts are disabled, re-enable interrupts, NAPI and queue flow
control to help with the recovery.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool -S on a DSA interface can deadlock for some switches because
the same lock is taken twice. Use the register read function which
expects the lock to be already held.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 31888234b7 ("net: dsa: mv88e6xxx: Replace stats mutex with SMI mutex")
Signed-off-by: David S. Miller <davem@davemloft.net>
Some sensors like the LIS331DL only support 8bit data by a single
register per axis. These utilize the MSB byte. Make it possible
to register these apropriately.
A oneliner change is needed in the ST sensors core to handle 8bit
reads as this is the first supported 8bit sensor.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Denis Ciocca <denis.ciocca@st.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Support configurable conversion mode through sysfs. So far, the
mode used was low-power, which is enabled by default now. Beside
that, the modes normal and high-speed are selectable as well.
Use the new device tree property which specifies the maximum ADC
conversion clock frequencies. Depending on the mode used, the
available resulting conversion frequency are calculated
dynamically.
Acked-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The @chan parameter can be 0 or 1 and not a bit mask. Fix wrong description.
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Marek Belisko <marek@goldelico.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The bit mask to read the setting of the constant current source
for measuring the NTC voltage was the wrong one. Since default
value is initialized to the lowest level (000 = 10uA) the difference
was probably never noticed in practice.
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Signed-off-by: Marek Belisko <marek@goldelico.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Implement interrupt driven trigger for data ready.
This allows more efficient access to the sample data.
Signed-off-by: Martin Fuzzey <mfuzzey@parkeon.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Allow the cutoff frequency of the high pass filter to be configured.
Signed-off-by: Martin Fuzzey <mfuzzey@parkeon.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Allow the debouce counter for transient events to be configured
using the sysfs attribute events/in_accel_thresh_rising_period
Signed-off-by: Martin Fuzzey <mfuzzey@parkeon.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The event is triggered when the highpass filtered absolute acceleration
exceeds the threshold.
Signed-off-by: Martin Fuzzey <mfuzzey@parkeon.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
This patch adds compensation formula to raw readings, borrowed
from Memsic's input driver.
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Datasheet says (Page 2) that typical value for sensitivity
for 16 bits mode on Z-axis is 770. Anyhow, looking at the
input driver provided by Memsic the value for MMC35240 is
1024.
Also, testing shows that using 1024 for Z-axis senzitivity
offers better results.
Fixes: abeb6b1e7b ("iio: magnetometer: Add support for MEMSIC MMC35240")
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
According to datasheet, Page 8, minimum wait time to complete
measurement is 10ms. Adjusting this value will increase the
userspace polling rate.
Fixes: abeb6b1e7b ("iio: magnetometer: Add support for MEMSIC MMC35240")
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The current computation for fractional part of the magnetic
field is broken. This patch fixes it by taking a different
approach. We expose the raw reading in milli Gauss (to avoid
rounding errors) with a scale of 0.001.
Thus the final computation is done in userspace where floating
point operation are more relaxed.
Fixes: abeb6b1e7b ("iio: magnetometer: Add support for MEMSIC MMC35240")
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
This avoid nasty crashes when registering the IIO device.
Fixes: abeb6b1e7b ("iio: magnetometer: Add support for MEMSIC MMC35240")
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
This is the standard convention for i2c device name and
also this is the name used in some Intel platforms DT
files.
Fixes: abeb6b1e7b ("iio: magnetometer: Add support for MEMSIC MMC35240")
Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
The MAC address of the soft-interface is used to initialise
the "non-purge" TT entry of each existing VLAN. Therefore
when the user invokes ndo_set_mac_address() all the
"non-purge" TT entries have to be updated, not only the one
belonging to the non-tagged network.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The header files could not be build indepdent from each other. This is
happened because headers didn't include the files for things they've used.
This was problematic because the success of a build depended on the
knowledge about the right order of local includes.
Also source files were not including everything they've used explicitly.
Instead they required that transitive includes are always stable. This is
problematic because some transitive includes are not obvious, depend on
config settings and may not be stable in the future.
The order for include blocks are:
* primary headers (main.h and the *.h file of a *.c file)
* global linux headers
* required local headers
* extra forward declarations for pointers in function/struct declarations
The only exceptions are linux/bitops.h and linux/if_ether.h in packet.h.
This header file is shared with userspace applications like batctl and must
therefore build together with userspace applications. The header
linux/bitops.h is not part of the uapi headers and linux/if_ether.h
conflicts with the musl implementation of netinet/if_ether.h. The
maintainers rejected the use of __KERNEL__ preprocessor checks and thus
these two headers are only in main.h. All files using packet.h first have
to include main.h to work correctly.
Reported-by: Markus Pargmann <mpa@pengutronix.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
This API has to be used to let any routing protocol free
neighbor specific allocated resources
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Some mesh attributes are behind substructs in the
batadv_priv object and for this reason the name cannot be
used anymore to refer to them.
This patch allows to specify the variable name where the
attribute is stored inside batadv_priv instead of using the
name
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
An unoptimized version of the Jenkins one-at-a-time hash function is used
and partially copied all over the code wherever an hashtable is used.
Instead the optimized version shared between the whole kernel should be
used to reduce code duplication and use better optimized code.
Only the DAT code must use the old implementation because it is used as
distributed hash function which has to be common for all nodes.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
The block IO (blkio) controller enables the block layer to provide service
guarantees in a hierarchical fashion. Specifically, service guarantees
are provided by registered request-accounting policies. As of now, a
proportional-share and a throttling policy are available. They are
implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
subsystem. Unfortunately, as for adding new policies, the current
implementation of the block IO controller is only halfway ready to allow
new policies to be plugged in. This commit provides a solution to make
the block IO controller fully ready to handle new policies.
In what follows, we first describe briefly the current state, and then
list the changes made by this commit.
The throttling policy does not need any per-cgroup information to perform
its task. In contrast, the proportional share policy uses, for each cgroup,
both the weight assigned by the user to the cgroup, and a set of dynamically-
computed weights, one for each device.
The first, user-defined weight is stored in the blkcg data structure: the
block IO controller allocates a private blkcg data structure for each
cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
In other words, the block IO controller internally mirrors the blkio cgroups
with private blkcg data structures.
On the other hand, for each cgroup and device, the corresponding dynamically-
computed weight is maintained in the following, different way. For each device,
the block IO controller keeps a private blkcg_gq structure for each cgroup in
blkio. In other words, block IO also keeps one private mirror copy of the blkio
cgroups hierarchy for each device, made of blkcg_gq structures.
Each blkcg_gq structure keeps per-policy information in a generic array of
dynamically-allocated 'dedicated' data structures, one for each registered
policy (so currently the array contains two elements). To be inserted into the
generic array, each dedicated data structure embeds a generic blkg_policy_data
structure. Consider now the array contained in the blkcg_gq structure
corresponding to a given pair of cgroup and device: one of the elements
of the array contains the dedicated data structure for the proportional-share
policy, and this dedicated data structure contains the dynamically-computed
weight for that pair of cgroup and device.
The generic strategy adopted for storing per-policy data in blkcg_gq structures
is already capable of handling new policies, whereas the one adopted with blkcg
structures is not, because per-policy data are hard-coded in the blkcg
structures themselves (currently only data related to the proportional-
share policy).
This commit addresses the above issues through the following changes:
. It generalizes blkcg structures so that per-policy data are stored in the same
way as in blkcg_gq structures.
Specifically, it lets also the blkcg structure store per-policy data in a
generic array of dynamically-allocated dedicated data structures. We will
refer to these data structures as blkcg dedicated data structures, to
distinguish them from the dedicated data structures inserted in the generic
arrays kept by blkcg_gq structures.
To allow blkcg dedicated data structures to be inserted in the generic array
inside a blkcg structure, this commit also introduces a new blkcg_policy_data
structure, which is the equivalent of blkg_policy_data for blkcg dedicated
data structures.
. It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
cpd_size field and a cpd_init field, to be initialized by the policy with,
respectively, the size of the blkcg dedicated data structures, and the
address of a constructor function for blkcg dedicated data structures.
. It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
the fields related to the proportional-share policy), into a new blkcg
dedicated data structure called cfq_group_data.
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
PEBSv3 as present on Skylake fixed the long standing issue of the
status bits. They now really reflect the events that generated the
record.
Tested-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch modifies the perf tool to handle the new RECORD type,
PERF_RECORD_LOST_SAMPLES.
The number of lost-sample events is stored in
.nr_events[PERF_RECORD_LOST_SAMPLES]. The exact number of samples
which the kernel dropped is stored in total_lost_samples.
When the percentage of dropped samples is greater than 5%, a warning
is printed.
Here are some examples:
Eg 1, Recording different frequently-occurring events is safe with the
patch. Only a very low drop rate is associated with such actions.
$ perf record -e '{cycles:p,instructions:p}' -c 20003 --no-time ~/tchain ~/tchain
$ perf report -D | tail
SAMPLE events: 120243
MMAP2 events: 5
LOST_SAMPLES events: 24
FINISHED_ROUND events: 15
cycles:p stats:
TOTAL events: 59348
SAMPLE events: 59348
instructions:p stats:
TOTAL events: 60895
SAMPLE events: 60895
$ perf report --stdio --group
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 24
#
# Samples: 120K of event 'anon group { cycles:p, instructions:p }'
# Event count (approx.): 24048600000
#
# Overhead Command Shared Object Symbol
# ................ ........... ................
..................................
#
99.74% 99.86% tchain_edit tchain_edit [.] f3
0.09% 0.02% tchain_edit tchain_edit [.] f2
0.04% 0.00% tchain_edit [kernel.vmlinux] [k] ixgbe_read_reg
Eg 2, Recording the same thing multiple times can lead to high drop
rate, but it is not a useful configuration.
$ perf record -e '{cycles:p,cycles:p}' -c 20003 --no-time ~/tchain
Warning: Processed 600592 samples and lost 99.73% samples!
[perf record: Woken up 148 times to write data]
[perf record: Captured and wrote 36.922 MB perf.data (1206322 samples)]
[perf record: Woken up 1 times to write data]
[perf record: Captured and wrote 0.121 MB perf.data (1629 samples)]
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285195-14269-9-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After enlarging the PEBS interrupt threshold, there may be some mixed up
PEBS samples which are discarded by the kernel.
This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
the number of possible discarded records when it is impossible to demux
the samples.
It makes sure the user is not left in the dark about such discards.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the PEBS buffer size is 4k, it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer).
64k memory can hold about 330 PEBS records. This will significantly
reduce the number of PMIs when batched PEBS interrupts are enabled.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-7-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Flush the PEBS buffer during context switches if PEBS interrupt threshold
is larger than one. This allows perf to supply TID for sample outputs.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.
For frequently occurring events (like cycles or branches or load/store)
this in term requires using a relatively high sampling period to avoid
overloading the system, by only processing PMIs. This in term increases
sampling error.
For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.
So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.
The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.
One use cases is for example to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.
Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
( The only difference of test command between multi and plain is time
stamp options. Since time stamp is not supported by large PEBS
threshold, it can be used as a flag to indicate if large threshold is
enabled during the test. )
period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1
With periods below 100003, plain (threshold one) cause much more
overhead. With 10003 sampling period, the Elapsed Time for multi is
even 2X faster than plain.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>