Commit graph

79307 commits

Author SHA1 Message Date
Al Viro
8cd54c1c84 iov_iter: separate direction from flavour
Instead of having them mixed in iter->type, use separate ->iter_type
and ->data_source (u8 and bool resp.)  And don't bother with (pseudo-)
bitmap for the former - microoptimizations from being able to check
if the flavour is one of two values are not worth the confusion for
optimizer.  It can't prove that we never get e.g. ITER_IOVEC | ITER_PIPE,
so we end up with extra headache.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-10 11:45:08 -04:00
Al Viro
4b6c132b7d iov_iter: switch ..._full() variants of primitives to use of iov_iter_revert()
Use corresponding plain variants, revert on short copy.  That's the way it
should've been done from the very beginning, except that we didn't have
iov_iter_revert() back then...

[fixed another braino caught by Qian Cai <quic_qiancai@quicinc.com>]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-06-10 11:44:23 -04:00
Miquel Raynal
493db2b05d memory: pl353-smc: Let lower level controller drivers handle inits
There is no point in having all these definitions at the SMC bus level,
these are extremely tight to the NAND controller driver implementation,
are not particularly generic, imply more boilerplate than needed, do
not really follow the device model by receiving no argument and some of
them are actually buggy.

Let's get rid of these right now as there is no current user and keep
this driver at a simple level: only the SMC bare initializations.

The NAND controller driver which I am going to introduce will take care
of redefining properly all these helpers and using them directly.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20210610082040.2075611-13-miquel.raynal@bootlin.com
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
2021-06-10 17:15:16 +02:00
Muneendra Kumar
d2bcbeab42 scsi: blkcg: Add app identifier support for blkcg
Add a unique application identifier (i.e fc_app_id member) in blkcg. This
allows identification of traffic belonging to an specific both on the host
and in the fabric infrastructure. As an example, this allows the storage
stack to uniquely identify traffic belong to particular virtual machine.

Link: https://lore.kernel.org/r/20210608043556.274139-3-muneendra.kumar@broadcom.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-10 10:01:32 -04:00
Muneendra Kumar
6b658c4863 scsi: cgroup: Add cgroup_get_from_id()
Add a new function, cgroup_get_from_id(), to retrieve the cgroup associated
with a cgroup id. Also export the function cgroup_get_e_css() as this is
needed in blk-cgroup.h.

Link: https://lore.kernel.org/r/20210608043556.274139-2-muneendra.kumar@broadcom.com
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-10 10:01:31 -04:00
Erik Kaneda
60faa8f1ac ACPI: Add \_SB._OSC bit for PRM
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-10 15:07:55 +02:00
Erik Kaneda
cefc7ca462 ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtype
Platform Runtime Mechanism (PRM) is a firmware interface that exposes
a set of binary executables that can either be called from the AML
interpreter or device drivers by bypassing the AML interpreter.
This change implements the AML interpreter path.

According to the specification [1], PRM services are listed in an
ACPI table called the PRMT. This patch parses module and handler
information listed in the PRMT and registers the PlatformRtMechanism
OpRegion handler before ACPI tables are loaded.

Each service is defined by a 16-byte GUID and called from writing a
26-byte ASL buffer containing the identifier to a FieldUnit object
defined inside a PlatformRtMechanism OperationRegion.

    OperationRegion (PRMR, PlatformRtMechanism, 0, 26)
    Field (PRMR, BufferAcc, NoLock, Preserve)
    {
        PRMF, 208 // Write to this field to invoke the OperationRegion Handler
    }

The 26-byte ASL buffer is defined as the following:

Byte Offset   Byte Length    Description
=============================================================
     0             1         PRM OperationRegion handler status
     1             8         PRM service status
     9             1         PRM command
    10            16         PRM handler GUID

The ASL caller fills out a 26-byte buffer containing the PRM command
and the PRM handler GUID like so:

    /* Local0 is the PRM data buffer */
    Local0 = buffer (26){}

    /* Create byte fields over the buffer */
    CreateByteField (Local0, 0x9, CMD)
    CreateField (Local0, 0x50, 0x80, GUID)

    /* Fill in the command and data fields of the data buffer */
    CMD = 0 // run command
    GUID = ToUUID("xxxx-xx-xxx-xxxx")

    /*
     * Invoke PRM service with an ID that matches GUID and save the
     * result.
     */
    Local0 = (\_SB.PRMT.PRMF = Local0)

Byte offset 0 - 8 are written by the handler as a status passed back to AML
and used by ASL like so:

    /* Create byte fields over the buffer */
    CreateByteField (Local0, 0x0, PSTA)
    CreateQWordField (Local0, 0x1, USTA)

In this ASL code, PSTA contains a status from the OperationRegion and
USTA contains a status from the PRM service.

The 26-byte buffer is recieved by acpi_platformrt_space_handler. This
handler will look at the command value and the handler guid and take
the approperiate actions.

Command value    Action
=====================================================================
    0            Run the PRM service indicated by the PRM handler
                 GUID (bytes 10-26)

    1            Prevent PRM runtime updates from happening to the
                 service's parent module

    2            Allow PRM updates from happening to the service's parent module

This patch enables command value 0.

Link: https://uefi.org/sites/default/files/resources/Platform%20Runtime%20Mechanism%20-%20with%20legal%20notice.pdf # [1]
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-10 15:06:54 +02:00
Marc Zyngier
e1c054918c genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
Despite the name, handle_domain_irq() deals with non-irqdomain
handling for the sake of a handful of legacy ARM platforms.

Move such handling into ARM's handle_IRQ(), allowing for better
code generation for everyone else. This allows us get rid of
some complexity, and to rearrange the guards on the various helpers
in a more logical way.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:19 +01:00
Marc Zyngier
8240ef50d4 genirq: Add generic_handle_domain_irq() helper
Provide generic_handle_domain_irq() as a pendent to handle_domain_irq()
for non-root interrupt controllers

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:19 +01:00
Marc Zyngier
9626d18a20 irqdesc: Fix __handle_domain_irq() comment
It appears that the comment about a NULL domain meaning anything
has always been wrong. Fix it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:18 +01:00
Marc Zyngier
a3016b26ee genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
In order to start reaping the benefits of irq_resolve_mapping(),
start using it in __handle_domain_irq() and handle_domain_nmi().

This involves splitting generic_handle_irq() to be able to directly
provide the irq_desc.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:18 +01:00
Marc Zyngier
d22558dd0a irqdomain: Introduce irq_resolve_mapping()
Rework irq_find_mapping() to return an both an irq_desc pointer,
optionally the virtual irq number, and rename the result to
__irq_resolve_mapping(). a new helper called irq_resolve_mapping()
is provided for code that doesn't need the virtual irq number.

irq_find_mapping() is also rewritten in terms of __irq_resolve_mapping().

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:18 +01:00
Marc Zyngier
d4a45c68dc irqdomain: Protect the linear revmap with RCU
It is pretty odd that the radix tree uses RCU while the linear
portion doesn't, leading to potential surprises for the users,
depending on how the irqdomain has been created.

Fix this by moving the update of the linear revmap under
the mutex, and the lookup under the RCU read-side lock.

The mutex name is updated to reflect that it doesn't only
cover the radix-tree anymore.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:18 +01:00
Marc Zyngier
48b15a7921 irqdomain: Cache irq_data instead of a virq number in the revmap
Caching a virq number in the revmap is pretty inefficient, as
it means we will need to convert it back to either an irq_data
or irq_desc to do anything with it.

It is also a bit odd, as the radix tree does cache irq_data
pointers.

Change the revmap type to be an irq_data pointer instead of
an unsigned int, and preserve the current API for now.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:18 +01:00
Marc Zyngier
4f86a06e2d irqdomain: Make normal and nomap irqdomains exclusive
Direct mappings are completely exclusive of normal mappings, meaning
that we can refactor the code slightly so that we can get rid of
the revmap_direct_max_irq field and use the revmap_size field
instead, reducing the size of the irqdomain structure.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:17 +01:00
Marc Zyngier
e37af8011a powerpc: Move the use of irq_domain_add_nomap() behind a config option
Only a handful of old PPC systems are still using the old 'nomap'
variant of the irqdomain library. Move the associated definitions
behind a configuration option, which will allow us to make some
more radical changes.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:17 +01:00
Marc Zyngier
1da027362a irqdomain: Reimplement irq_linear_revmap() with irq_find_mapping()
irq_linear_revmap() is supposed to be a fast path for domain
lookups, but it only exposes low-level details of the irqdomain
implementation, details which are better kept private.

The *overhead* between the two is only a function call and
a couple of tests, so it is likely that noone can show any
meaningful difference compared to the cost of taking an
interrupt.

Reimplement irq_linear_revmap() with irq_find_mapping()
in order to preserve source code compatibility, and
rename the internal field for a measure.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:17 +01:00
Marc Zyngier
405e94e9ae irqdomain: Kill irq_domain_add_legacy_isa
This helper doesn't have a user anymore, let's remove it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-06-10 13:09:17 +01:00
Parav Pandit
9739ba327c iommu/vt-d: Define counter explicitly as unsigned int
Avoid below checkpatch warning.

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
+       unsigned        iommu_refcnt[DMAR_UNITS_SUPPORTED];

Fixes: 29a27719ab ("iommu/vt-d: Replace iommu_bmp with a refcount")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210530075053.264218-1-parav@nvidia.com
Link: https://lore.kernel.org/r/20210610020115.1637656-23-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:14 +02:00
Parav Pandit
74f6d776ae iommu/vt-d: Removed unused iommu_count in dmar domain
DMAR domain uses per DMAR refcount. It is indexed by iommu seq_id.
Older iommu_count is only incremented and decremented but no decisions
are taken based on this refcount. This is not of much use.

Hence, remove iommu_count and further simplify domain_detach_iommu()
by returning void.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210530075053.264218-1-parav@nvidia.com
Link: https://lore.kernel.org/r/20210610020115.1637656-21-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:14 +02:00
Parav Pandit
1f106ff0ea iommu/vt-d: Use bitfields for DMAR capabilities
IOTLB device presence, iommu coherency and snooping are boolean
capabilities. Use them as bits and keep them adjacent.

Structure layout before the reorg.
$ pahole -C dmar_domain drivers/iommu/intel/dmar.o
struct dmar_domain {
        int                        nid;                  /*     0     4 */
        unsigned int               iommu_refcnt[128];    /*     4   512 */
        /* --- cacheline 8 boundary (512 bytes) was 4 bytes ago --- */
        u16                        iommu_did[128];       /*   516   256 */
        /* --- cacheline 12 boundary (768 bytes) was 4 bytes ago --- */
        bool                       has_iotlb_device;     /*   772     1 */

        /* XXX 3 bytes hole, try to pack */

        struct list_head           devices;              /*   776    16 */
        struct list_head           subdevices;           /*   792    16 */
        struct iova_domain         iovad __attribute__((__aligned__(8)));
							 /*   808  2320 */
        /* --- cacheline 48 boundary (3072 bytes) was 56 bytes ago --- */
        struct dma_pte *           pgd;                  /*  3128     8 */
        /* --- cacheline 49 boundary (3136 bytes) --- */
        int                        gaw;                  /*  3136     4 */
        int                        agaw;                 /*  3140     4 */
        int                        flags;                /*  3144     4 */
        int                        iommu_coherency;      /*  3148     4 */
        int                        iommu_snooping;       /*  3152     4 */
        int                        iommu_count;          /*  3156     4 */
        int                        iommu_superpage;      /*  3160     4 */

        /* XXX 4 bytes hole, try to pack */

        u64                        max_addr;             /*  3168     8 */
        u32                        default_pasid;        /*  3176     4 */

        /* XXX 4 bytes hole, try to pack */

        struct iommu_domain        domain;               /*  3184    72 */

        /* size: 3256, cachelines: 51, members: 18 */
        /* sum members: 3245, holes: 3, sum holes: 11 */
        /* forced alignments: 1 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

After arranging it for natural padding and to make flags as u8 bits, it
saves 8 bytes for the struct.

struct dmar_domain {
        int                        nid;                  /*     0     4 */
        unsigned int               iommu_refcnt[128];    /*     4   512 */
        /* --- cacheline 8 boundary (512 bytes) was 4 bytes ago --- */
        u16                        iommu_did[128];       /*   516   256 */
        /* --- cacheline 12 boundary (768 bytes) was 4 bytes ago --- */
        u8                         has_iotlb_device:1;   /*   772: 0  1 */
        u8                         iommu_coherency:1;    /*   772: 1  1 */
        u8                         iommu_snooping:1;     /*   772: 2  1 */

        /* XXX 5 bits hole, try to pack */
        /* XXX 3 bytes hole, try to pack */

        struct list_head           devices;              /*   776    16 */
        struct list_head           subdevices;           /*   792    16 */
        struct iova_domain         iovad __attribute__((__aligned__(8)));
							 /*   808  2320 */
        /* --- cacheline 48 boundary (3072 bytes) was 56 bytes ago --- */
        struct dma_pte *           pgd;                  /*  3128     8 */
        /* --- cacheline 49 boundary (3136 bytes) --- */
        int                        gaw;                  /*  3136     4 */
        int                        agaw;                 /*  3140     4 */
        int                        flags;                /*  3144     4 */
        int                        iommu_count;          /*  3148     4 */
        int                        iommu_superpage;      /*  3152     4 */

        /* XXX 4 bytes hole, try to pack */

        u64                        max_addr;             /*  3160     8 */
        u32                        default_pasid;        /*  3168     4 */

        /* XXX 4 bytes hole, try to pack */

        struct iommu_domain        domain;               /*  3176    72 */

        /* size: 3248, cachelines: 51, members: 18 */
        /* sum members: 3236, holes: 3, sum holes: 11 */
        /* sum bitfield members: 3 bits, bit holes: 1, sum bit holes: 5 bits */
        /* forced alignments: 1 */
        /* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));

Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210530075053.264218-1-parav@nvidia.com
Link: https://lore.kernel.org/r/20210610020115.1637656-20-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:13 +02:00
Lu Baolu
55ee5e67a5 iommu/vt-d: Add common code for dmar latency performance monitors
The execution time of some operations is very performance critical, such
as cache invalidation and PRQ processing time. This adds some common code
to monitor the execution time range of those operations. The interfaces
include enabling/disabling, checking status, updating sampling data and
providing a common string format for users.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210520031531.712333-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20210610020115.1637656-14-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:13 +02:00
Lu Baolu
e93a67f5a0 iommu/vt-d: Add prq_report trace event
This adds a new trace event to track the page fault request report.
This event will provide almost all information defined in a page
request descriptor.

A sample output:
| prq_report: dmar0/0000:00:0a.0 seq# 1: rid=0x50 addr=0x559ef6f97 r---- pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 2: rid=0x50 addr=0x559ef6f9c rw--l pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 3: rid=0x50 addr=0x559ef6f98 r---- pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 4: rid=0x50 addr=0x559ef6f9d rw--l pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 5: rid=0x50 addr=0x559ef6f99 r---- pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 6: rid=0x50 addr=0x559ef6f9e rw--l pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 7: rid=0x50 addr=0x559ef6f9a r---- pasid=0x2 index=0x1
| prq_report: dmar0/0000:00:0a.0 seq# 8: rid=0x50 addr=0x559ef6f9f rw--l pasid=0x2 index=0x1

This will be helpful for I/O page fault related debugging.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210520031531.712333-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20210610020115.1637656-13-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:13 +02:00
Lu Baolu
4c82b88696 iommu/vt-d: Allocate/register iopf queue for sva devices
This allocates and registers the iopf queue infrastructure for devices
which want to support IO page fault for SVA.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210520031531.712333-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20210610020115.1637656-11-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:13 +02:00
Lu Baolu
4048377414 iommu/vt-d: Use iommu_sva_alloc(free)_pasid() helpers
Align the pasid alloc/free code with the generic helpers defined in the
iommu core. This also refactored the SVA binding code to improve the
readability.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20210520031531.712333-1-baolu.lu@linux.intel.com
Link: https://lore.kernel.org/r/20210610020115.1637656-8-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-06-10 09:06:12 +02:00
Vlad Buslov
19e9bfa044 net/mlx5: Bridge, add offload infrastructure
Create new files bridge.{c|h} in en/rep directory that implement bridge
interaction with representor netdevices and handle required
events/notifications, bridge.{c|h} in esw directory that implement all
necessary eswitch offloading infrastructure and works on vport/eswitch
level. Provide new kconfig MLX5_BRIDGE which is automatically selected when
both kernel bridge and mlx5 eswitch configs are enabled.

Provide basic infrastructure for bridge offloads:

- struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure
that encapsulates generic bridge-offloads data (notifier blocks, ingress
flow table/group, etc.) that is created/deleted on enable/disable eswitch
offloads.

- struct mlx5_esw_bridge - per-bridge structure that encapsulates
per-bridge data (reference counter, FDB, egress flow table/group, etc.)
that is created when first eswitch represetor is attached to new bridge and
deleted when last representor is removed from the bridge as a result of
NETDEV_CHANGEUPPER event.

The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB
namespace. The new priority is between tc-miss and slow path priorities.
Priority consist of two levels: the ingress table that is global per
eswitch and matches incoming packets by src_mac/vid and redirects them to
next level (egress table) that is chosen according to ingress port bridge
membership and matches on dst_mac/vid in order to redirect packet to vport
according to the following diagram:

                +
                |
      +---------v----------+
      |                    |
      |   FDB_TC_OFFLOAD   |
      |                    |
      +---------+----------+
                |
                |
      +---------v----------+
      |                    |
      |   FDB_FT_OFFLOAD   |
      |                    |
      +---------+----------+
                |
                |
      +---------v----------+
      |                    |
      |    FDB_TC_MISS     |
      |                    |
      +---------+----------+
                |
+--------------------------------------+
|               |                      |
|        +------+                      |
|        |                             |
| +------v--------+   FDB_BR_OFFLOAD   |
| | INGRESS_TABLE |                    |
| +------+---+----+                    |
|        |   |      match              |
|        |   +---------+               |
|        |             |               |    +-------+
|        |     +-------v-------+ match |    |       |
|        |     | EGRESS_TABLE  +------------> vport |
|        |     +-------+-------+       |    |       |
|        |             |               |    +-------+
|        |    miss     |               |
|        +------+------+               |
|               |                      |
+--------------------------------------+
                |
                |
      +---------v----------+
      |                    |
      |   FDB_SLOW_PATH    |
      |                    |
      +---------+----------+
                |
                v

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:09 -07:00
Vlad Buslov
ec3be8873d net/mlx5: Create TC-miss priority and table
In order to adhere to kernel software datapath model bridge offloads must
come after TC and NF FDBs. Following patches in this series add new FDB
priority for bridge after FDB_FT_OFFLOAD. However, since netfilter offload
is implemented with unmanaged tables, its miss path is not automatically
connected to next priority and requires the code to manually connect with
slow table. To keep bridge offloads encapsulated and not mix it with
eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD
and FDB_SLOW_PATH:

          +
          |
+---------v----------+
|                    |
|   FDB_TC_OFFLOAD   |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|   FDB_FT_OFFLOAD   |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|    FDB_TC_MISS     |
|                    |
+---------+----------+
          |
          |
          |
+---------v----------+
|                    |
|   FDB_SLOW_PATH    |
|                    |
+---------+----------+
          |
          v

Initialize the new priority with single default empty managed table and use
the table as TC/NF miss patch instead of slow table. This approach allows
bridge offloads to be created as new FDB namespace priority between
FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any
other modules since miss path of managed TC-miss table is automatically
wired to next priority.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:08 -07:00
Yevgeny Kliteynik
3f3f05ab88 net/mlx5: Added new parameters to reformat context
Adding new reformat context type (INSERT_HEADER) requires adding two new
parameters to reformat context - reformat_param_0 and reformat_param_1.
As defined by HW spec, these parameters have different meaning for
different reformat context type.

The first parameter (reformat_param_0) is not new to HW spec, but it
wasn't used by any of the supported reformats. The second parameter
(reformat_param_1) is new to the HW spec - it was added to allow
supporting INSERT_HEADER.

For NSERT_HEADER, reformat_param_0 indicates the header used to
reference the location of the inserted header, and reformat_param_1
indicates the offset of the inserted header from the reference point
defined by reformat_param_0.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:07 -07:00
Yevgeny Kliteynik
67133eaa93 net/mlx5: mlx5_ifc support for header insert/remove
Add support for HCA caps 2 that contains capabilities for the new
insert/remove header actions.

Added the required definitions for supporting the new reformat type:
added packet reformat parameters, reformat anchors and definitions
to allow copy/set into the inserted EMD (Embedded MetaData) tag.

Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 18:36:06 -07:00
Dima Chumak
a3e5fd9314 net/mlx5e: Fix page reclaim for dead peer hairpin
When adding a hairpin flow, a firmware-side send queue is created for
the peer net device, which claims some host memory pages for its
internal ring buffer. If the peer net device is removed/unbound before
the hairpin flow is deleted, then the send queue is not destroyed which
leads to a stack trace on pci device remove:

[ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource
[ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110
[ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0
[ 748.002171] ------------[ cut here ]------------
[ 748.001177] FW pages counter is 4 after reclaiming all pages
[ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]                      [  +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core]
[ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1
[ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core]
[ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9
[ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286
[ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000
[ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51
[ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8
[ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30
[ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000
[ 748.001429] FS:  00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000
[ 748.001695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0
[ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 748.001654] Call Trace:
[ 748.000576]  ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core]
[ 748.001416]  ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core]
[ 748.001354]  ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core]
[ 748.001203]  mlx5_function_teardown+0x30/0x60 [mlx5_core]
[ 748.001275]  mlx5_uninit_one+0xa7/0xc0 [mlx5_core]
[ 748.001200]  remove_one+0x5f/0xc0 [mlx5_core]
[ 748.001075]  pci_device_remove+0x9f/0x1d0
[ 748.000833]  device_release_driver_internal+0x1e0/0x490
[ 748.001207]  unbind_store+0x19f/0x200
[ 748.000942]  ? sysfs_file_ops+0x170/0x170
[ 748.001000]  kernfs_fop_write_iter+0x2bc/0x450
[ 748.000970]  new_sync_write+0x373/0x610
[ 748.001124]  ? new_sync_read+0x600/0x600
[ 748.001057]  ? lock_acquire+0x4d6/0x700
[ 748.000908]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[ 748.001126]  ? fd_install+0x1c9/0x4d0
[ 748.000951]  vfs_write+0x4d0/0x800
[ 748.000804]  ksys_write+0xf9/0x1d0
[ 748.000868]  ? __x64_sys_read+0xb0/0xb0
[ 748.000811]  ? filp_open+0x50/0x50
[ 748.000919]  ? syscall_enter_from_user_mode+0x1d/0x50
[ 748.001223]  do_syscall_64+0x3f/0x80
[ 748.000892]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 748.001026] RIP: 0033:0x7f58bcfb22f7
[ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7
[ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003
[ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001
[ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d
[ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700
[ 748.001564] irq event stamp: 0
[ 748.000787] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0
[ 748.001854] softirqs last  enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0
[ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 748.001492] ---[ end trace a6fabd773d1c51ae ]---

Fix by destroying the send queue of a hairpin peer net device that is
being removed/unbound, which returns the allocated ring buffer pages to
the host.

Fixes: 4d8fcf216c ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules")
Signed-off-by: Dima Chumak <dchumak@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09 17:20:03 -07:00
David S. Miller
7f3579e189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add nfgenmsg field to nfnetlink's struct nfnl_info and use it.

2) Remove nft_ctx_init_from_elemattr() and nft_ctx_init_from_setattr()
   helper functions.

3) Add the nf_ct_pernet() helper function to fetch the conntrack
   pernetns data area.

4) Expose TCP and UDP flowtable offload timeouts through sysctl,
   from Oz Shlomo.

5) Add nfnetlink_hook subsystem to fetch the netfilter hook
   pipeline configuration, from Florian Westphal. This also includes
   a new field to annotate the hook type as metadata.

6) Fix unsafe memory access to non-linear skbuff in the new SCTP
   chunk support for nft_exthdr, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-09 14:50:35 -07:00
Linus Torvalds
a4c30b8691 A trivial update to the compiler attributes:
- Add continue in comment (from Wei Ming Chen)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmDAxXcACgkQGXyLc2ht
 IW2/PQ//aNoXZVRmuG+bFVkEdm/jbQycjEgqw/jbFzZ/ObA5ujunRej9zaeb2T2A
 aQNXjXw/j0WWYUp+hxQkuQv0Fqaf5nHY7clht/KoOgeJf8dccu30dFZM/f7vvawB
 wBJe53t+ItpUpp/RcRPcqSyM4RfVr+8E1TYI6wTYqB8p9J3pfkT8Sh8X3+cmXUf4
 YXTezJ/rpMYu8DhvQI0CGYPPggvpwgbaP1O16aeAVpCka7OxnQQa3zbvSNqMOwzb
 Hlz2NnJh1h235Q/HJYDpHH8cIzjnwATmODet9U8F5Nxx6PqCkj/6zJy2n1KEx6mh
 nZv996oki3TVZvsFCoIBPqBWBD2xdd5gDXdSEssVxi0I0Z2WKUhjoFW5h3zKu2UG
 V/akCpYEhoJltixlWY0jelAbpSzZKMblwQdHzbmUWEISOfkAZ9fvutvujBS00KPk
 jxzggO5ZiwY5Kahkrj8wFlJ79wrnHsRkTWUbRg2LoXU6tu1b7o+k8dIPZjYxRR4G
 S1uL98dsfctaoz8R33++Ntg7C1PKS1seN/AGrYE951hzI2tlIV22rqPQpgpwgWbW
 ss7siPEmQ/yN7Fm9pNR2F77NBdOcgXJ+10CnvutbV2SGyPnnDvy6ImCcyR0e2z/h
 zaJ2h3V1BYpE34aPhmyLQwjEdaXB0czMEzlfVHIErstjT/oFJCk=
 =8kWi
 -----END PGP SIGNATURE-----

Merge tag 'compiler-attributes-for-linus-v5.13-rc6' of git://github.com/ojeda/linux

Pull compiler attribute update from Miguel Ojeda:
 "A trivial update to the compiler attributes: Add 'continue' keyword to
  documentation in comment (from Wei Ming Chen)"

* tag 'compiler-attributes-for-linus-v5.13-rc6' of git://github.com/ojeda/linux:
  Compiler Attributes: Add continue in comment
2021-06-09 14:48:29 -07:00
Linus Torvalds
2f673816b2 Bugfixes, including a TLB flush fix that affects processors
without nested page tables.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDAVpQUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkOgf9F97eFxAdod3/wbW9EbsUPR5bMTLE
 +R6Hmvw+yCm/W2cycVGdCSh1BEKNuZN/XfHln2cYVfVr6ndog58A4Y0urFAhTROv
 IHs8TCA5biQitoZ716l88ExOitnqJiSmMhGex969+zm1Lb9MQo1KA/zxERlqCi3s
 Pfcxb6I8VbD9LEb6NaQdDgQoslJo1tzhe9gGYAYrpMOZujpj1RPeIOZIfeII0MP/
 g14/JSar8cXc9QJ6zbiKn8HhpmzGJnaIsyFFL2RMIBlKvxsnpOU6VmisLTL9407o
 P246Vq59BM8pdRCVUW9W9hLr2ho8lmi+ZYXASCm+qfn8cLaHyRCqSK56ZQ==
 =nW43
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Bugfixes, including a TLB flush fix that affects processors without
  nested page tables"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: fix previous commit for 32-bit builds
  kvm: avoid speculation-based attacks from out-of-range memslot accesses
  KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync
  KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message
  selftests: kvm: Add support for customized slot0 memory size
  KVM: selftests: introduce P47V64 for s390x
  KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior
  KVM: X86: MMU: Use the correct inherited permissions to get shadow page
  KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer
  KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821c
2021-06-09 13:09:57 -07:00
Ricky Wu
3df4fce739 misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG
aspm (Active State Power Management)
rtsx_comm_set_aspm: this function is for driver to make sure
not enter power saving when processing of init and card_detcct
ASPM_MODE_CFG: 8411 5209 5227 5229 5249 5250
Change back to use original way to control aspm
ASPM_MODE_REG: 5227A 524A 5250A 5260 5261 5228
Keep the new way to control aspm

Fixes: 121e9c6b5c ("misc: rtsx: modify and fix init_hw function")
Reported-by: Chris Chiu <chris.chiu@canonical.com>
Tested-by: Gordon Lack <gordon.lack@dsl.pipex.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Ricky Wu <ricky_wu@realtek.com>
Link: https://lore.kernel.org/r/20210607101634.4948-1-ricky_wu@realtek.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09 19:10:22 +02:00
Tom Rix
895ec9c09a fpga-mgr: change FPGA indirect article to an
Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210608212350.3029742-9-trix@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09 14:51:25 +02:00
Tom Rix
e7555cf6c2 fpga: bridge: change FPGA indirect article to an
Change use of 'a fpga' to 'an fpga'

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210608212350.3029742-8-trix@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09 14:51:25 +02:00
Shaokun Zhang
a69008475f vt: vt_kern.h, remove the repeated declaration
Function 'vt_set_led_state' is declared twice, so remove the
repeated declaration.

Cc: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/1623062933-52943-1-git-send-email-zhangshaokun@hisilicon.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09 14:44:31 +02:00
Thierry Reding
48a74b1147 reset: Add compile-test stubs
Add stubs for the reset controller registration functions to allow
building reset controller provider drivers with the COMPILE_TEST
Kconfig option enabled.

Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Suggested-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Link: https://lore.kernel.org/r/20210609112806.3565057-3-thierry.reding@gmail.com
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
2021-06-09 13:37:16 +02:00
Greg Kroah-Hartman
4ccf359849
spi: remove spi_set_cs_timing()
No one seems to be using this global and exported function, so remove it
as it is no longer needed.

Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20210609071918.2852069-1-gregkh@linuxfoundation.org
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-06-09 11:55:46 +01:00
Greg Kroah-Hartman
6771fb0b94 1st set of new IIO/counter device support, features and cleanup for 5.14
There are a couple of large cleanup sets in here alongside a number of new
 drivers.
 
 Note an immutable branch merged to add a stub for i2c_verify_client()
 as needed to avoid a build issue in the fxls8962af driver as a result of a
 workaround for a device errata that only applies when i2c interface is used.
 
 Counters
 ========
 
 New device support
 * intel,quadrature encoder peripheral found on Elkhart Lake platforms.
   - New driver.
 
 IIO
 ===
 
 New device support
 * amstaos,tsl2591 ambient light sensor.
   - New driver + bindings
   - Follow up fix to ensure some local variables are suitable for error
     handling.
 * fsl,fxls8962af + fsl,fxls8964af accelerometers
   - New driver + bindings
   - Includes an errata work around that cause a build bot failure fixed
     by adding a stub to i2c.
 * kionix,kxcjk-1013
   - Add support for KX023-1025 device.  Mostly a different register map
     that needed to be supported.
 * murata,sca3300 accelerometer
   - New driver + bindings
 * st,lsm9ds0 IMU
   - Rework of st,sensors driver to cleanly support this new glue driver
     that enables the two parts of the LSM9DS0.
 * ti,tsc2046 touchscreen controller ADC.
   - New driver. Intent here is to use this with existing IIO consumer
     drivers such as resistive-adc-touch.
   - Follow up fix to avoid an issue with unsigned subtraction in error
     check.
 * ti,tmp117 digital temperature sensor
   - New driver + bindings
 
 Features
 * adi,ad5755
   - Add missing dt-binding doc
 * adi,ad7298
   - Add ACPI ID used on Intel Galileo Gen 1 boards.
   - Add missing dt-binding doc
 * adi,ad7476
   - Add missing dt-binding doc
 * adi,ad7746
   - Add missing dt-binding doc for this driver that will hopefully move out
     of staging shortly. Update staging driver to use the binding instead of
     platform data.
 * adi,adis16201 + adis16209
   - Add missing dt-binding doc
 * adi,adis16480
   - Support burst mode for adis16495 and adis16497 parts.
 * bosch,bma220
   - Add missing dt-binding doc
 * fsl,mma7455
   - Add missing dt-binding doc
 * iio-rescale
   - Support handling of processed channels from provider.  Some ADCs
     require (typically non linear) calibration functions to be applied,
     and so provide only IIO_CHAN_INFO_PROCESSED read back. They can be
     used as providers to the iio-rescale driver which needs to handle them
     somewhat differently from IIO_CHAN_INFO_RAW
 * sensiron,sps30
   - Support the serial interface.  Note this required significant
     refactoring of existing driver.
 * st,st-sensors
   - Add mount matrix support for normal dt-binding whilst continuing to
     support the odd ACPI approach for accelerometers.
 * ti,dac082s085 + similar
   - Add missing dt-binding doc
 * trivial-devices - add entries for
   - memsic,mx4005, memsic,mx6255 and memsic,mxc6655
   - sensortek,stk8312 and sensortek,stk8ba50
 
 Cleanup / minor fixes
 * core
   - Use devm_add_action_or_reset() to replace boilerplate in several
     driver and core IIO devm_* functions.
   - Fix an error path issue introduced by above, that could return an
     error pointer rather than the expected null from dev_iio_device_alloc()
   - Move more IIO internals related fields from struct iio_dev to
     struct iio_dev_opaque.
   - Drop unused final update of in_loc in demux setup.
 * Docs
   - Move some docs from driver specific to core dos to avoid replication
     of names which the documentation builder does not allow.
     Note this means adding a few device specific notes to the general docs
     to cover the more unusual uses of the ABI.
   - ABI: Move old buffer/* and scan_elements/* docs to obsolete as now we
     have the bufferX/* variant.  Not we are not getting rid of these
     interfaces, just encouraging new code to use the new interface.
 * IIO wide:
   - Tidy up new cases of dev.parent etc being set in drivers as the core
     now does it.
   - Fix more cases of possible miss-aligned buffers when passed to
     iio_push_to_buffers_with_timestamp().  Note we only have one known
     instance of anyone seeing this bug actually happening, so this has been
     a low priority cleanup effort for several cycles.
   - sysfs_emit() used in more drivers.
   - Runtime pm tidy up and use of pm_runtime_resume_and_get()
   - Buffer alignment fixes as iio_push_to_buffers_with_timestamp requires
     that the timestamp when inserted by naturally aligned + consumers can
     assume that all fields are naturally aligned. Part of a long running
     effort, with at least 2 more series to come tackling additional
     variants.
   - Stop specifying "mount-matrix" property name in every lookup of the
     mount matrix from firmware by hard coding it in the core.
 * adi,ad7476
   - Handle the variety of different regulators used by the parts supported
     by this driver (came up in dt-binding review)
 * adi,ad7746
   - Trivial drop of if (ret) return ret; return 0; pattern
   - Tidy up comments
   - Pull capdac setup out to own function.
 * adi,ad7766
   - Trivial drop of if (ret) return ret; return 0; pattern
 * adi,adis
   - Avoid returning error codes in interrupt handlers.
   - Handle a failure in spi_write in the trigger handler.
   - Rework to add updating of device page after changing it.
   - Don't push data to IIO buffers when read failed.
   - Add configuration of burst max speed to core avoid handling this in
     each driver.
   - Use the adis_dev_lock() helper in adis16260 and adis16136 drivers.
   - Excessive includes cleanup via include-what-you-use static checker
     after zero day highlighted that these needed updating.
 * afe
   - Amend binding to add #io-channel-cells, thus allowing this IIO
     consumer to also be an IIO provider.
 * aosong,am2315
   - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
     ACPI naming.
 * bosch,bma180
   - Adding missing bandwidth settings (500, 1000 Hz)
 * bosch,bme680
   - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
     ACPI naming.
 * ep93xx_adc,
   - Drop a redundant error print.
 * maxim,max118
   - Convert remainder of probe() to devm_ managed functions.
   - Avoid some repeated jumping back and forth between iio_dev and
     spi structures.
 * maxim,max11100
   - Use get_unaligned_be16() instead of open coding.
   - Convert remainder of probe() to devm_ managed functions.
 * samsung,exynos_adc
   - Unused error value dropped.
 * sensiron,sgp30
   - Drop use of %hx in favor of %x and letting the normal type conversion
     work.
 * sensortek,stk8312
   - Add lowercase device id and note uppercase version deprecated.
   - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
     ACPI naming.
 * sprx,sc72xx_adc
   - add MODULE_DEVICE_TABLE
 * st,lsm6dsx
   - Fix docs of valid ODRs
 * st,sensors
   - dt-binding rework.  Two efforts on this crossed in a previous cycle
     so this update cherry picks the best of the two yaml conversions.
   - Don't copy the channel spec array as now ext_info is no longer modified.
 * st,stm32-adc
   - tidy up some docs that were marked as kernel-doc but aren't.
 * ti,adc081c, ti,adc0832, ti,adc108s102 and ti,adc161s626
   - Convert remainder of probe() functions to devm_ managed functions
     to simplify error handing and remove paths.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEEbilms4eEBlKRJoGxVIU0mcT0FogFAmC/mCoRHGppYzIzQGtl
 cm5lbC5vcmcACgkQVIU0mcT0FoidAQ//SqpbBeEy8HATSHccooHwHI3eK+hnj0n9
 9zr6/7o52EQ0lFN6V7OLp0XNL3DNIV8oYAyvzYZ4Qh2NXLYQHDnqiiUyLxCfctqu
 Ii+9NmVmuk/PlPRRubQRZE+Czdtwgsp7dRQOYJiuxUeKVD/EUVjl1FmpsiPtGeaa
 iU6JaYtdF3ie0b1zQCwQTYYsM8lZ2/ovKW8F29K5ALnrDd9h6Ad0p5QDvyDxyajp
 VyLRJa7nwAfK5rP6efuNsDfzbMycTPtHkcC+Pgec/2RrXL5mDz4EXHI1nOUZAGdb
 UaN/uDpytAxJZk6Fs2f+RdgevlQgpBxAXGDHE2YHkcZi7X0ppWOjeIZFSDbDiaHO
 XlSQgOelUqKtHhRZ3MYHxbSOgO3Vif6ecCDMNCN78E0YE3kQHHSwY0JMGgUeHIGG
 hQPKGaD1AKzh7AsbPbazYW6VX4dDDWcr8pQ8D9wWLUKikcZLKqRH5uAwvjZ+NjuC
 Bfnjx/QhmIhbs0gFaw4Q5mvYQ3Zmfh7nzcW98jwcbR6pOqKvIeqzw9OARRHaimrd
 /GRCiccxKtU8J7f5l+MSzYQt4hT0Ef1vuq9Ak5SDCr3Fwnix5ipFVLkipWvgJ7JD
 OqubcwwW5EfgZPY/X7nsK/U6v8SlqF4XrvCVky4MUt0x1YXxc/tjYak8oLEqpMVC
 gQP3KUZIYeA=
 =Zved
 -----END PGP SIGNATURE-----

Merge tag 'iio-for-5.14a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio into staging-next

Jonathan writes:

1st set of new IIO/counter device support, features and cleanup for 5.14

There are a couple of large cleanup sets in here alongside a number of new
drivers.

Note an immutable branch merged to add a stub for i2c_verify_client()
as needed to avoid a build issue in the fxls8962af driver as a result of a
workaround for a device errata that only applies when i2c interface is used.

Counters
========

New device support
* intel,quadrature encoder peripheral found on Elkhart Lake platforms.
  - New driver.

IIO
===

New device support
* amstaos,tsl2591 ambient light sensor.
  - New driver + bindings
  - Follow up fix to ensure some local variables are suitable for error
    handling.
* fsl,fxls8962af + fsl,fxls8964af accelerometers
  - New driver + bindings
  - Includes an errata work around that cause a build bot failure fixed
    by adding a stub to i2c.
* kionix,kxcjk-1013
  - Add support for KX023-1025 device.  Mostly a different register map
    that needed to be supported.
* murata,sca3300 accelerometer
  - New driver + bindings
* st,lsm9ds0 IMU
  - Rework of st,sensors driver to cleanly support this new glue driver
    that enables the two parts of the LSM9DS0.
* ti,tsc2046 touchscreen controller ADC.
  - New driver. Intent here is to use this with existing IIO consumer
    drivers such as resistive-adc-touch.
  - Follow up fix to avoid an issue with unsigned subtraction in error
    check.
* ti,tmp117 digital temperature sensor
  - New driver + bindings

Features
* adi,ad5755
  - Add missing dt-binding doc
* adi,ad7298
  - Add ACPI ID used on Intel Galileo Gen 1 boards.
  - Add missing dt-binding doc
* adi,ad7476
  - Add missing dt-binding doc
* adi,ad7746
  - Add missing dt-binding doc for this driver that will hopefully move out
    of staging shortly. Update staging driver to use the binding instead of
    platform data.
* adi,adis16201 + adis16209
  - Add missing dt-binding doc
* adi,adis16480
  - Support burst mode for adis16495 and adis16497 parts.
* bosch,bma220
  - Add missing dt-binding doc
* fsl,mma7455
  - Add missing dt-binding doc
* iio-rescale
  - Support handling of processed channels from provider.  Some ADCs
    require (typically non linear) calibration functions to be applied,
    and so provide only IIO_CHAN_INFO_PROCESSED read back. They can be
    used as providers to the iio-rescale driver which needs to handle them
    somewhat differently from IIO_CHAN_INFO_RAW
* sensiron,sps30
  - Support the serial interface.  Note this required significant
    refactoring of existing driver.
* st,st-sensors
  - Add mount matrix support for normal dt-binding whilst continuing to
    support the odd ACPI approach for accelerometers.
* ti,dac082s085 + similar
  - Add missing dt-binding doc
* trivial-devices - add entries for
  - memsic,mx4005, memsic,mx6255 and memsic,mxc6655
  - sensortek,stk8312 and sensortek,stk8ba50

Cleanup / minor fixes
* core
  - Use devm_add_action_or_reset() to replace boilerplate in several
    driver and core IIO devm_* functions.
  - Fix an error path issue introduced by above, that could return an
    error pointer rather than the expected null from dev_iio_device_alloc()
  - Move more IIO internals related fields from struct iio_dev to
    struct iio_dev_opaque.
  - Drop unused final update of in_loc in demux setup.
* Docs
  - Move some docs from driver specific to core dos to avoid replication
    of names which the documentation builder does not allow.
    Note this means adding a few device specific notes to the general docs
    to cover the more unusual uses of the ABI.
  - ABI: Move old buffer/* and scan_elements/* docs to obsolete as now we
    have the bufferX/* variant.  Not we are not getting rid of these
    interfaces, just encouraging new code to use the new interface.
* IIO wide:
  - Tidy up new cases of dev.parent etc being set in drivers as the core
    now does it.
  - Fix more cases of possible miss-aligned buffers when passed to
    iio_push_to_buffers_with_timestamp().  Note we only have one known
    instance of anyone seeing this bug actually happening, so this has been
    a low priority cleanup effort for several cycles.
  - sysfs_emit() used in more drivers.
  - Runtime pm tidy up and use of pm_runtime_resume_and_get()
  - Buffer alignment fixes as iio_push_to_buffers_with_timestamp requires
    that the timestamp when inserted by naturally aligned + consumers can
    assume that all fields are naturally aligned. Part of a long running
    effort, with at least 2 more series to come tackling additional
    variants.
  - Stop specifying "mount-matrix" property name in every lookup of the
    mount matrix from firmware by hard coding it in the core.
* adi,ad7476
  - Handle the variety of different regulators used by the parts supported
    by this driver (came up in dt-binding review)
* adi,ad7746
  - Trivial drop of if (ret) return ret; return 0; pattern
  - Tidy up comments
  - Pull capdac setup out to own function.
* adi,ad7766
  - Trivial drop of if (ret) return ret; return 0; pattern
* adi,adis
  - Avoid returning error codes in interrupt handlers.
  - Handle a failure in spi_write in the trigger handler.
  - Rework to add updating of device page after changing it.
  - Don't push data to IIO buffers when read failed.
  - Add configuration of burst max speed to core avoid handling this in
    each driver.
  - Use the adis_dev_lock() helper in adis16260 and adis16136 drivers.
  - Excessive includes cleanup via include-what-you-use static checker
    after zero day highlighted that these needed updating.
* afe
  - Amend binding to add #io-channel-cells, thus allowing this IIO
    consumer to also be an IIO provider.
* aosong,am2315
  - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
    ACPI naming.
* bosch,bma180
  - Adding missing bandwidth settings (500, 1000 Hz)
* bosch,bme680
  - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
    ACPI naming.
* ep93xx_adc,
  - Drop a redundant error print.
* maxim,max118
  - Convert remainder of probe() to devm_ managed functions.
  - Avoid some repeated jumping back and forth between iio_dev and
    spi structures.
* maxim,max11100
  - Use get_unaligned_be16() instead of open coding.
  - Convert remainder of probe() to devm_ managed functions.
* samsung,exynos_adc
  - Unused error value dropped.
* sensiron,sgp30
  - Drop use of %hx in favor of %x and letting the normal type conversion
    work.
* sensortek,stk8312
  - Add lowercase device id and note uppercase version deprecated.
  - Drop ACPI id. Unlikely this one is in the wild and it isn't valid
    ACPI naming.
* sprx,sc72xx_adc
  - add MODULE_DEVICE_TABLE
* st,lsm6dsx
  - Fix docs of valid ODRs
* st,sensors
  - dt-binding rework.  Two efforts on this crossed in a previous cycle
    so this update cherry picks the best of the two yaml conversions.
  - Don't copy the channel spec array as now ext_info is no longer modified.
* st,stm32-adc
  - tidy up some docs that were marked as kernel-doc but aren't.
* ti,adc081c, ti,adc0832, ti,adc108s102 and ti,adc161s626
  - Convert remainder of probe() functions to devm_ managed functions
    to simplify error handing and remove paths.

* tag 'iio-for-5.14a' of https://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio: (171 commits)
  i2c: core: Add stub for i2c_verify_client() if !CONFIG_I2C
  iio: adis: Cleanout unused headers
  iio: accel: bma180: Add missing 500 Hz / 1000 Hz bandwidth
  counter: Add support for Intel Quadrature Encoder Peripheral
  staging: iio: cdc: ad7746: extract capac setup to own function
  staging: iio: cdc: ad7746: clean up probe return
  staging: iio: cdc: ad7746: remove ordinary comments
  iio: adc: ti-adc161s626: Use devm managed functions for all of probe.
  iio: adc: ti-adc108s102: Use devm managed functions for all of probe()
  iio: adc: ti-adc0832: Use devm managed functions for all of probe()
  iio: adc: ti-adc081c: Use devm managed functions for all of probe()
  iio: adc: max1118: Avoid jumping back and forth between spi and iio structures
  iio: adc: max1118: Use devm_ managed functions for all of probe
  iio: adc: max11100: Use devm_ functions for rest of probe()
  iio: adc: max11100: Use get_unaligned_be16() rather than opencoding.
  iio: chemical: sgp30: Drop use of %hx in format string.
  iio: gyro: st_gyro: Support mount matrix
  iio: magnetometer: st_magn: Support mount matrix
  iio: accel: st_sensors: Stop copying channels
  iio: accel: st_sensors: Support generic mounting matrix
  ...
2021-06-09 12:11:49 +02:00
Paolo Bonzini
4422829e80 kvm: fix previous commit for 32-bit builds
array_index_nospec does not work for uint64_t on 32-bit builds.
However, the size of a memory slot must be less than 20 bits wide
on those system, since the memory slot must fit in the user
address space.  So just store it in an unsigned long.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-09 01:49:13 -04:00
Dario Binacchi
0899431f95 clk: ti: add am33xx/am43xx spread spectrum clock support
The patch enables spread spectrum clocking (SSC) for MPU and LCD PLLs.
As reported by the TI spruh73x/spruhl7x RM, SSC is only supported for
the DISP/LCD and MPU PLLs on am33xx/am43xx. SSC is not supported for
DDR, PER, and CORE PLLs.

Calculating the required values and setting the registers accordingly
was taken from the set_mpu_spreadspectrum routine contained in the
arch/arm/mach-omap2/am33xx/clock_am33xx.c file of the u-boot project.

In locked condition, DPLL output clock = CLKINP *[M/N]. In case of
SSC enabled, the reference manual explains that there is a restriction
of range of M values. Since the omap2_dpll_round_rate routine attempts
to select the minimum possible N, the value of M obtained is not
guaranteed to be within the range required. With the new "ti,min-div"
parameter it is possible to increase N and consequently M to satisfy the
constraint imposed by SSC.

Signed-off-by: Dario Binacchi <dariobin@libero.it>
Reviewed-by: Tero Kristo <kristo@kernel.org>
Link: https://lore.kernel.org/r/20210606202253.31649-6-dariobin@libero.it
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2021-06-08 17:49:16 -07:00
Matthew Hagan
e67f325e9c net: stmmac: explicitly deassert GMAC_AHB_RESET
We are currently assuming that GMAC_AHB_RESET will already be deasserted
by the bootloader. However if this has not been done, probing of the GMAC
will fail. To remedy this we must ensure GMAC_AHB_RESET has been deasserted
prior to probing.

v2 changes:
 - remove NULL condition check for stmmac_ahb_rst in stmmac_main.c
 - unwrap dev_err() message in stmmac_main.c
 - add PTR_ERR() around plat->stmmac_ahb_rst in stmmac_platform.c

v3 changes:
 - add error pointer to dev_err() output
 - add reset_control_assert(stmmac_ahb_rst) in stmmac_dvr_remove
 - revert PTR_ERR() around plat->stmmac_ahb_rst since this is performed
   on the returned value of ret by the calling function

Signed-off-by: Matthew Hagan <mnhagan88@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 16:42:12 -07:00
Sergey Ryazanov
b64d76b782 net: wwan: make WWAN_PORT_MAX meaning less surprised
It is quite unusual when some value can not be equal to a defined range
max value. Also most subsystems defines FOO_TYPE_MAX as a maximum valid
value. So turn the WAN_PORT_MAX meaning from the number of supported
port types to the maximum valid port type.

Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:33:43 -07:00
Voon Weifeng
46682cb86a net: stmmac: enable Intel mGbE 2.5Gbps link speed
The Intel mGbE supports 2.5Gbps link speed by increasing the clock rate by
2.5 times of the original rate. In this mode, the serdes/PHY operates at a
serial baud rate of 3.125 Gbps and the PCS data path and GMII interface of
the MAC operate at 312.5 MHz instead of 125 MHz.

For Intel mGbE, the overclocking of 2.5 times clock rate to support 2.5G is
only able to be configured in the BIOS during boot time. Kernel driver has
no access to modify the clock rate for 1Gbps/2.5G mode. The way to
determined the current 1G/2.5G mode is by reading a dedicated adhoc
register through mdio bus. In short, after the system boot up, it is either
in 1G mode or 2.5G mode which not able to be changed on the fly.

Compared to 1G mode, the 2.5G mode selects the 2500BASEX as PHY interface and
disables the xpcs_an_inband. This is to cater for some PHYs that only
supports 2500BASEX PHY interface with no autonegotiation.

v2: remove MAC supported link speed masking
v3: Restructure  to introduce intel_speed_mode_2500() to read serdes registers
    for max speed supported and select the appropritate configuration.
    Use max_speed to determine the supported link speed mask.

Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:31:43 -07:00
Voon Weifeng
f27abde304 net: pcs: add 2500BASEX support for Intel mGbE controller
XPCS IP supports 2500BASEX as PHY interface. It is configured as
autonegotiation disable to cater for PHYs that does not supports 2500BASEX
autonegotiation.

v2: Add supported link speed masking.
v3: Restructure to introduce xpcs_config_2500basex() used to configure the
    xpcs for 2.5G speeds. Added 2500BASEX specific information for
    configuration.
v4: Fix indentation error

Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Michael Sit Wei Hong <michael.wei.hong.sit@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-08 14:31:43 -07:00
Jan Kara
11c7aa0dde rq-qos: fix missed wake-ups in rq_qos_throttle try two
Commit 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:

CPU1 (waiter1)		CPU2 (waiter2)		CPU3 (waker)
rq_qos_wait()		rq_qos_wait()
  acquire_inflight_cb() -> fails
			  acquire_inflight_cb() -> fails

						completes IOs, inflight
						  decreased
  prepare_to_wait_exclusive()
			  prepare_to_wait_exclusive()
  has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
			  has_sleeper = !wq_has_single_sleeper() -> true
  io_schedule()		  io_schedule()

Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.

Fixes: 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-08 15:12:57 -06:00
Paolo Bonzini
da27a83fd6 kvm: avoid speculation-based attacks from out-of-range memslot accesses
KVM's mechanism for accessing guest memory translates a guest physical
address (gpa) to a host virtual address using the right-shifted gpa
(also known as gfn) and a struct kvm_memory_slot.  The translation is
performed in __gfn_to_hva_memslot using the following formula:

      hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE

It is expected that gfn falls within the boundaries of the guest's
physical memory.  However, a guest can access invalid physical addresses
in such a way that the gfn is invalid.

__gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first
retrieves a memslot through __gfn_to_memslot.  While __gfn_to_memslot
does check that the gfn falls within the boundaries of the guest's
physical memory or not, a CPU can speculate the result of the check and
continue execution speculatively using an illegal gfn. The speculation
can result in calculating an out-of-bounds hva.  If the resulting host
virtual address is used to load another guest physical address, this
is effectively a Spectre gadget consisting of two consecutive reads,
the second of which is data dependent on the first.

Right now it's not clear if there are any cases in which this is
exploitable.  One interesting case was reported by the original author
of this patch, and involves visiting guest page tables on x86.  Right
now these are not vulnerable because the hva read goes through get_user(),
which contains an LFENCE speculation barrier.  However, there are
patches in progress for x86 uaccess.h to mask kernel addresses instead of
using LFENCE; once these land, a guest could use speculation to read
from the VMM's ring 3 address space.  Other architectures such as ARM
already use the address masking method, and would be susceptible to
this same kind of data-dependent access gadgets.  Therefore, this patch
proactively protects from these attacks by masking out-of-bounds gfns
in __gfn_to_hva_memslot, which blocks speculation of invalid hvas.

Sean Christopherson noted that this patch does not cover
kvm_read_guest_offset_cached.  This however is limited to a few bytes
past the end of the cache, and therefore it is unlikely to be useful in
the context of building a chain of data dependent accesses.

Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08 17:12:05 -04:00
Long Li
c9c9762d4d block: return the correct bvec when checking for gaps
After commit 07173c3ec2 ("block: enable multipage bvecs"), a bvec can
have multiple pages. But bio_will_gap() still assumes one page bvec while
checking for merging. If the pages in the bvec go across the
seg_boundary_mask, this check for merging can potentially succeed if only
the 1st page is tested, and can fail if all the pages are tested.

Later, when SCSI builds the SG list the same check for merging is done in
__blk_segment_map_sg_merge() with all the pages in the bvec tested. This
time the check may fail if the pages in bvec go across the
seg_boundary_mask (but tested okay in bio_will_gap() earlier, so those
BIOs were merged). If this check fails, we end up with a broken SG list
for drivers assuming the SG list not having offsets in intermediate pages.
This results in incorrect pages written to the disk.

Fix this by returning the multi-page bvec when testing gaps for merging.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 07173c3ec2 ("block: enable multipage bvecs")
Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/1623094445-22332-1-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-08 15:06:49 -06:00
Suren Baghdasaryan
3958e2d0c3 cgroup: make per-cgroup pressure stall tracking configurable
PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-06-08 14:59:02 -04:00