Commit graph

781771 commits

Author SHA1 Message Date
Boris Pismenny
4799ac81e5 tls: Add rx inline crypto offload
This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.

This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be reencrypted,
only to be decrypted.

The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Boris Pismenny
b190a587c6 tls: Fill software context without allocation
This patch allows tls_set_sw_offload to fill the context in case it was
already allocated previously.

We will use it in TLS_DEVICE to fill the RX software context.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Boris Pismenny
39f56e1a78 tls: Split tls_sw_release_resources_rx
This patch splits tls_sw_release_resources_rx into two functions one
which releases all inner software tls structures and another that also
frees the containing structure.

In TLS_DEVICE we will need to release the software structures without
freeeing the containing structure, which contains other information.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:11 -07:00
Boris Pismenny
dafb67f3bb tls: Split decrypt_skb to two functions
Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.

Later, in the tls_device Rx flow, we will use decrypt_skb directly.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:13:10 -07:00
Boris Pismenny
d80a1b9d18 tls: Refactor tls_offload variable names
For symmetry, we rename tls_offload_context to
tls_offload_context_tx before we add tls_offload_context_rx.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
Boris Pismenny
41ed9c04aa tcp: Don't coalesce decrypted and encrypted SKBs
Prevent coalescing of decrypted and encrypted SKBs in GRO
and TCP layer.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
Boris Pismenny
16e4edc297 net: Add TLS rx resync NDO
Add new netdev tls op for resynchronizing HW tls context

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
Ilya Lesokhin
14136564c8 net: Add TLS RX offload feature
This patch adds a netdev feature to configure TLS RX inline crypto offload.

Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
Boris Pismenny
784abe24c9 net: Add decrypted field to skb
The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:12:09 -07:00
David S. Miller
cc98419a57 Merge branch 'mvpp2-add-debugfs-interface'
Maxime Chevallier says:

====================
net: mvpp2: add debugfs interface

The PPv2 Header Parser and Classifier are not straightforward to debug,
having easy access to some of the many lookup tables configuration is
helpful during development and debug.

This series adds a basic debugfs interface, allowing to read data from
the Header Parser and some of the Classifier tables.

For now, the interface is read-only, and contains only some basic info.

This was actually used during RSS development, and might be useful to
troubleshoot some issues we might find.

The first patch of the series converts the mvpp2 files to SPDX, which
eases adding the new debugfs dedicated file.

The second patch adds the interface, and exposes basic Header Parser data.

The 3rd patch adds a hit counter for the Header Parser TCAM.

The 4th patch exposes classifier info.

The 5th patch adds some hit counters for some of the classifier engines.

Changes since V1:
- Rebased on the lastest net-next
- Made cls_flow_get non static so that it can be used in mvpp2_debugfs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:01 -07:00
Maxime Chevallier
f9d30d5bd5 net: mvpp2: debugfs: add classifier hit counters
The classification operations that are used for RSS make use of several
lookup tables. Having hit counters for these tables is really helpful
to determine what flows were matched by ingress traffic, and see the
path of packets among all the classifier tables.

This commit adds hit counters for the 3 tables used at the moment :

 - The decoding table (also called lookup_id table), that links flows
   identified by the Header Parser to the flow table.

   There's one entry per flow, located at :
   .../mvpp2/<controller>/flows/XX/dec_hits

   Note that there are 21 flows in the decoding table, whereas there are
   52 flows in the Header Parser. That's because there are several kind
   of traffic that will match a given flow. Reading the hit counter from
   one sub-flow will clear all hit counter that have the same flow_id.

   This also applies to the flow_hits.

 - The flow table, that contains all the different lookups to be
   performed by the classifier for each packet of a given flow. The match
   is done on the first entry of the flow sequence.

 - The C2 engine entries, that are used to assign the default rx queue,
   and enable or disable RSS for a given port.

   There's one entry per flow, located at:
   .../mvpp2/<controller>/flows/XX/flow_hits

   There is one C2 entry per port, so the c2 hit counter is located at :
   .../mvpp2/<controller>/ethX/c2_hits

All hit counter values are 16-bits clear-on-read values.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:01 -07:00
Maxime Chevallier
dba1d918da net: mvpp2: debugfs: add entries for classifier flows
The classifier configuration for RSS is quite complex, with several
lookup tables being used. This commit adds useful info in debugfs to
see how the different tables are configured :

Added 2 new entries in the per-port directory :

  - .../eth0/default_rxq : The default rx queue on that port
  - .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry

Added the 'flows' directory :

  It contains one entry per sub-flow. a 'sub-flow' is a unique path from
  Header Parser to the flow table. Multiple sub-flows can point to the
  same 'flow' (each flow has an id from 8 to 29, which is its index in the
  Lookup Id table) :

  - .../flows/00/...
             /01/...
             ...
             /51/id : The flow id. There are 21 unique flows. There's one
                       flow per combination of the following parameters :
                       - L4 protocol (TCP, UDP, none)
                       - L3 protocol (IPv4, IPv6)
                       - L3 parameters (Fragmented or not)
                       - L2 parameters (Vlan tag presence or not)
              .../type : The flow type. This is an even higher level flow,
                         that we manipulate with ethtool. It can be :
                         "udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other".
              .../eth0/...
              .../eth1/engine : The hash generation engine used for this
	                        flow on the given port
                  .../hash_opts : The hash generation options indicating on
                                  what data we base the hash (vlan tag, src
                                  IP, src port, etc.)

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:01 -07:00
Maxime Chevallier
1203341cc9 net: mvpp2: debugfs: add hit counter stats for Header Parser entries
One helpful feature to help debug the Header Parser TCAM filter in PPv2
is to be able to see if the entries did match something when a packet
comes in. This can be done by using the built-in hit counter for TCAM
entries.

This commit implements reading the counter, and exposing its value on
debugfs for each filter entry.

The counter is a 16-bits clear-on-read value, located at:
 .../mvpp2/<controller>/parser/XXX/hits

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:01 -07:00
Maxime Chevallier
21da57a231 net: mvpp2: add a debugfs interface for the Header Parser
Marvell PPv2 Packer Header Parser has a TCAM based filter, that is not
trivial to configure and debug. Being able to dump TCAM entries from
userspace can be really helpful to help development of new features
and debug existing ones.

This commit adds a basic debugfs interface for the PPv2 driver, focusing
on TCAM related features.

<mnt>/mvpp2/ --- f2000000.ethernet
              \- f4000000.ethernet --- parser --- 000 ...
                                    |          \- 001
                                    |          \- ...
                                    |          \- 255 --- ai
                                    |                  \- header_data
                                    |                  \- lookup_id
                                    |                  \- sram
                                    |                  \- valid
                                    \- eth1 ...
                                    \- eth2 --- mac_filter
                                             \- parser_entries
                                             \- vid_filter

There's one directory per PPv2 instance, named after pdev->name to make
sure names are uniques. In each of these directories, there's :

 - one directory per interface on the controller, each containing :

   - "mac_filter", which lists all filtered addresses for this port
     (based on TCAM, not on the kernel's uc / mc lists)

   - "parser_entries", which lists the indices of all valid TCAM
      entries that have this port in their port map

   - "vid_filter", which lists the vids allowed on this port, based on
     TCAM

 - one "parser" directory (the parser is common to all ports), containing :

   - one directory per TCAM entry (256 of them, from 0 to 255), each
     containing :

     - "ai" : Contains the 1 byte Additional Info field from TCAM, and

     - "header_data" : Contains the 8 bytes Header Data extracted from
       the packet

     - "lookup_id" : Contains the 4 bits LU_ID

     - "sram" : contains the raw SRAM data, which is the result of the TCAM
		lookup. This readonly at the moment.

     - "valid" : Indicates if the entry is valid of not.

All entries are read-only, and everything is output in hex form.

Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:00 -07:00
Antoine Tenart
f1e37e3101 net: mvpp2: switch to SPDX identifiers
Use the appropriate SPDX license identifiers and drop the license text.
This patch is only cosmetic.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16 00:10:00 -07:00
Greg Kroah-Hartman
500f0716b5 Merge 4.18-rc5 into usb-next
We need the USB fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-16 09:09:24 +02:00
Greg Kroah-Hartman
956f004a04 Merge 4.18-rc5 into staging-next
We need the staging fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-16 09:06:57 +02:00
Greg Kroah-Hartman
83cf9cd6d5 Merge 4.18-rc5 into char-misc-next
We want the char-misc fixes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-16 09:04:54 +02:00
Nicholas Piggin
2bf1071a8d powerpc/64s: Remove POWER9 DD1 support
POWER9 DD1 was never a product. It is no longer supported by upstream
firmware, and it is not effectively supported in Linux due to lack of
testing.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
[mpe: Remove arch_make_huge_pte() entirely]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-16 11:37:21 +10:00
Dave Airlie
bf642e3a19 - GVT fix for KBL vGPU hang to update virtual register from LRI.
- Fix hotplug irq ack on i965/g4x (Ville)
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbSE5yAAoJEPpiX2QO6xPKI6kIALfg7r2e7nAv0wOmjfQSrE0o
 8wNseKgizWzv4XF0MqYB6l1fFpQddhOLRhcgcPg9LwDhYvUjOs2PvMoRY1c5g9r6
 0Luvcg/gzKG+BVhIIky5GnUpUaPHatAwgSKJ6sV8cwqkplt3eCd/pka+q0eGqOTa
 t0ko7ZjRVWGdeVh8A59EzlBfEgxZkWw0B7pojMCFHQ6GlL10cCtwOnEyIv+JvzuS
 l+pVsGVwcKh8v9Ngi5+MSGFPHieRFKdi+WbI3V+0Bm+VBT2LjZTG+ne9WBV75sKI
 /KiMEi+1SdEIhjaJpJsSziqzN9zvyJAnsxBIkoiYW3Z7jdOav2rC1vZWt9kCdv0=
 =+bAc
 -----END PGP SIGNATURE-----

Merge tag 'drm-intel-fixes-2018-07-12' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

I already pulled the first fix, pull the GVT fixes.

- GVT fix for KBL vGPU hang to update virtual register from LRI.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180713070922.GA19840@intel.com
2018-07-16 10:32:28 +10:00
Dave Airlie
a929b32537 Merge branch 'drm-armada-fixes' of git://git.armlinux.org.uk/~rmk/linux-arm into drm-fixes
Two armada fixes.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180713075427.GA16160@rmk-PC.armlinux.org.uk
2018-07-16 09:57:35 +10:00
Dave Airlie
990187537b Fixes for v4.18-rc5:
- Single fix for a build error when the driver is builtin,
   but the backend is a loadable module.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAltIVAQACgkQ/lWMcqZw
 E8NmGBAAsD4G2E5Z2xlGdi22+jxgabpk7qmud3+obkaEAlADP2Tdho7pRPyW5Nri
 G9fxWv1Qeio8TJ0ka/vsAlF9fc54ovavP1wDNNveuZPg6GHzqJD/zZS99i+npYKl
 kAVS8KC6FKrOcGgk4DFmypKhlrsgGupV5sJ9EpRmX63dIdoCIj4EAuDqY/3cMW/e
 nwPG1TLvDeo3TR5vCj9dbXLJDHcKSdMflz0SLeZAzJSbo/GP1kv2JEvrSTczvdSE
 OBY2Y2BFLbWbxWCW0nv4EwlziMY7Y+nCO1rbU2YtkdfB7VMcLPJ2Y9pYe56RoeW+
 /sKarC8SaUTi+RTpxQxWoItxmPL8uBKoNmYig9tvcgcx2jmXtmMDbCVEwl5RXMFo
 6ZSLHhLoJCLUKTeKTjR4rYSpHihwXuaWVUD8Z//7wnnbwMK78urcst0TJHH1cQS0
 38wwNzUFtfrDbx9TF8V0qEKdswPNCiYNr7k1iGDBPQNjCatAvm0R6Lhph0F8Xj0U
 Ra3+DKVM95eS1nlkxjbzFL0L/9bR/NEBtXi5bEgxlnnJFsnkup/PMXqBh3h9s3OB
 HvYsj+Z+hXkMpUDX3waax+6hXYCa/RxSknLV+OZHnI8KzfGIIVnwM8WIxXcGvmz8
 ZgyGe0hAgEPntD7PIxFur/AsXqOg8sPoFcohFiL7Tanr4Sflk8Y=
 =It28
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2018-07-13' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Fixes for v4.18-rc5:
- Single fix for a build error when the driver is builtin,
  but the backend is a loadable module.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/9c596cf5-3f24-070e-74f2-c59bfbaf68fa@linux.intel.com
2018-07-16 09:51:01 +10:00
Dave Airlie
2757de4c09 drm/tegra: Fixes for v4.18-rc5
This contains a couple of one- or two-line fixes for various minor
 issues in the Tegra driver.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEiOrDCAFJzPfAjcif3SOs138+s6EFAltG/EMTHHRyZWRpbmdA
 bnZpZGlhLmNvbQAKCRDdI6zXfz6zoTXAD/9Qe1TWImaE7htgFgB4iatzjDf+Voph
 j0ikofcD2fH+hQObnh46nruhxhBR5/pbSePN7WL5tLPNfJ9rSYXNzahqk35WH2fm
 rmG7F2I6lTezaScrgHKwf8YuVS0ioCd1McNp2gLr8tQ2TqzRSEOXZ19+YiOFzUNj
 IeIu0J986shDa573/+6ILK9xW2D4zxJYSZbr9+894sDoA49rpTjvOnKVvKJuyt+2
 HmyDHANw8imQLMlVhYo0LKPUiUSAuh5Sx+ZanxhAtpGqCEbX0sim6DvAcWORF5aA
 GGyEBz1gbZJy/EfTB8Sze+yEz0hNcbGSzxsJYVqfBlnE1VYf9egMOzPa5kjeUXJ+
 CcyRMORComECEHSg8xtMpD2tWKzDuhrqmg2eUR1YBfHqVawfxPXgdGe1TfsbQiVd
 fznMzX6ps55bp8z2u58R3M37fFf0H4fndGF3cELD4avZrub+YIIvRrJ1KQQV8NNT
 6DZoGBy5ybfaVYuchZoFBfHeB5pvIKAV1SyBrWoB3fjKbIMazVzXRSfsiQLpaIoG
 3fcV/lzrRLCz5gDmwbecG4YAhYhgsseR4NCfO6u6vwOZ12udD06vFLbAVY0pEr81
 zk02z/NmuVfkUNl88ov6rcP7qDg2pcXm2vW84595+lX+IYJnQ+JypwBwCZ96L3hd
 AsUJDTSwBvUZcw==
 =eBgW
 -----END PGP SIGNATURE-----

Merge tag 'drm/tegra/for-4.18-rc5' of git://anongit.freedesktop.org/tegra/linux into drm-fixes

drm/tegra: Fixes for v4.18-rc5

This contains a couple of one- or two-line fixes for various minor
issues in the Tegra driver.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180712070142.15571-1-thierry.reding@gmail.com
2018-07-16 09:49:22 +10:00
Dave Airlie
e280057762 Merge branch 'drm-fixes-4.18' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
A few display and GPUVM fixes for 4.18.

A few more fixes for 4.18. Two display fixes and a fix to avoid a segfault if
the GPU does not power up properly on resume.  These are on top of my pull
from earlier this week.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180712043820.2877-1-alexander.deucher@amd.com
2018-07-16 09:46:21 +10:00
Dave Airlie
f88147e4e1 - Fix hotplug irq ack on i965/g4x (Ville)
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJbRSVYAAoJEPpiX2QO6xPKi9IH/ijPmDmo74CkIlvBPHvRaS6q
 rx19c2XMumzekjkl+UEGkkGSHEImuiuVG69HkyvmOY50gJtu1kkImzwSZOox4YJ+
 GpWefB1LrIjg6anb5q7JP6GpiXmTIbRCO/JZLiMLpOfGUHAgEorVShYwPPKFwyHL
 N6GBraWnVZAdJYTLAnaqGDmBA3JxhknjDrnLNgtPb6QXaMHp+OSpCT6I/tSYWbY4
 nw3EipbTcmHeMX2ngwfHGNR8xwvfIsiplQDUH2xxUqFSLg+CkQoHl8ZfDzI6N0+b
 wQYBH2j53AHccKxek6tpdi3g3sX/tF86UBEAPwgjRp3j8R8fhRigWr8lgldY7Yw=
 =CnuU
 -----END PGP SIGNATURE-----

Merge tag 'drm-intel-fixes-2018-07-10' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes

- Fix hotplug irq ack on i965/g4x (Ville)

Signed-off-by: Dave Airlie <airlied@redhat.com>

Link: https://patchwork.freedesktop.org/patch/msgid/20180710213249.GA16479@intel.com
2018-07-16 09:43:51 +10:00
Ard Biesheuvel
38ac0287b7 fbdev/efifb: Honour UEFI memory map attributes when mapping the FB
If the framebuffer address provided by the Graphics Output Protocol
(GOP) is covered by the UEFI memory map, it will tell us which memory
attributes are permitted when mapping this region. In some cases,
(KVM guest on ARM), violating this will result in loss of coherency,
which means that updates sent to the framebuffer by the guest will
not be observeable by the host, and the emulated display simply does
not work.

So if the memory map contains such a description, take the attributes
field into account, and add support for creating WT or WB mappings of
the framebuffer region.

Tested-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-9-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Ard Biesheuvel
7e1550b8f2 efi: Drop type and attribute checks in efi_mem_desc_lookup()
The current implementation of efi_mem_desc_lookup() includes the
following check on the memory descriptor it returns:

    if (!(md->attribute & EFI_MEMORY_RUNTIME) &&
        md->type != EFI_BOOT_SERVICES_DATA &&
        md->type != EFI_RUNTIME_SERVICES_DATA) {
            continue;
    }

This means that only EfiBootServicesData or EfiRuntimeServicesData
regions are considered, or any other region type provided that it
has the EFI_MEMORY_RUNTIME attribute set.

Given what the name of the function implies, and the fact that any
physical address can be described in the UEFI memory map only a single
time, it does not make sense to impose this condition in the body of the
loop, but instead, should be imposed by the caller depending on the value
that is returned to it.

Two such callers exist at the moment:

- The BGRT code when running on x86, via efi_mem_reserve() and
  efi_arch_mem_reserve(). In this case, the region is already known to
  be EfiBootServicesData, and so the check is redundant.

- The ESRT handling code which introduced this function, which calls it
  both directly from efi_esrt_init() and again via efi_mem_reserve() and
  efi_arch_mem_reserve() [on x86].

So let's move this check into the callers instead. This preserves the
current behavior both for BGRT and ESRT handling, and allows the lookup
routine to be reused by other [upcoming] users that don't have this
limitation.

In the ESRT case, keep the entire condition, so that platforms that
deviate from the UEFI spec and use something other than
EfiBootServicesData for the ESRT table will keep working as before.

For x86's efi_arch_mem_reserve() implementation, limit the type to
EfiBootServicesData, since it is the only type the reservation code
expects to operate on in the first place.

While we're at it, drop the __init annotation so that drivers can use it
as well.

Tested-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Jones <pjones@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-8-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Ard Biesheuvel
3d7ee348aa efi/libstub/arm: Add opt-in Kconfig option for the DTB loader
There are various ways a platform can provide a device tree binary
to the kernel, with different levels of sophistication:

- ideally, the UEFI firmware, which is tightly coupled with the
  platform, provides a device tree image directly as a UEFI
  configuration table, and typically permits the contents to be
  manipulated either via menu options or via UEFI environment
  variables that specify a replacement image,

- GRUB for ARM has a 'devicetree' directive which allows a device
  tree image to be loaded from any location accessible to GRUB, and
  supersede the one provided by the firmware,

- the EFI stub implements a dtb= command line option that allows a
  device tree image to be loaded from a file residing in the same
  file system as the one the kernel image was loaded from.

The dtb= command line option was never intended to be more than a
development feature, to allow the other options to be implemented
in parallel. So let's make it an opt-in feature that is disabled
by default, but can be re-enabled at will.

Note that we already disable the dtb= command line option when we
detect that we are running with UEFI Secure Boot enabled.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Acked-by: Leif Lindholm <leif.lindholm@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-7-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Sai Praneeth
f5dcc214aa efi: Remove the declaration of efi_late_init() as the function is unused
The following commit:

  7b0a911478 ("efi/x86: Move the EFI BGRT init code to early init code")

... removed the implementation and all the references to
efi_late_init() but the function is still declared at
include/linux/efi.h.

Hence, remove the unnecessary declaration.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-6-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Arnd Bergmann
7bb497092a efi/cper: Avoid using get_seconds()
get_seconds() is deprecated because of the 32-bit time overflow
in y2038/y2106 on 32-bit architectures. The way it is used in
cper_next_record_id() causes an overflow in 2106 when unsigned UTC
seconds overflow, even on 64-bit architectures.

This starts using ktime_get_real_seconds() to give us more than 32 bits
of timestamp on all architectures, and then changes the algorithm to use
39 bits for the timestamp after the y2038 wrap date, plus an always-1
bit at the top. This gives us another 127 epochs of 136 years, with
strictly monotonically increasing sequence numbers across boots.

This is almost certainly overkill, but seems better than just extending
the deadline from 2038 to 2106.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-5-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Sai Praneeth
3eb420e70d efi: Use a work queue to invoke EFI Runtime Services
Presently, when a user process requests the kernel to execute any
UEFI runtime service, the kernel temporarily switches to a separate
set of page tables that describe the virtual mapping of the UEFI
runtime services regions in memory. Since UEFI runtime services are
typically invoked with interrupts enabled, any code that may be called
during this time, will have an incorrect view of the process's address
space. Although it is unusual for code running in interrupt context to
make assumptions about the process context it runs in, there are cases
(such as the perf subsystem taking samples) where this causes problems.

So let's set up a work queue for calling UEFI runtime services, so that
the actual calls are made when the work queue items are dispatched by a
work queue worker running in a separate kernel thread. Such threads are
not expected to have userland mappings in the first place, and so the
additional mappings created for the UEFI runtime services can never
clash with any.

The ResetSystem() runtime service is not covered by the work queue
handling, since it is not expected to return, and may be called at a
time when the kernel is torn down to the point where we cannot expect
work queues to still be operational.

The non-blocking variants of SetVariable() and QueryVariableInfo()
are also excluded: these are intended to be used from atomic context,
which obviously rules out waiting for a completion to be signalled by
another thread. Note that these variants are currently only used for
UEFI runtime services calls that occur very early in the boot, and
for ones that occur in critical conditions, e.g., to flush kernel logs
to UEFI variables via efi-pstore.

Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
[ardb: exclude ResetSystem() from the workqueue treatment
       merge from 2 separate patches and rewrite commit log]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-4-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Sai Praneeth
5a58bc1b1e efi/x86: Use non-blocking SetVariable() for efi_delete_dummy_variable()
Presently, efi_delete_dummy_variable() uses set_variable() which might
block, which the scheduler is rightfully upset about when used from
the idle thread, producing this splat:

  "bad: scheduling from the idle thread!"

So, make efi_delete_dummy_variable() use set_variable_nonblocking(),
which, as the name suggests, doesn't block.

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-3-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:12 +02:00
Ingo Molnar
90a2186b7d efi/x86: Clean up the eboot code
Various small cleanups:

 - Standardize printk messages:

     'alloc' => 'allocate'
     'mem'   => 'memory'

   also put variable names in printk messages between quotes.

 - Align mass-assignments vertically for better readability

 - Break multi-line function prototypes at the name where possible,
   not in the middle of the parameter list

 - Use a newline before return statements consistently.

 - Use curly braces in a balanced fashion.

 - Remove stray newlines.

No change in functionality.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180711094040.12506-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:43:05 +02:00
Masahiro Yamada
bdc9c3e5ed x86/build: Remove old -funit-at-a-time GCC quirk
The following commit:

  e501ce957a ("x86: Force asm-goto")

... bumped the minimum GCC version to 4.5 for building the x86 kernel.

arch/x86/Makefile no longer needs to take care of older GCC versions,
such as this pre-4.0 -funit-at-a-time quirk.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1531138041-24200-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:25:10 +02:00
Tobias Tefke
788faab70d perf, tools: Use correct articles in comments
Some of the comments in the perf events code use articles incorrectly,
using 'a' for words beginning with a vowel sound, where 'an' should be
used.

Signed-off-by: Tobias Tefke <tobias.tefke@tutanota.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/20180709105715.22938-1-tobias.tefke@tutanota.com
[ Fix a few more perf related 'a event' typo fixes from all around the kernel and tooling tree. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:21:03 +02:00
Sebastian Andrzej Siewior
af0fffd930 sched/core: Remove get_cpu() from sched_fork()
get_cpu() disables preemption for the entire sched_fork() function.
This get_cpu() was introduced in commit:

  dd41f596cd ("sched: cfs core code")

... which also invoked sched_balance_self() and this function
required preemption do be off.

Today, sched_balance_self() seems to be moved to ->task_fork callback
which is invoked while the ->pi_lock is held.

set_load_weight() could invoke reweight_task() which then via $callchain
might end up in smp_processor_id() but since `update_load' is false
this won't happen.

I didn't find any this_cpu*() or similar usage during the initialisation
of the task_struct.

The `cpu' value (from get_cpu()) is only used later in __set_task_cpu()
while the ->pi_lock lock is held.

Based on this it is possible to remove get_cpu() and use
smp_processor_id() for the `cpu' variable without breaking anything.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180706130615.g2ex2kmfu5kcvlq6@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Peter Zijlstra
45f5519ec5 sched/cpufreq: Clarify sugov_get_util()
Add a few comments to (hopefully) clarifying some of the magic in
sugov_get_util().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/20180705123617.GM2458@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
5fd778915a sched/sysctl: Remove unused sched_time_avg_ms sysctl
/proc/sys/kernel/sched_time_avg_ms entry is not used anywhere,
remove it.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-12-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
bbb62c0b02 sched/core: Remove the rt_avg code
rt_avg is not used anywhere anymore, so we can remove all related code.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:29 +02:00
Vincent Guittot
523e979d31 sched/core: Use PELT for scale_rt_capacity()
The utilization of the CPU by RT, DL and IRQs are now tracked with
PELT so we can use these metrics instead of rt_avg to evaluate the remaining
capacity available for CFS class.

scale_rt_capacity() behavior has been changed and now returns the remaining
capacity available for CFS instead of a scaling factor because RT, DL and
IRQ provide now absolute utilization value.

The same formula as schedutil is used:

  IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg

but the implementation is different because it doesn't return the same value
and doesn't benefit of the same optimization.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:16:25 +02:00
Dan Williams
092b31aa20 x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
All copy_to_user() implementations need to be prepared to handle faults
accessing userspace. The __memcpy_mcsafe() implementation handles both
mmu-faults on the user destination and machine-check-exceptions on the
source buffer. However, the memcpy_mcsafe() wrapper may silently
fallback to memcpy() depending on build options and cpu-capabilities.

Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
available, and otherwise disable all of the copy_to_user_mcsafe()
infrastructure when __memcpy_mcsafe() is not available, i.e.
CONFIG_X86_MCE=n.

This fixes crashes of the form:
    run fstests generic/323 at 2018-07-02 12:46:23
    BUG: unable to handle kernel paging request at 00007f0d50001000
    RIP: 0010:__memcpy+0x12/0x20
    [..]
    Call Trace:
     copyout_mcsafe+0x3a/0x50
     _copy_to_iter_mcsafe+0xa1/0x4a0
     ? dax_alive+0x30/0x50
     dax_iomap_actor+0x1f9/0x280
     ? dax_iomap_rw+0x100/0x100
     iomap_apply+0xba/0x130
     ? dax_iomap_rw+0x100/0x100
     dax_iomap_rw+0x95/0x100
     ? dax_iomap_rw+0x100/0x100
     xfs_file_dax_read+0x7b/0x1d0 [xfs]
     xfs_file_read_iter+0xa7/0xc0 [xfs]
     aio_read+0x11c/0x1a0

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Fixes: 8780356ef6 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:05:05 +02:00
Dan Williams
ca146f6f09 lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()
By mistake the ITER_PIPE early-exit / warning from copy_from_iter() was
cargo-culted in _copy_to_iter_mcsafe() rather than a machine-check-safe
version of copy_to_iter_pipe().

Implement copy_pipe_to_iter_mcsafe() being careful to return the
indication of short copies due to a CPU exception.

Without this regression-fix all splice reads to dax-mode files fail.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Fixes: 8780356ef6 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
Link: http://lkml.kernel.org/r/153108277278.37979.3327916996902264102.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:05:05 +02:00
Dan Williams
abd08d7d24 lib/iov_iter: Document _copy_to_iter_flushcache()
Add some theory of operation documentation to _copy_to_iter_flushcache().

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/153108276767.37979.9462477994086841699.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:05:05 +02:00
Dan Williams
bf3eeb9b5f lib/iov_iter: Document _copy_to_iter_mcsafe()
Add some theory of operation documentation to _copy_to_iter_mcsafe().

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/153108276256.37979.1689794213845539316.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16 00:05:05 +02:00
Vincent Guittot
dfa444dc2f sched/cpufreq: Remove sugov_aggregate_util()
There is no reason why sugov_get_util() and sugov_aggregate_util()
were in fact separate functions.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[ Rebased after adding irq tracking and fixed some compilation errors. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
9033ea1188 cpufreq/schedutil: Take time spent in interrupts into account
The time spent executing IRQ handlers can be significant but it is not reflected
in the utilization of CPU when deciding to choose an OPP. Now that we have
access to this metric, schedutil can take it into account when selecting
the OPP for a CPU.

RQS utilization don't see the time spend under interrupt context and report
their value in the normal context time window. We need to compensate this when
adding interrupt utilization

The CPU utilization is:

  IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg

A test with iperf on hikey (octo arm64) gives the following speedup:

 iperf -c server_address -r -t 5

 w/o patch		w/ patch
 Tx 276 Mbits/sec	304 Mbits/sec +10%
 Rx 299 Mbits/sec	328 Mbits/sec  +9%

 8 iterations
 stdev is lower than 1%

Only WFI idle state is enabled (shallowest idle state).

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
91c27493e7 sched/irq: Add IRQ utilization tracking
interrupt and steal time are the only remaining activities tracked by
rt_avg. Like for sched classes, we can use PELT to track their average
utilization of the CPU. But unlike sched class, we don't track when
entering/leaving interrupt; Instead, we take into account the time spent
under interrupt context when we update rqs' clock (rq_clock_task).
This also means that we have to decay the normal context time and account
for interrupt time during the update.

That's also important to note that because:

  rq_clock == rq_clock_task + interrupt time

and rq_clock_task is used by a sched class to compute its utilization, the
util_avg of a sched class only reflects the utilization of the time spent
in normal context and not of the whole time of the CPU. The utilization of
interrupt gives an more accurate level of utilization of CPU.

The CPU utilization is:

  avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq

Most of the time, avg_irq is small and neglictible so the use of the
approximation CPU utilization = /Sum avg_rq was enough.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
8cc90515a4 cpufreq/schedutil: Use DL utilization tracking
Now that we have both the DL class bandwidth requirement and the DL class
utilization, we can detect when CPU is fully used so we should run at max.
Otherwise, we keep using the DL bandwidth requirement to define the
utilization of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:21 +02:00
Vincent Guittot
3727e0e163 sched/dl: Add dl_rq utilization tracking
Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
tasks and the CFS's utilization might no longer describes the real
utilization level.

Current DL bandwidth reflects the requirements to meet deadline when tasks are
enqueued but not the current utilization of the DL sched class. We track
DL class utilization to estimate the system utilization.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Cc: viresh.kumar@linaro.org
Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:20 +02:00
Vincent Guittot
3ae117c6cd cpufreq/schedutil: Use RT utilization tracking
Add both CFS and RT utilization when selecting an OPP for CFS tasks as RT
can preempt and steal CFS's running time.

RT util_avg is used to take into account the utilization of RT tasks
on the CPU when selecting OPP. If a RT task migrate, the RT utilization
will not migrate but will decay over time. On an overloaded CPU, CFS
utilization reflects the remaining utilization avialable on CPU. When RT
task migrates, the CFS utilization will increase when tasks will start to
use the newly available capacity. At the same pace, RT utilization will
decay and both variations will compensate each other to keep unchanged
overall utilization and will prevent any OPP drop.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: claudio@evidence.eu.com
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: joel@joelfernandes.org
Cc: juri.lelli@redhat.com
Cc: luca.abeni@santannapisa.it
Cc: patrick.bellasi@arm.com
Cc: quentin.perret@arm.com
Cc: rjw@rjwysocki.net
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15 23:51:20 +02:00