Commit graph

68773 commits

Author SHA1 Message Date
Kristian Klausen
d507a54f58 platform/x86: asus-wmi: Add support for charge threshold
Most newer ASUS laptops supports limiting the battery charge level, which
help prolonging the battery life.

Tested on a Zenbook UX430UNR.

Signed-off-by: Kristian Klausen <kristian@klausen.dk>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2019-08-16 12:38:48 +03:00
Sudarsana Reddy Kalluru
0dabbe1bb3 qed: Add driver API for flashing the config attributes.
The patch adds driver interface for reading the config attributes from user
provided buffer, and updates these values on nvm config flash partition.

This is basically an expansion of our existing ethtool -f implementation.
The management FW has exposed an additional method of configuring some of
the nvram options, and this makes use of that. This implementation will
come into use when newer FW files which contain configuration directives
employing this API will be provided to ethtool -f.

Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15 12:54:45 -07:00
Arnd Bergmann
d64a1fd852 Merge branch 'lpc32xx/multiplatform' into arm/soc
I revisited some older patches here, getting two of the remaining
ARM platforms to build with ARCH_MULTIPLATFORM like most others do.

In case of lpc32xx, I created a new set of patches, which seemed
easier than digging out what I did for an older release many
years ago.

* lpc32xx/multiplatform:
  ARM: lpc32xx: allow multiplatform build
  ARM: lpc32xx: clean up header files
  serial: lpc32xx: allow compile testing
  net: lpc-enet: allow compile testing
  net: lpc-enet: fix printk format strings
  net: lpc-enet: fix badzero.cocci warnings
  net: lpc-enet: move phy setup into platform code
  net: lpc-enet: factor out iram access
  gpio: lpc32xx: allow building on non-lpc32xx targets
  serial: lpc32xx_hs: allow compile-testing
  watchdog: pnx4008_wdt: allow compile-testing
  usb: udc: lpc32xx: allow compile-testing
  usb: ohci-nxp: enable compile-testing

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-15 21:35:22 +02:00
Arnd Bergmann
ffba29c9eb serial: lpc32xx: allow compile testing
The lpc32xx_loopback_set() function in hte lpc32xx_hs driver is the
one thing that relies on platform header files. Move that into the
core platform code so we only need a variable declaration for it,
and enable COMPILE_TEST building.

Link: https://lore.kernel.org/r/20190809144043.476786-12-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-15 21:34:02 +02:00
Arnd Bergmann
ecca1a6277 net: lpc-enet: move phy setup into platform code
Setting the phy mode requires touching a platform specific
register, which prevents us from building the driver without
its header files.

Move it into a separate function in arch/arm/mach/lpc32xx
to hide the core registers from the network driver.

Link: https://lore.kernel.org/r/20190809144043.476786-8-arnd@arndb.de
Acked-by: Sylvain Lemieux <slemieux.tyco@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-15 21:33:52 +02:00
Arnd Bergmann
9dc03ffd99 net: lpc-enet: factor out iram access
The lpc_eth driver uses a platform specific method to find
the internal sram. This prevents building it on other machines.

Rework to only use one function call and keep the other platform
internals where they belong. Ideally this would look up the
sram location from DT, but as this is a rarely used driver,
I want to keep the modifications to a minimum.

Link: https://lore.kernel.org/r/20190809144043.476786-7-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-15 21:33:49 +02:00
Sylwester Nawrocki
40d8aff614 soc: samsung: chipid: Convert exynos-chipid driver to use the regmap API
Convert the driver to use regmap API in order to allow other
drivers, like ASV, to access the CHIPID registers.

Add definition of selected CHIPID register offsets and register bit
fields for Exynos5422 SoC.

Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com>
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2019-08-15 20:25:25 +02:00
Jens Axboe
7b6620d7db block: remove REQ_NOWAIT_INLINE
We had a few issues with this code, and there's still a problem around
how we deal with error handling for chained/split bios. For now, just
revert the code and we'll try again with a thoroug solution. This
reverts commits:

e15c2ffa10 ("block: fix O_DIRECT error handling for bio fragments")
0eb6ddfb86 ("block: Fix __blkdev_direct_IO() for bio fragments")
6a43074e2f ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
893a1c9720 ("blk-mq: allow REQ_NOWAIT to return an error inline")

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-15 11:09:16 -06:00
Christoph Hellwig
edfbcb321f usb: add a hcd_uses_dma helper
The USB buffer allocation code is the only place in the usb core (and in
fact the whole kernel) that uses is_device_dma_capable, while the URB
mapping code uses the uses_dma flag in struct usb_bus.  Switch the buffer
allocation to use the uses_dma flag used by the rest of the USB code,
and create a helper in hcd.h that checks this flag as well as the
CONFIG_HAS_DMA to simplify the caller a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20190811080520.21712-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-15 15:18:05 +02:00
Linus Walleij
fdd61a013a gpio: Add support for hierarchical IRQ domains
Hierarchical IRQ domains can be used to stack different IRQ
controllers on top of each other.

Bring hierarchical IRQ domains into the GPIOLIB core with the
following basic idea:

Drivers that need their interrupts handled hierarchically
specify a callback to translate the child hardware IRQ and
IRQ type for each GPIO offset to a parent hardware IRQ and
parent hardware IRQ type.

Users have to pass the callback, fwnode, and parent irqdomain
before calling gpiochip_irqchip_add().

We use the new method of just filling in the struct
gpio_irq_chip before adding the gpiochip for all hierarchical
irqchips of this type.

The code path for device tree is pretty straight-forward,
while the code path for old boardfiles or anything else will
be more convoluted requireing upfront allocation of the
interrupts when adding the chip.

One specific use-case where this can be useful is if a power
management controller has top-level controls for wakeup
interrupts. In such cases, the power management controller can
be a parent to other interrupt controllers and program
additional registers when an IRQ has its wake capability
enabled or disabled.

The hierarchical irqchip helper code will only be available
when IRQ_DOMAIN_HIERARCHY is selected to GPIO chips using
this should select or depend on that symbol. When using
hierarchical IRQs, the parent interrupt controller must
also be hierarchical all the way up to the top interrupt
controller wireing directly into the CPU, so on systems
that do not have this we can get rid of all the extra
code for supporting hierarchical irqs.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: Sowjanya Komatineni <skomatineni@nvidia.com>
Cc: Bitan Biswas <bbiswas@nvidia.com>
Cc: linux-tegra@vger.kernel.org
Cc: David Daney <david.daney@cavium.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Brian Masney <masneyb@onstation.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Brian Masney <masneyb@onstation.org>
Co-developed-by: Brian Masney <masneyb@onstation.org>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/20190808123242.5359-1-linus.walleij@linaro.org
2019-08-15 09:44:19 +02:00
Arnd Bergmann
738590a3fe ARM SCMI updates/fixes for v5.4
Handful of fixes/updates including:
 1. SCMI v2.0(recently released) support for:
 	- Performance protocol fast channels
 	- Reset Management Protocol
 2. SCMI infrastructure/core support for recieve(Rx) channels,
    asynchronous commands and delayed response
 3. Usage of asynchronous commands for clock rate setting and sensor
    reading based on the attributes read from the firmware
 4. Miscellaneous cleanups(typos, naming alignment with specification,
    and SPDX License identifier)
 5. Couple of fixes: removal of extra check for invalid length and
    additional check to ensure platform/firmware has released shared
    memory before using it in OSPM
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEunHlEgbzHrJD3ZPhAEG6vDF+4pgFAl1T52AACgkQAEG6vDF+
 4phUtA/+MQvUnOShrsO7pLcWm8cZd5OlLElX109tSnGtlzUXZgUk8H1ccAYqLqCV
 png1fkNOU1JhsJY8Jr0yK9ufgicbIcwZeGrR8QfRzDVBhiOxMhrmg/tSqJxmju2T
 6nnUE5/7Q4F4Hba8W+JF8x28aZ9zFgj2MHW6ARCm8ZoKqYUVu5j2yJhRzEqDypQ4
 ZOA8hAyyey3PZ6C4DNWqrxVvVxxG1ZmV9SmbtfKHVfj8cCXAbwn0Xi6XUAxTbGaU
 GsWZa4uUU+cI+7tOci3nfdGGgDOglmAVv6/O1lfQtU/M6tdwEShVRoZ0pD2cym/A
 Pqy/utxVPoyoEgKH7xqozbcLA0UXb9EpmpEzDux577QrQP8jtDnvSd4rihDSWXgd
 5wxOdwqS2LaFf2x745jndxvAqUTmf83ZmUeT5EGmdzM1AgvvoW6OJdjzpjr8QdTv
 uY9RTFEymXekJ/ndGanVYbO2+BnzjGo/pR/a7yZOLTrw+sPoUT2lBAt05jV4ivgN
 xcZUY/F/xOyuZbAIDo24TY4X1YUy+j2gSJpo8PFuc2E1cUTNasmIPVxLq8JAB6iq
 /JwArSk0CqkH3B2rshOHw24vxuHFTkszR1YvwHhiRF8q9Q0WWb/66ArdWybdLSZ2
 hGDHssv+m1pp5QM+1b8HSMkiLOnHnuE+a6qxkCCruS2wnxMKldQ=
 =Rnle
 -----END PGP SIGNATURE-----

Merge tag 'scmi-updates-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/drivers

ARM SCMI updates/fixes for v5.4

Handful of fixes/updates including:
1. SCMI v2.0(recently released) support for:
	- Performance protocol fast channels
	- Reset Management Protocol
2. SCMI infrastructure/core support for recieve(Rx) channels,
   asynchronous commands and delayed response
3. Usage of asynchronous commands for clock rate setting and sensor
   reading based on the attributes read from the firmware
4. Miscellaneous cleanups(typos, naming alignment with specification,
   and SPDX License identifier)
5. Couple of fixes: removal of extra check for invalid length and
   additional check to ensure platform/firmware has released shared
   memory before using it in OSPM

* tag 'scmi-updates-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux: (22 commits)
  reset: Add support for resets provided by SCMI
  firmware: arm_scmi: Add RESET protocol in SCMI v2.0
  dt-bindings: arm: Extend SCMI to support new reset protocol
  firmware: arm_scmi: Make use SCMI v2.0 fastchannel for performance protocol
  firmware: arm_scmi: Add discovery of SCMI v2.0 performance fastchannels
  firmware: arm_scmi: Use {get,put}_unaligned_le{32,64} accessors
  firmware: arm_scmi: Use asynchronous CLOCK_RATE_SET when possible
  firmware: arm_scmi: Drop config flag in clk_ops->rate_set
  firmware: arm_scmi: Add asynchronous sensor read if it supports
  firmware: arm_scmi: Drop async flag in sensor_ops->reading_get
  firmware: arm_scmi: Add support for asynchronous commands and delayed response
  firmware: arm_scmi: Add mechanism to unpack message headers
  firmware: arm_scmi: Separate out tx buffer handling and prepare to add rx
  firmware: arm_scmi: Add receive channel support for notifications
  firmware: arm_scmi: Segregate tx channel handling and prepare to add rx
  firmware: arm_scmi: Reorder some functions to avoid forward declarations
  firmware: arm_scmi: Check if platform has released shmem before using
  firmware: arm_scmi: Use the term 'message' instead of 'command'
  firmware: arm_scmi: Fix few trivial typos in comments
  firmware: arm_scmi: Remove extra check for invalid length message responses
  ...

Link: https://lore.kernel.org/r/20190814172454.26191-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-14 20:46:38 +02:00
Linus Torvalds
e83b009c5c dma-mapping fixes for 5.3-rc
- fix the handling of the bus_dma_mask in dma_get_required_mask, which
    caused a regression in this merge window (Lucas Stach)
  - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)
  - fix dma_mmap_coherent to not cause page attribute mismatches on
    coherent architectures like x86 (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl1UFhILHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOjexAAjPKLo4WGBGO1nd0btwXcI9A7jQTQlXrokmorDVzx
 5++GmTUBeEgvUJath5D3qpQTRZXo9Wb9oGMdS5U6bWJB+SbWtErM304t905TJoDM
 Cs7xcB1ZQeG/5OrQ+qGPgQCo6WO1dOl9FpaIptjNm4dn+OYhyO/YA+dgrJDwgkiA
 140RYUWa+Zhq3df4YqP4M4EnezLN1c4uE80wUxVQKDcq59sxCJek0QT0pUAMbdmQ
 /cUd2XSU113o1llmIRUh0Oj6VSEhWKHb+bdb8JfGndLzxvDcXZKl60tikWe6xpy2
 Ue0kkHRk6OPVRIxWkRjt8D+mlrCyNqN6HWx6eBmVnRKHxZ4ia2hYOFuYN9FFLLK+
 kCUlu5P/HUabBedKIxk4rbWITUqcRSviPD2WdnH2RWblvXNSDoSAufYuJ/9IGSoL
 P6a43DVKFesVF/MxeH9Ko8bnxMUO9Zn97GHcQIUplRwaqrnrCEPlvLVf/teswSQG
 C13rTnouZ0FA4z/uV96G6HfGIj87MLe/RovmLCMTeiSKrDpbcO7szP037Km73M+V
 UBmatoYCioVLxBjw3NkxCRc9UpDPdRUu31uVHrAarh4tutUASEWLrb6s9vFlGyED
 zis9IHWtIAYP3VfFtkXdZ7oDlqC/3KdEErHZuT+z4PK3Wj/QtQVfQ8SB79xFMneD
 V2E=
 =Jzmo
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix the handling of the bus_dma_mask in dma_get_required_mask, which
   caused a regression in this merge window (Lucas Stach)

 - fix a regression in the handling of DMA_ATTR_NO_KERNEL_MAPPING (me)

 - fix dma_mmap_coherent to not cause page attribute mismatches on
   coherent architectures like x86 (me)

* tag 'dma-mapping-5.3-4' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: fix page attributes for dma_mmap_*
  dma-direct: don't truncate dma_required_mask to bus addressing capabilities
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING
2019-08-14 10:31:11 -07:00
Johannes Weiner
b8e24a9300 block: annotate refault stalls from IO submission
psi tracks the time tasks wait for refaulting pages to become
uptodate, but it does not track the time spent submitting the IO. The
submission part can be significant if backing storage is contended or
when cgroup throttling (io.latency) is in effect - a lot of time is
spent in submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when
the bio is reading userspace workingset pages.

Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-14 08:50:01 -06:00
Arnd Bergmann
aad7ad2a01 dma: iop-adma: allow building without platform headers
Now that iop3xx and iop13xx are gone, the iop-adma driver no
longer needs to deal with incompatible register layout defined
in machine specific header files.

Move the iop32x specific definitions into drivers/dma/iop-adma.h
and the platform_data into include/linux/platform_data/dma-iop32x.h,
and change the machine code to no longer reference those.

The DMA0_ID/DMA1_ID/AAU_ID macros are required as part of the
platform data interface and still need to be visible, so move
those from one header to the other.

Link: https://lore.kernel.org/r/20190809163334.489360-4-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-14 15:36:22 +02:00
Marek Behún
5bc7f990cd bus: Add support for Moxtet bus
On the Turris Mox router different modules can be connected to the main
CPU board: currently a module with a SFP cage, a module with MiniPCIe
connector, a PCIe pass-through MiniPCIe connector module, a 4-port
switch module, an 8-port switch module, and a 4-port USB3 module.

For example:
  [CPU]-[PCIe-pass-through]-[PCIe]-[8-port switch]-[8-port switch]-[SFP]

Each of this modules has an input and output shift register, and these
are connected via SPI to the CPU board.

Via SPI we are able to discover which modules are connected, in which
order, and we can also read some information about the modules (eg.
their interrupt status), and configure them.
From each module 8 bits can be read (of which low 4 bits identify the
module) and 8 bits can be written.

For example from the module with a SFP cage we can read the LOS,
TX-FAULT and MOD-DEF0 signals, while we can write TX-DISABLE and
RATE-SELECT signals.

This driver creates a new bus type, called "moxtet". For each Mox module
it finds via SPI, it creates a new device on the moxtet bus so that
drivers can be written for them.

It also implements a virtual interrupt controller for the modules which
send their interrupt status over the SPI shift register. These modules
do this in addition to sending their interrupt status via the shared
interrupt line. When the shared interrupt is triggered, we read from the
shift register and handle IRQs for all devices which are in interrupt.

The topology of how Mox modules are connected can then be read by
listing /sys/bus/moxtet/devices.

Link: https://lore.kernel.org/r/20190812161118.21476-2-marek.behun@nic.cz
Signed-off-by: Marek Behún <marek.behun@nic.cz>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-08-14 15:30:35 +02:00
Wolfram Sang
af80559b4d i2c: replace i2c_new_secondary_device with an ERR_PTR variant
In the general move to have i2c_new_*_device functions which return
ERR_PTR instead of NULL, this patch converts i2c_new_secondary_device().

There are only few users, so this patch converts the I2C core and all
users in one go. The function gets renamed to i2c_new_ancillary_device()
so out-of-tree users will get a build failure to understand they need to
adapt their error checking code.

Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> # adv748x
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> # adv7511 + adv7604
Reviewed-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> # adv7604
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
2019-08-14 14:54:55 +02:00
Will Deacon
d06fa5a118 Merge tag 'common/for-v5.4-rc1/cpu-topology' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux into for-next/cpu-topology
Pull in generic CPU topology changes from Paul Walmsley (RISC-V).

* tag 'common/for-v5.4-rc1/cpu-topology' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  MAINTAINERS: Add an entry for generic architecture topology
  base: arch_topology: update Kconfig help description
  RISC-V: Parse cpu topology during boot.
  arm: Use common cpu_topology structure and functions.
  cpu-topology: Move cpu topology code to common code.
  dt-binding: cpu-topology: Move cpu-map to a common binding.
  Documentation: DT: arm: add support for sockets defining package boundaries
2019-08-14 10:07:00 +01:00
YueHaibing
68e03b8547 gpio: Fix build error of function redefinition
when do randbuilding, I got this error:

In file included from drivers/hwmon/pmbus/ucd9000.c:19:0:
./include/linux/gpio/driver.h:576:1: error: redefinition of gpiochip_add_pin_range
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
 ^~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/hwmon/pmbus/ucd9000.c:18:0:
./include/linux/gpio.h:245:1: note: previous definition of gpiochip_add_pin_range was here
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
 ^~~~~~~~~~~~~~~~~~~~~~

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 964cb34188 ("gpio: move pincontrol calls to <linux/gpio/driver.h>")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190731123814.46624-1-yuehaibing@huawei.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-08-14 10:57:18 +02:00
Jakub Kicinski
c162610c7d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Rename mss field to mss_option field in synproxy, from Fernando Mancera.

2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce.

3) More strict validation of IPVS sysctl values, from Junwei Hu.

4) Remove unnecessary spaces after on the right hand side of assignments,
   from yangxingwu.

5) Add offload support for bitwise operation.

6) Extend the nft_offload_reg structure to store immediate date.

7) Collapse several ip_set header files into ip_set.h, from
   Jeremy Sowden.

8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y,
   from Jeremy Sowden.

9) Fix several sparse warnings due to missing prototypes, from
   Valdis Kletnieks.

10) Use static lock initialiser to ensure connlabel spinlock is
    initialized on boot time to fix sched/act_ct.c, patch
    from Florian Westphal.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 18:22:57 -07:00
Heiner Kallweit
65b27995a4 net: phy: let phy_speed_down/up support speeds >1Gbps
So far phy_speed_down/up can be used up to 1Gbps only. Remove this
restriction by using new helper __phy_speed_down. New member adv_old
in struct phy_device is used by phy_speed_up to restore the advertised
modes before calling phy_speed_down. Don't simply advertise what is
supported because a user may have intentionally removed modes from
advertisement.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 17:14:06 -07:00
Heiner Kallweit
331c56ac73 net: phy: add phy_speed_down_core and phy_resolve_min_speed
phy_speed_down_core provides most of the functionality for
phy_speed_down. It makes use of new helper phy_resolve_min_speed that is
based on the sorting of the settings[] array. In certain cases it may be
helpful to be able to exclude legacy half duplex modes, therefore
prepare phy_resolve_min_speed() for it.

v2:
- rename __phy_speed_down to phy_speed_down_core

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 17:14:06 -07:00
Jakub Kicinski
708852dcac Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e1 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 16:24:57 -07:00
Andrea Arcangeli
a8282608c8 Revert "mm, thp: restore node-local hugepage allocations"
This reverts commit 2f0799a0ff ("mm, thp: restore node-local
hugepage allocations").

commit 2f0799a0ff was rightfully applied to avoid the risk of a
severe regression that was reported by the kernel test robot at the end
of the merge window.  Now we understood the regression was a false
positive and was caused by a significant increase in fairness during a
swap trashing benchmark.  So it's safe to re-apply the fix and continue
improving the code from there.  The benchmark that reported the
regression is very useful, but it provides a meaningful result only when
there is no significant alteration in fairness during the workload.  The
removal of __GFP_THISNODE increased fairness.

__GFP_THISNODE cannot be used in the generic page faults path for new
memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
behavior significantly deviates from what the MPOL_DEFAULT semantics are
supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
set to "madvise") has never meant to provide an implicit MPOL_BIND on
the "current" node the task is running on, causing swap storms and
providing a much more aggressive behavior than even zone_reclaim_node =
3.

Any workload who could have benefited from __GFP_THISNODE has now to
enable zone_reclaim_mode=1||2||3.  __GFP_THISNODE implicitly provided
the zone_reclaim_mode behavior, but it only did so if THP was enabled:
if THP was disabled, there would have been no chance to get any 4k page
from the current node if the current node was full of pagecache, which
further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
must work exactly the same with MADV_HUGEPAGE set or not.

The performance characteristic of memory depends on the hardware
details.  The numbers below are obtained on Naples/EPYC architecture and
the N/A projection extends them to show what we should aim for in the
future as a good THP NUMA locality default.  The benchmark used
exercises random memory seeks (note: the cost of the page faults is not
part of the measurement).

  D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
  0%     | +43%  | +45%   | +106% | +131%  | +224% | N/A    | N/A

D0 means distance zero (i.e.  local memory), D1 means distance one (i.e.
intra socket memory), D2 means distance two (i.e.  inter socket memory),
etc...

For the guest physical memory allocated by qemu and for guest mode
kernel the performance characteristic of RAM is more complex and an
ideal default could be:

  D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
  0%     | +58%   | +101% | N/A    | +222% | N/A    | N/A   | N/A

NOTE: the N/A are projections and haven't been measured yet, the
measurement in this case is done on a 1950x with only two NUMA nodes.
The THP case here means THP was used both in the host and in the guest.

After applying this commit the THP NUMA locality order that we'll get
out of MADV_HUGEPAGE is this:

  D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

  D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a
single node that the __GFP_THISNODE implicit "current node" mbind
caused, the THP NUMA locality order provided by __GFP_THISNODE was still
not the one we shall aim for in the long term (i.e.  the first one at
the top).

After this commit is applied, we can introduce a new allocator multi
order API and to replace those two alloc_pages_vmas calls in the page
fault path, with a single multi order call:

        unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
        page = alloc_pages_multi_order(..., &order);
        if (!page)
        	goto out;
        if (!(order & (1 << 0))) {
        	VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
        	/* THP fault */
        } else {
        	VM_WARN_ON(order != 1 << 0);
        	/* 4k fallback */
        }

The page allocator logic has to be altered so that when it fails on any
zone with order 9, it has to try again with a order 0 before falling
back to the next zone in the zonelist.

After that we need to do more measurements and evaluate if adding an
opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Andrea Arcangeli
92717d429b Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

The fixes for what was originally reported as "pathological THP
behavior" we rightfully reverted to be sure not to introduced
regressions at end of a merge window after a severe regression report
from the kernel bot.  We can safely re-apply them now that we had time
to analyze the problem.

The mm process worked fine, because the good fixes were eventually
committed upstream without excessive delay.

The regression reported by the kernel bot however forced us to revert
the good fixes to be sure not to introduce regressions and to give us
the time to analyze the issue further.  The silver lining is that this
extra time allowed to think more at this issue and also plan for a
future direction to improve things further in terms of THP NUMA
locality.

This patch (of 2):

This reverts commit 356ff8a9a7 ("Revert "mm, thp: consolidate THP
gfp handling into alloc_hugepage_direct_gfpmask").  So it reapplies
89c83fb539 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask").

Consolidation of the THP allocation flags at the same place was meant to
be a clean up to easier handle otherwise scattered code which is
imposing a maintenance burden.  There were no real problems observed
with the gfp mask consolidation but the reversion was rushed through
without a larger consensus regardless.

This patch brings the consolidation back because this should make the
long term maintainability easier as well as it should allow future
changes to be less error prone.

[mhocko@kernel.org: changelog additions]
Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Roman Gushchin
ec9f02384f mm: workingset: fix vmstat counters for shadow nodes
Memcg counters for shadow nodes are broken because the memcg pointer is
obtained in a wrong way. The following approach is used:
        virt_to_page(xa_node)->mem_cgroup

Since commit 4d96ba3530 ("mm: memcg/slab: stop setting
page->mem_cgroup pointer for slab pages") page->mem_cgroup pointer isn't
set for slab pages, so memcg_from_slab_page() should be used instead.

Also I doubt that it ever worked correctly: virt_to_head_page() should
be used instead of virt_to_page().  Otherwise objects residing on tail
pages are not accounted, because only the head page contains a valid
mem_cgroup pointer.  That was a case since the introduction of these
counters by the commit 68d48e6a2d ("mm: workingset: add vmstat counter
for shadow nodes").

Link: http://lkml.kernel.org/r/20190801233532.138743-1-guro@fb.com
Fixes: 4d96ba3530 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Ralph Campbell
76470ccd62 mm: document zone device struct page field usage
Patch series "mm/hmm: fixes for device private page migration", v3.

Testing the latest linux git tree turned up a few bugs with page
migration to and from ZONE_DEVICE private and anonymous pages.
Hopefully it clarifies how ZONE_DEVICE private struct page uses the same
mapping and index fields from the source anonymous page mapping.

This patch (of 3):

Struct page for ZONE_DEVICE private pages uses the page->mapping and and
page->index fields while the source anonymous pages are migrated to
device private memory.  This is so rmap_walk() can find the page when
migrating the ZONE_DEVICE private page back to system memory.
ZONE_DEVICE pmem backed fsdax pages also use the page->mapping and
page->index fields when files are mapped into a process address space.

Add comments to struct page and remove the unused "_zd_pad_1" field to
make this more clear.

Link: http://lkml.kernel.org/r/20190724232700.23327-2-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-08-13 16:06:52 -07:00
Paul E. McKenney
eda669a6a2 rcu/nocb: Atomic ->len field in rcu_segcblist structure
Upcoming ->nocb_lock contention-reduction work requires that the
rcu_segcblist structure's ->len field be concurrently manipulated,
but only if there are no-CBs CPUs in the kernel.  This commit
therefore makes this ->len field be an atomic_long_t, but only
in CONFIG_RCU_NOCB_CPU=y kernels.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
ce5215c134 rcu/nocb: Use separate flag to indicate offloaded ->cblist
RCU callback processing currently uses rcu_is_nocb_cpu() to determine
whether or not the current CPU's callbacks are to be offloaded.
This works, but it is not so good for cache locality.  Plus use of
->cblist for offloaded callbacks will greatly increase the frequency
of these checks.  This commit therefore adds a ->offloaded flag to the
rcu_segcblist structure to provide a more flexible and cache-friendly
means of checking for callback offloading.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:35:49 -07:00
Paul E. McKenney
1bb5f9b95a rcu/nocb: Use separate flag to indicate disabled ->cblist
NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but
forward-progress considerations would require that this pointer be both
NULL and non-NULL, which, absent a quantum-computer port of the Linux
kernel, simply won't happen.  This commit therefore creates as separate
->enabled flag to replace the current NULL checks.

[ paulmck: Add include files per 0day test robot and -next. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
2019-08-13 14:34:50 -07:00
Georgi Djakov
cbd5a9c28b interconnect: Add pre_aggregate() callback
Introduce an optional callback in interconnect provider drivers. It can be
used for implementing actions, that need to be executed before the actual
aggregation of the bandwidth requests has started.

The benefit of this for now is that it will significantly simplify the code
in provider drivers.

Suggested-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
2019-08-13 23:02:48 +03:00
Georgi Djakov
127ab2cc5f interconnect: Add support for path tags
Consumers may have use cases with different bandwidth requirements based
on the system or driver state. The consumer driver can append a specific
tag to the path and pass this information to the interconnect platform
driver to do the aggregation based on this state.

Introduce icc_set_tag() function that will allow the consumers to append
an optional tag to each path. The aggregation of these tagged paths is
platform specific.

Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
2019-08-13 23:02:44 +03:00
Yishai Hadas
972d7560ee IB/mlx5: Add legacy events to DEVX list
Add two events that were defined in the device specification but were
not exposed in the driver list.

Post this patch those events can be read over the DEVX events interface
once be reported by the firmware.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Edward Srouji <edwards@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190808084358.29517-4-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-13 12:24:17 -04:00
Doug Ledford
749b9eef50 Merge remote-tracking branch 'mlx5-next/mlx5-next' into wip/dl-for-next
Merging tip of mlx5-next in order to get changes related to adding
XRQ support to the DEVX interface needed prior to the following two
patches.

Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-08-13 12:19:19 -04:00
Thomas Gleixner
4b950bb9ac Kbuild: Handle PREEMPT_RT for version string and magic
Update the build scripts and the version magic to reflect when
CONFIG_PREEMPT_RT is enabled in the same way as CONFIG_PREEMPT is treated.

The resulting version strings:

  Linux m 5.3.0-rc1+ #100 SMP Fri Jul 26 ...
  Linux m 5.3.0-rc1+ #101 SMP PREEMPT Fri Jul 26 ...
  Linux m 5.3.0-rc1+ #102 SMP PREEMPT_RT Fri Jul 26 ...

The module vermagic:

  5.3.0-rc1+ SMP mod_unload modversions
  5.3.0-rc1+ SMP preempt mod_unload modversions
  5.3.0-rc1+ SMP preempt_rt mod_unload modversions

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-08-14 01:10:42 +09:00
Suman Anna
b58056da2e bus: ti-sysc: Add missing kerneldoc comments
A few fields in various structures is missing the corresponding
kerneldoc comments. Add them. Also, fixed the comment for sidlemodes.

Signed-off-by: Suman Anna <s-anna@ti.com>
Acked-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
2019-08-13 04:13:32 -07:00
Suman Anna
54d662227c bus: ti-sysc: Switch to SPDX license identifier
Use the appropriate SPDX license identifier in the TI sysc
interconnect target driver source files and drop the previous
boilerplate license text. Also, add the the SPDX license
identifier in the associated ti-sysc header files.

Signed-off-by: Suman Anna <s-anna@ti.com>
Acked-by: Roger Quadros <rogerq@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
2019-08-13 04:13:32 -07:00
Jeremy Sowden
20a9379d9a netfilter: remove "#ifdef __KERNEL__" guards from some headers.
A number of non-UAPI Netfilter header-files contained superfluous
"#ifdef __KERNEL__" guards.  Removed them.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:15:28 +02:00
Jeremy Sowden
78458e3e08 netfilter: add missing IS_ENABLED(CONFIG_NETFILTER) checks to some header-files.
linux/netfilter.h defines a number of struct and inline function
definitions which are only available is CONFIG_NETFILTER is enabled.
These structs and functions are used in declarations and definitions in
other header-files.  Added preprocessor checks to make sure these
headers will compile if CONFIG_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:15:18 +02:00
Jeremy Sowden
a1b2f04ea5 netfilter: add missing includes to a number of header-files.
A number of netfilter header-files used declarations and definitions
from other headers without including them.  Added include directives to
make those declarations and definitions available.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:14:39 +02:00
Jeremy Sowden
bd96b4c756 netfilter: inline four headers files into another one.
linux/netfilter/ipset/ip_set.h included four other header files:

  include/linux/netfilter/ipset/ip_set_comment.h
  include/linux/netfilter/ipset/ip_set_counter.h
  include/linux/netfilter/ipset/ip_set_skbinfo.h
  include/linux/netfilter/ipset/ip_set_timeout.h

Of these the first three were not included anywhere else.  The last,
ip_set_timeout.h, was included in a couple of other places, but defined
inline functions which call other inline functions defined in ip_set.h,
so ip_set.h had to be included before it.

Inlined all four into ip_set.h, and updated the other files that
included ip_set_timeout.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:14:26 +02:00
Yishai Hadas
b1635ee612 net/mlx5: Add XRQ legacy commands opcodes
Add XRQ legacy commands opcodes, will be used via the DEVX interface.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-08-13 12:58:11 +03:00
Christian König
52791eeec1 dma-buf: rename reservation_object to dma_resv
Be more consistent with the naming of the other DMA-buf objects.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/323401/
2019-08-13 09:09:30 +02:00
Christian König
5d344f58da dma-buf: nuke reservation_object seq number
The only remaining use for this is to protect against setting a new exclusive
fence while we grab both exclusive and shared. That can also be archived by
looking if the exclusive fence has changed or not after completing the
operation.

v2: switch setting excl fence to rcu_assign_pointer

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/322380/
2019-08-13 09:07:58 +02:00
John Garry
b884e2de2a lib: logic_pio: Add logic_pio_unregister_range()
Add a function to unregister a logical PIO range.

Logical PIO space can still be leaked when unregistering certain
LOGIC_PIO_CPU_MMIO regions, but this acceptable for now since there are no
callers to unregister LOGIC_PIO_CPU_MMIO regions, and the logical PIO
region allocation scheme would need significant work to improve this.

Cc: stable@vger.kernel.org
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Wei Xu <xuwei5@hisilicon.com>
2019-08-13 14:54:24 +08:00
Eric Biggers
4dd893d832 fs-verity: implement FS_IOC_MEASURE_VERITY ioctl
Add a function for filesystems to call to implement the
FS_IOC_MEASURE_VERITY ioctl.  This ioctl retrieves the file measurement
that fs-verity calculated for the given file and is enforcing for reads;
i.e., reads that don't match this hash will fail.  This ioctl can be
used for authentication or logging of file measurements in userspace.

See the "FS_IOC_MEASURE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:33:50 -07:00
Eric Biggers
3fda4c617e fs-verity: implement FS_IOC_ENABLE_VERITY ioctl
Add a function for filesystems to call to implement the
FS_IOC_ENABLE_VERITY ioctl.  This ioctl enables fs-verity on a file.

See the "FS_IOC_ENABLE_VERITY" section of
Documentation/filesystems/fsverity.rst for the documentation.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:33:50 -07:00
Eric Biggers
78a1b96bcf fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl
Add a root-only variant of the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl which
removes all users' claims of the key, not just the current user's claim.
I.e., it always removes the key itself, no matter how many users have
added it.

This is useful for forcing a directory to be locked, without having to
figure out which user ID(s) the key was added under.  This is planned to
be used by a command like 'sudo fscrypt lock DIR --all-users' in the
fscrypt userspace tool (http://github.com/google/fscrypt).

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:18:50 -07:00
Eric Biggers
5dae460c22 fscrypt: v2 encryption policy support
Add a new fscrypt policy version, "v2".  It has the following changes
from the original policy version, which we call "v1" (*):

- Master keys (the user-provided encryption keys) are only ever used as
  input to HKDF-SHA512.  This is more flexible and less error-prone, and
  it avoids the quirks and limitations of the AES-128-ECB based KDF.
  Three classes of cryptographically isolated subkeys are defined:

    - Per-file keys, like used in v1 policies except for the new KDF.

    - Per-mode keys.  These implement the semantics of the DIRECT_KEY
      flag, which for v1 policies made the master key be used directly.
      These are also planned to be used for inline encryption when
      support for it is added.

    - Key identifiers (see below).

- Each master key is identified by a 16-byte master_key_identifier,
  which is derived from the key itself using HKDF-SHA512.  This prevents
  users from associating the wrong key with an encrypted file or
  directory.  This was easily possible with v1 policies, which
  identified the key by an arbitrary 8-byte master_key_descriptor.

- The key must be provided in the filesystem-level keyring, not in a
  process-subscribed keyring.

The following UAPI additions are made:

- The existing ioctl FS_IOC_SET_ENCRYPTION_POLICY can now be passed a
  fscrypt_policy_v2 to set a v2 encryption policy.  It's disambiguated
  from fscrypt_policy/fscrypt_policy_v1 by the version code prefix.

- A new ioctl FS_IOC_GET_ENCRYPTION_POLICY_EX is added.  It allows
  getting the v1 or v2 encryption policy of an encrypted file or
  directory.  The existing FS_IOC_GET_ENCRYPTION_POLICY ioctl could not
  be used because it did not have a way for userspace to indicate which
  policy structure is expected.  The new ioctl includes a size field, so
  it is extensible to future fscrypt policy versions.

- The ioctls FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY,
  and FS_IOC_GET_ENCRYPTION_KEY_STATUS now support managing keys for v2
  encryption policies.  Such keys are kept logically separate from keys
  for v1 encryption policies, and are identified by 'identifier' rather
  than by 'descriptor'.  The 'identifier' need not be provided when
  adding a key, since the kernel will calculate it anyway.

This patch temporarily keeps adding/removing v2 policy keys behind the
same permission check done for adding/removing v1 policy keys:
capable(CAP_SYS_ADMIN).  However, the next patch will carefully take
advantage of the cryptographically secure master_key_identifier to allow
non-root users to add/remove v2 policy keys, thus providing a full
replacement for v1 policies.

(*) Actually, in the API fscrypt_policy::version is 0 while on-disk
    fscrypt_context::format is 1.  But I believe it makes the most sense
    to advance both to '2' to have them be in sync, and to consider the
    numbering to start at 1 except for the API quirk.

Reviewed-by: Paul Crowley <paulcrowley@google.com>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:18:50 -07:00
Eric Biggers
5a7e29924d fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl
Add a new fscrypt ioctl, FS_IOC_GET_ENCRYPTION_KEY_STATUS.  Given a key
specified by 'struct fscrypt_key_specifier' (the same way a key is
specified for the other fscrypt key management ioctls), it returns
status information in a 'struct fscrypt_get_key_status_arg'.

The main motivation for this is that applications need to be able to
check whether an encrypted directory is "unlocked" or not, so that they
can add the key if it is not, and avoid adding the key (which may
involve prompting the user for a passphrase) if it already is.

It's possible to use some workarounds such as checking whether opening a
regular file fails with ENOKEY, or checking whether the filenames "look
like gibberish" or not.  However, no workaround is usable in all cases.

Like the other key management ioctls, the keyrings syscalls may seem at
first to be a good fit for this.  Unfortunately, they are not.  Even if
we exposed the keyring ID of the ->s_master_keys keyring and gave
everyone Search permission on it (note: currently the keyrings
permission system would also allow everyone to "invalidate" the keyring
too), the fscrypt keys have an additional state that doesn't map cleanly
to the keyrings API: the secret can be removed, but we can be still
tracking the files that were using the key, and the removal can be
re-attempted or the secret added again.

After later patches, some applications will also need a way to determine
whether a key was added by the current user vs. by some other user.
Reserved fields are included in fscrypt_get_key_status_arg for this and
other future extensions.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:18:50 -07:00
Eric Biggers
b1c0ec3599 fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
Add a new fscrypt ioctl, FS_IOC_REMOVE_ENCRYPTION_KEY.  This ioctl
removes an encryption key that was added by FS_IOC_ADD_ENCRYPTION_KEY.
It wipes the secret key itself, then "locks" the encrypted files and
directories that had been unlocked using that key -- implemented by
evicting the relevant dentries and inodes from the VFS caches.

The problem this solves is that many fscrypt users want the ability to
remove encryption keys, causing the corresponding encrypted directories
to appear "locked" (presented in ciphertext form) again.  Moreover,
users want removing an encryption key to *really* remove it, in the
sense that the removed keys cannot be recovered even if kernel memory is
compromised, e.g. by the exploit of a kernel security vulnerability or
by a physical attack.  This is desirable after a user logs out of the
system, for example.  In many cases users even already assume this to be
the case and are surprised to hear when it's not.

It is not sufficient to simply unlink the master key from the keyring
(or to revoke or invalidate it), since the actual encryption transform
objects are still pinned in memory by their inodes.  Therefore, to
really remove a key we must also evict the relevant inodes.

Currently one workaround is to run 'sync && echo 2 >
/proc/sys/vm/drop_caches'.  But, that evicts all unused inodes in the
system rather than just the inodes associated with the key being
removed, causing severe performance problems.  Moreover, it requires
root privileges, so regular users can't "lock" their encrypted files.

Another workaround, used in Chromium OS kernels, is to add a new
VFS-level ioctl FS_IOC_DROP_CACHE which is a more restricted version of
drop_caches that operates on a single super_block.  It does:

        shrink_dcache_sb(sb);
        invalidate_inodes(sb, false);

But it's still a hack.  Yet, the major users of filesystem encryption
want this feature badly enough that they are actually using these hacks.

To properly solve the problem, start maintaining a list of the inodes
which have been "unlocked" using each master key.  Originally this
wasn't possible because the kernel didn't keep track of in-use master
keys at all.  But, with the ->s_master_keys keyring it is now possible.

Then, add an ioctl FS_IOC_REMOVE_ENCRYPTION_KEY.  It finds the specified
master key in ->s_master_keys, then wipes the secret key itself, which
prevents any additional inodes from being unlocked with the key.  Then,
it syncs the filesystem and evicts the inodes in the key's list.  The
normal inode eviction code will free and wipe the per-file keys (in
->i_crypt_info).  Note that freeing ->i_crypt_info without evicting the
inodes was also considered, but would have been racy.

Some inodes may still be in use when a master key is removed, and we
can't simply revoke random file descriptors, mmap's, etc.  Thus, the
ioctl simply skips in-use inodes, and returns -EBUSY to indicate that
some inodes weren't evicted.  The master key *secret* is still removed,
but the fscrypt_master_key struct remains to keep track of the remaining
inodes.  Userspace can then retry the ioctl to evict the remaining
inodes.  Alternatively, if userspace adds the key again, the refreshed
secret will be associated with the existing list of inodes so they
remain correctly tracked for future key removals.

The ioctl doesn't wipe pagecache pages.  Thus, we tolerate that after a
kernel compromise some portions of plaintext file contents may still be
recoverable from memory.  This can be solved by enabling page poisoning
system-wide, which security conscious users may choose to do.  But it's
very difficult to solve otherwise, e.g. note that plaintext file
contents may have been read in other places than pagecache pages.

Like FS_IOC_ADD_ENCRYPTION_KEY, FS_IOC_REMOVE_ENCRYPTION_KEY is
initially restricted to privileged users only.  This is sufficient for
some use cases, but not all.  A later patch will relax this restriction,
but it will require introducing key hashes, among other changes.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-08-12 19:18:49 -07:00