Commit graph

57824 commits

Author SHA1 Message Date
Sagi Grimberg
24c5dc6610 block: Add rdma affinity based queue mapping helper
Like pci and virtio, we add a rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.

Reviewed-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-08 14:58:03 -04:00
Sagi Grimberg
a435393aca mlx5: move affinity hints assignments to generic code
generic api takes care of spreading affinity similar to
what mlx5 open coded (and even handles better asymmetric
configurations). Ask the generic API to spread affinity
for us, and feed him pre_vectors that do not participate
in affinity settings (which is an improvement to what we
had before).

The affinity assignments should match what mlx5 tried to
do earlier but now we do not set affinity to async, cmd
and pages dedicated vectors.

Also, remove mlx5e_get_cpu and introduce mlx5e_get_node
(used for allocation purposes) and mlx5_get_vector_affinity
(for indirection table construction) as they provide the needed
information. Luckily, we have generic helpers to get cpumask
and node given a irq vector. mlx5_get_vector_affinity will
be used by mlx5_ib in a subsequent patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-08 14:55:56 -04:00
Sagi Grimberg
78249c4215 mlx5: convert to generic pci_alloc_irq_vectors
Now that we have a generic code to allocate an array
of irq vectors and even correctly spread their affinity,
correctly handle cpu hotplug events and more, were much
better off using it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-08 14:53:05 -04:00
Linus Torvalds
bfa738cf3d Third set of -rc fixes for 4.13 cycle
- small set of miscellanous fixes
 - a reasonably sizable set of IPoIB fixes that deal with multiple long
   standing issues
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZiKRfAAoJELgmozMOVy/dU6cP/00pqRgSkywN+6Duclqofdvs
 D2vK1Mp+ehrzhbq0tdlkBizG+cM6EUHkFOcUmUMhsbTCxsiSjWqqHQvz/6EmPK/m
 mSbFkiYOsEzVOuZ+65VQRLhwXe3WnQ0GLJZ58+RLML7NkSK2/AElzrlDmVzfm2Ve
 yQfYM4QGGymhHyHjpKzA955Q9Tt/TS62dUTLAJZbpJfPcR7eErLaLZLsoUlcpDeP
 zyybbP+YCPu03w9eC8O3Jd8r9s0o4w8qWPo8ROaIzIkJGbou5yrt7mQNP2gSASta
 rHbLF96p9myIv4u60+UdsZt2GAthUHxqyEDJpQNfB1tAjb8JtZ7cy4xpXYAelU8P
 PYs86PUZiCj2kPhOzhaI6OooNZANDK+4IBu8pVE68V0aWVnu84qA3ZjAfRBtaUgA
 eNRXa/M+ivjNcJ4bGDqQVAnulTVpSD7PNPWlmGcZahv+kXHT1Bs4dgjUW/RWtFkS
 A2XP23gyOHMrK9aEaiAhcE2Vky+0RuMiMbuyNzzSuB6XgUP+kgO9z6OkZWhpEOXu
 SLgVf0fgqo9rlmoQCwaSh/saYSHg4F6XQit9Gk37tpOlnx2OQbzFBHXLJ9zAf9UU
 CALh9QzroalR0140vfIqOGCcDKFKputZNMIBpKnsYio+JLMUwdUZA2UwtNsKLBYr
 Ih7p6tC9jMqYusAWAyFg
 =2fQS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma

Pull rdma fixes from Doug Ledford:
 "Third set of -rc fixes for 4.13 cycle

   - small set of miscellanous fixes

   - a reasonably sizable set of IPoIB fixes that deal with multiple
     long standing issues"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
  IB/hns: checking for IS_ERR() instead of NULL
  RDMA/mlx5: Fix existence check for extended address vector
  IB/uverbs: Fix device cleanup
  RDMA/uverbs: Prevent leak of reserved field
  IB/core: Fix race condition in resolving IP to MAC
  IB/ipoib: Notify on modify QP failure only when relevant
  Revert "IB/core: Allow QP state transition from reset to error"
  IB/ipoib: Remove double pointer assigning
  IB/ipoib: Clean error paths in add port
  IB/ipoib: Add get statistics support to SRIOV VF
  IB/ipoib: Add multicast packets statistics
  IB/ipoib: Set IPOIB_NEIGH_TBL_FLUSH after flushed completion initialization
  IB/ipoib: Prevent setting negative values to max_nonsrq_conn_qp
  IB/ipoib: Make sure no in-flight joins while leaving that mcast
  IB/ipoib: Use cancel_delayed_work_sync when needed
  IB/ipoib: Fix race between light events and interface restart
2017-08-08 11:42:33 -07:00
Linus Torvalds
de70be0ae3 SCSI fixes on 20170808
Two small fixes, one re-fix of a previous fix and five patches sorting
 out hotplug in the bnx2X class of drivers.  The latter is rather
 involved, but necessary because these drivers have started dropping
 lockdep recursion warnings on the hotplug lock because of its
 conversion to a percpu rwsem.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJZidgyAAoJEAVr7HOZEZN4tzEQAJsrO6rW6ng/LGIbf+oYdNFY
 lPcdC3wCJHpPJLzvd0yPu0Davu+m1bCtke8XctWKRcj2QO9YvDmwCrJ6nVNMPO8p
 LFz0rYSma5VDXyPVcUn+rtMDY3eRVQG8oukVdMzFRtAAlXMDpDl4gaITp4X2HX78
 7hSt/SdmpKAgidrvgln7hHvVJKE+Z3Tzp2asS5ygfle+1/u0ByMskiyo6tR/zz8I
 uf8Y2s5GeiHBe5K6bi4j+xSmC7evxqnDZpQAHeeT/weJQ8l96SIrPN66W9IjJWnD
 D0P10ctFYS5WfQAB6Fd3UHilVDYOeGOXHOIdWBqEbpi2OtsstihsQy95tQmRxgN7
 uBBkCanPbcLm1J9Vg/K08wY8TUdsjg+zJbq+Q+u4q37xJUjz6SYDSMzQyUNT4W1u
 3+dB9ImjBZjisg1stokuUTM5lbd9V/SvZngs7nzzi0siRXQI4oTKbGkS1PZUYnly
 rBfSpnakuraKnQqs0s0c1y0+bcGT4AVMGkzaZ0ivtk4JGWmmkXSoAurByfu9Eq87
 7feVyl9QxTBQAD1bl5IaRA/cxBhqe31FwvIpM8ik9zUF5hqt8vAZszB0Yi9MqyvM
 3ZCR0poMk8e2ZE+vMcH304uQ9CW3jG7cgWwZgfvWBmd/NgMOKgFhY3PAScmGkFex
 EgeDmxx+/uctEHQYYV3g
 =Lg/x
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Two small fixes, one re-fix of a previous fix and five patches sorting
  out hotplug in the bnx2X class of drivers. The latter is rather
  involved, but necessary because these drivers have started dropping
  lockdep recursion warnings on the hotplug lock because of its
  conversion to a percpu rwsem"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sg: only check for dxfer_len greater than 256M
  scsi: aacraid: reading out of bounds
  scsi: qedf: Limit number of CQs
  scsi: bnx2i: Simplify cpu hotplug code
  scsi: bnx2fc: Simplify CPU hotplug code
  scsi: bnx2i: Prevent recursive cpuhotplug locking
  scsi: bnx2fc: Prevent recursive cpuhotplug locking
  scsi: bnx2fc: Plug CPU hotplug race
2017-08-08 09:38:41 -07:00
Elaine Zhang
ec52e46256 clk: fractional-divider: allow overriding of approximation
Fractional dividers may have special requirements concerning numerator
and denominator selection that differ from just getting the best
approximation.

For example on Rockchip socs the denominator must be at least 20 times
larger than the numerator to generate precise clock frequencies.

Therefore add the ability to provide custom approximation functions.

Signed-off-by: Elaine Zhang <zhangqing@rock-chips.com>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
2017-08-08 17:39:48 +02:00
Rafael J. Wysocki
d6344d4b56 cpufreq: Simplify cpufreq_can_do_remote_dvfs()
The if () in cpufreq_can_do_remote_dvfs() is superfluous, so drop
it and simply return the value of the expression under it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-08 17:09:02 +02:00
Chuck Lever
41c8f70f5a xprtrdma: Harden backchannel call decoding
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 10:52:00 -04:00
Chi-Hsien Lin
0ec9eb90fe brcmfmac: Add support for CYW4373 SDIO/USB chipset
Add support for CYW4373 SDIO/USB chipset.
CYW4373 is a 1x1 dual-band 11ac chipset with 20/40/80Mhz channel support.
It's a WiFi/BT combo device.

Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Reviewed-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
2017-08-08 14:51:12 +03:00
Longpeng(Mike)
199b5763d3 KVM: add spinlock optimization framework
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).

But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.

This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08 10:57:43 +02:00
Linus Torvalds
d16b9d223b MTD fixes for 4.13-rc5:
Just NAND fixes (in both the core handling, and a few drivers). Notes stolen
 from Boris:
 
 Core fixes:
 - Fix data interface setup for ONFI NANDs that do not support the SET
   FEATURES command
 - Fix a kernel doc header
 - Fix potential integer overflow when retrieving timing information
   from the parameter page
 - Fix wrong OOB layout for small page NANDs
 
 Driver fixes:
 - Fix potential division-by-zero bug
 - Fix backward compat with old atmel-nand DT bindings
 - Fix ->setup_data_interface() in the atmel NAND driver
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJZiOaPAAoJEFySrpd9RFgtI+oQAK048jerO6VqjwgesmgUkxc6
 sbxWlTeDcXo5FP5E81s88ywtcvI9ArQg+Oq6wLuWEejix9boJYZy9ljrmbZTTEIC
 TScyXy3ZJnZk2GdMm0XOZ9oE8BknvCcoIwbDKfLXkdW/K2jxBPWBr1M/8wLoLiOG
 WGANSjvBE0gjkItIQw9dE4U7Kr9A53nHfIzLPtRIL9p34QYioKx45vXQh/owwO6P
 hndAAuAZPwq5jEtRVJN8MvEMb8aZWQXSG7t4xHg2LULiFwe6glHicvi5ZfImvN45
 Dw3DXFaljNP+QxSV34nCkfo4Rp4sjLssNij093qUncuBOUQgMWxJtY/sVjaA79uv
 DI1iO40D1A9mWSf19GKQvW92cXpp0me+hwzsRP2njIChyJbRIF8Z6oqZCnkjKPFQ
 eCNdJy30pedAeUtkxnfvi588MmKyl9qAmYjDAhk/xaT5qj4ZaZAJLYzDanicg/F4
 bQz71F+Yocx7xPGor5gvaAqZ+UAPEbzrfmziBUQb0U4wbK6EcyRE+lGIImMl4Uir
 K6WAPySWN7Og/Zjpf4fGG+lfbLBveVYkOiuqqUnjriCaAyio+NjfD5MxIVWJIikB
 3DKwMCIk3qoayT5IHxqPgN1Uy9adl8wintZrcfPkJfBziQ+hTA9dn/aevPgQrqdG
 lx6cwAy3I/iUgDkYqOVN
 =KcS2
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20170807' of git://git.infradead.org/linux-mtd

Pull MTD fixes from Brian Norris:
 "I missed getting these out for rc4, but here are some MTD fixes.

  Just NAND fixes (in both the core handling, and a few drivers). Notes
  stolen from Boris:

  Core fixes:

   - fix data interface setup for ONFI NANDs that do not support the SET
     FEATURES command

   - fix a kernel doc header

   - fix potential integer overflow when retrieving timing information
     from the parameter page

   - fix wrong OOB layout for small page NANDs

  Driver fixes:

   - fix potential division-by-zero bug

   - fix backward compat with old atmel-nand DT bindings

   - fix ->setup_data_interface() in the atmel NAND driver"

* tag 'for-linus-20170807' of git://git.infradead.org/linux-mtd:
  mtd: nand: atmel: Fix EDO mode check
  mtd: nand: Declare tBERS, tR and tPROG as u64 to avoid integer overflow
  mtd: nand: Fix timing setup for NANDs that do not support SET FEATURES
  mtd: nand: Fix a docs build warning
  mtd: nand: sunxi: fix potential divide-by-zero error
  nand: fix wrong default oob layout for small pages using soft ecc
  mtd: nand: atmel: Fix DT backward compatibility in pmecc.c
2017-08-07 18:40:18 -07:00
David Lebrun
d1df6fd8a1 ipv6: sr: define core operations for seg6local lightweight tunnel
This patch implements a new type of lightweight tunnel named seg6local.
A seg6local lwt is defined by a type of action and a set of parameters.
The action represents the operation to perform on the packets matching the
lwt's route, and is not necessarily an encapsulation. The set of parameters
are arguments for the processing function.

Each action is defined in a struct seg6_action_desc within
seg6_action_table[]. This structure contains the action, mandatory
attributes, the processing function, and a static headroom size required by
the action. The mandatory attributes are encoded as a bitmask field. The
static headroom is set to a non-zero value when the processing function
always add a constant number of bytes to the skb (e.g. the header size for
encapsulations).

To facilitate rtnetlink-related operations such as parsing, fill_encap,
and cmp_encap, each type of action parameter is associated to three
function pointers, in seg6_action_params[].

All actions defined in seg6_local.h are detailed in [1].

[1] https://tools.ietf.org/html/draft-filsfils-spring-srv6-network-programming-01

Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 14:16:22 -07:00
Yonghong Song
cf5f5cea27 bpf: add support for sys_enter_* and sys_exit_* tracepoints
Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The iovisor/bcc issue #748
(https://github.com/iovisor/bcc/issues/748) documents this issue.
For example, if you try to attach a bpf program to tracepoints
syscalls/sys_enter_newfstat, you will get the following error:
   # ./tools/trace.py t:syscalls:sys_enter_newfstat
   Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
   Failed to attach BPF to tracepoint

The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.

This patch adds bpf support for these syscalls tracepoints by
  . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
  . calling bpf programs in perf_syscall_enter and perf_syscall_exit

The legality of bpf program ctx access is also checked.
Function trace_event_get_offsets returns correct max offset for each
specific syscall tracepoint, which is compared against the maximum offset
access in bpf program.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 14:09:48 -07:00
David Ahern
1801b570dd net: ipv6: add second dif to udp socket lookups
Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Early demux lookups are handled in the next patch as part of INET_MATCH
changes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:22 -07:00
David Ahern
60d9b03141 net: ipv4: add second dif to multicast source filter
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:22 -07:00
David S. Miller
fde6af4729 mlx5-shared-2017-08-07
This series includes some mlx5 updates for both net-next and rdma trees.
 
 From Saeed,
 Core driver updates to allow selectively building the driver with
 or without some large driver components, such as
 	- E-Switch (Ethernet SRIOV support).
 	- Multi-Physical Function Switch (MPFs) support.
 For that we split E-Switch and MPFs functionalities into separate files.
 
 From Erez,
 Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
 is taking place and until it completes.
 
 From Rabie,
 Increase the maximum supported flow counters.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJZiDoAAAoJEEg/ir3gV/o+594H/RH5kRwC719s/5YQFJXvGsVC
 fjtj3UUJPLrWB8XBh7a4PRcxXPIHaFKJuY3MU7KHFIeZQFklJcit3njjpxDlUINo
 F5S1LHBSYBkeMD/ksWBA8OLCBprNGN6WQ2tuFfAjZlQQ44zqv8LJmegoDtW9bGRy
 aGAkjUmALEblQsq81y0BQwN2/8DA8HAywrs8L2dkH1LHwijoIeYMZFOtKugv1FbB
 ABSKxcU7D/NYw6rsVdZG59fHFQ+eKOspDFqBZrUzfQ+zUU2hFFo96ovfXBfIqYCV
 7BtJuKXu2LeGPzFLsuw4h1131iqFT1iSMy9fEhf/4OwaL/KPP/+Umy8vP/XfM+U=
 =wCpd
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-shared-2017-08-07

This series includes some mlx5 updates for both net-next and rdma trees.

From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
	- E-Switch (Ethernet SRIOV support).
	- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.

From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.

From Rabie,
Increase the maximum supported flow counters.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 10:42:09 -07:00
Jiri Pirko
de4784ca03 net: sched: get rid of struct tc_to_netdev
Get rid of struct tc_to_netdev which is now just unnecessary container
and rather pass per-type structures down to drivers directly.
Along with that, consolidate the naming of per-type structure variables
in cls_*.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko
5fd9fc4e20 net: sched: push cls related args into cls_common structure
As ndo_setup_tc is generic offload op for whole tc subsystem, does not
really make sense to have cls-specific args. So move them under
cls_common structurure which is embedded in all cls structs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:37 -07:00
Jiri Pirko
3e0e826643 net: sched: make egress_dev flag part of flower offload struct
Since this is specific to flower now, make it part of the flower offload
struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jiri Pirko
ade9b65884 net: sched: rename TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL
In order to be aligned with the rest of the types, rename
TC_SETUP_MATCHALL to TC_SETUP_CLSMATCHALL.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Jiri Pirko
2572ac53c4 net: sched: make type an argument for ndo_setup_tc
Since the type is always present, push it to be a separate argument to
ndo_setup_tc. On the way, name the type enum and use it for arg type.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 09:42:35 -07:00
Linus Walleij
2606706e4d backlight: gpio_backlight: Delete pdata inversion
The option to invert the output of the GPIO (active low) is
not used by the only platform still using platform data to
set up a GPIO backlight (one SH board). Delete the option
as we do not expect to expand the use of board files for
this driver, and GPIO descriptors intrinsically keep track
of any signal inversion.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Lee Jones <lee.jones@linaro.org>
2017-08-07 17:11:28 +01:00
Arnd Bergmann
dfc8e9165c fbcon: mark dummy functions 'inline'
The newly introduced fb_console_init/exit declarations have a
dummy alternative that is marked 'static' in the header but not
inline, leading to a warning whenever the header gets included:

In file included from drivers/video/fbdev/core/fbmem.c:35:0:
include/linux/fbcon.h:9:13: error: 'fb_console_exit' defined but not used [-Werror=unused-function]

This adds the intended 'inline'.

Fixes: 6104c37094 ("fbcon: Make fbcon a built-time depency for fbdev")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Sean Paul <seanpaul@chromium.org>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
2017-08-07 17:22:14 +02:00
David Hildenbrand
5e2f30b756 KVM: nVMX: get rid of nested_get_page()
nested_get_page() just sounds confusing. All we want is a page from G1.
This is even unrelated to nested.

Let's introduce kvm_vcpu_gpa_to_page() so we don't get too lengthy
lines.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Squash pasto fix from Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07 15:27:00 +02:00
Ludovic Desroches
0cca6c8920 pinctrl: generic: update references to Documentation/pinctrl.txt
Update deprecated references to Documentation/pinctrl.txt since it has been
moved to Documentation/driver-api/pinctl.rst.

Signed-off-by: Ludovic Desroches <ludovic.desroches@o2linux.fr>
Fixes: 5a9b73832e ("pinctrl.txt: move it to the driver-api book")
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2017-08-07 15:26:34 +02:00
Lorenzo Pieralisi
7ad4263980 ACPI: Make acpi_dma_configure() DMA regions aware
Current ACPI DMA configuration set-up device DMA capabilities through
kernel defaults that do not take into account platform specific DMA
configurations reported by firmware.

By leveraging the ACPI acpi_dev_get_dma_resources() API, add code
in acpi_dma_configure() to retrieve the DMA regions to correctly
set-up PCI devices DMA parameters.

Rework the ACPI IORT kernel API to make sure they can accommodate
the DMA set-up required by firmware. By making PCI devices DMA set-up
ACPI IORT specific, the kernel is shielded from unwanted regressions
that could be triggered by parsing DMA resources on arches that were
previously ignoring them (ie x86/ia64), leaving kernel behaviour
unchanged on those arches.

Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Nate Watterson <nwatters@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-07 14:28:51 +02:00
Lorenzo Pieralisi
c04ac679c6 ACPI: Introduce DMA ranges parsing
Some devices have limited addressing capabilities and cannot
reference the whole memory address space while carrying out DMA
operations (eg some devices with bus address bits range smaller than
system bus - which prevents them from using bus addresses that are
otherwise valid for the system).

The ACPI _DMA object allows bus devices to define the DMA window that is
actually addressable by devices that sit upstream the bus, therefore
providing a means to parse and initialize the devices DMA masks and
addressable DMA range size.

By relying on the generic ACPI kernel layer to retrieve and parse
resources, introduce ACPI core code to parse the _DMA object.

Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Tested-by: Nate Watterson <nwatters@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-08-07 14:28:51 +02:00
Rabie Loulou
a8ffcc741a net/mlx5: Increase the maximum flow counters supported
Read new NIC capability field which represnts 16 MSBs of the max flow
counters number supported (max_flow_counter_31_16).

Backward compatibility with older firmware is preserved, the modified
driver reads max_flow_counter_31_16 as 0 from the older firmware and
uses up to 64K counters.

Changed flow counter id from 16 bits to 32 bits. Backward compatibility
with older firmware is preserved as we kept the 16 LSBs of the counter
id in place and added 16 MSBs from reserved field.

Changed the background bulk reading of flow counters to work in chunks
of at most 32K counters, to make sure we don't attempt to allocate very
large buffers.

Signed-off-by: Rabie Loulou <rabiel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-07 10:47:07 +03:00
Rabie Loulou
61690e09c3 net/mlx5: Fix counter list hardware structure
The counter list hardware structure doesn't contain a clear and
num_of_counters fields, remove them.

These wrong fields were never used by the driver hence no other driver
changes.

Fixes: a351a1b03b ("net/mlx5: Introduce bulk reading of flow counters")
Signed-off-by: Rabie Loulou <rabiel@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-07 10:47:07 +03:00
Erez Shitrit
97834eba7c net/mlx5: Delay events till ib registration ends
When mlx5_ib registers itself to mlx5_core as an interface, it will
call mlx5_add_device which will call mlx5_ib interface add callback,
in case the latter successfully returns, only then mlx5_core will add
it to the interface list and async events will be forwarded to mlx5_ib.
Between mlx5_ib interface add callback and mlx5_core adding the mlx5_ib
interface to its devices list, arriving mlx5_core events can be missed
by the new mlx5_ib registering interface.

In other words:
thread 1: mlx5_ib: mlx5_register_interface(dev)
thread 1: mlx5_core: mlx5_add_device(dev)
thread 1: mlx5_core: ctx = dev->add => (mlx5_ib)->mlx5_ib_add
thread 2: mlx5_core_event: **new event arrives, forward to dev_list
thread 1: mlx5_core: add_ctx_to_dev_list(ctx)
/* previous event was missed by the new interface.*/
It is ok to miss events before dev->add (mlx5_ib)->mlx5_ib_add_device
but not after.

We fix this race by accumulating the events that come between the
ib_register_device (inside mlx5_add_device->(dev->add)) till the adding
to the list completes and fire them to the new registering interface
after that.

Fixes: f1ee87fe55 ("net/mlx5: Organize device list API in one place")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-08-07 10:47:07 +03:00
Saeed Mahameed
eeb66cdb68 net/mlx5: Separate between E-Switch and MPFS
Multi-Physical Function Switch (MPFs) is required for when multi-PF
configuration is enabled to allow passing user configured unicast MAC
addresses to the requesting PF.

Before this patch eswitch.c used to manage the HW MPFS l2 table,
E-Switch always (regardless of sriov) enabled vport(0) (NIC PF) vport's
contexts update on unicast mac address list changes, to populate the PF's
MPFS L2 table accordingly.

In downstream patch we would like to allow compiling the driver without
E-Switch functionalities, for that we move MPFS l2 table logic out
of eswitch.c into its own file, and provide Kconfig flag (MLX5_MPFS) to
allow compiling out MPFS for those who don't want Multi-PF support.

NIC PF netdevice will now directly update MPFS l2 table via the new MPFS
API. VF netdevice has no access to MPFS L2 table, so E-Switch will remain
responsible of updating its MPFS l2 table on behalf of its VFs.

Due to this change we also don't require enabling vport(0) (PF vport)
unicast mac changes events anymore, for when SRIOV is not enabled.
Which means E-Switch is now activated only on SRIOV activation, and not
required otherwise.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jes Sorensen <jsorensen@fb.com>
Cc: kernel-team@fb.com
2017-08-07 10:47:06 +03:00
Yuchung Cheng
4faf783998 tcp: fix cwnd undo in Reno and HTCP congestion controls
Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno.  This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Wei Sun <unlcsewsun@gmail.com>
Signed-off-by: Yuchung Cheng <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 21:25:10 -07:00
Russell King
770a1ad557 phylink: add module EEPROM support
Add support for reading module EEPROMs through phylink.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
ce0aa27ff3 sfp: add sfp-bus to bridge between network devices and sfp cages
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
9525ae8395 phylink: add phylink infrastructure
The link between the ethernet MAC and its PHY has become more complex
as the interface evolves.  This is especially true with serdes links,
where the part of the PHY is effectively integrated into the MAC.

Serdes links can be connected to a variety of devices, including SFF
modules soldered down onto the board with the MAC, a SFP cage with
a hotpluggable SFP module which may contain a PHY or directly modulate
the serdes signals onto optical media with or without a PHY, or even
a classical PHY connection.

Moreover, the negotiation information on serdes links comes in two
varieties - SGMII mode, where the PHY provides its speed/duplex/flow
control information to the MAC, and 1000base-X mode where both ends
exchange their abilities and each resolve the link capabilities.

This means we need a more flexible means to support these arrangements,
particularly with the hotpluggable nature of SFP, where the PHY can
be attached or detached after the network device has been brought up.

Ethtool information can come from multiple sources:
- we may have a PHY operating in either SGMII or 1000base-X mode, in
  which case we take ethtool/mii data directly from the PHY.
- we may have a optical SFP module without a PHY, with the MAC
  operating in 1000base-X mode - the ethtool/mii data needs to come
  from the MAC.
- we may have a copper SFP module with a PHY whic can't be accessed,
  which means we need to take ethtool/mii data from the MAC.

Phylink aims to solve this by providing an intermediary between the
MAC and PHY, providing a safe way for PHYs to be hotplugged, and
allowing a SFP driver to reconfigure the serdes connection.

Phylink also takes over support of fixed link connections, where the
speed/duplex/flow control are fixed, but link status may be controlled
by a GPIO signal.  By avoiding the fixed-phy implementation, phylink
can provide a faster response to link events: fixed-phy has to wait for
phylib to operate its state machine, which can take several seconds.
In comparison, phylink takes milliseconds.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

- remove sync status
- rework supported and advertisment handling
- add 1000base-x speed for fixed links
- use functionality exported from phy-core, reworking
  __phylink_ethtool_ksettings_set for it
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
a81497bee7 net: phy: provide a hook for link up/link down events
Sometimes, we need to do additional work between the PHY coming up and
marking the carrier present - for example, we may need to wait for the
PHY to MAC link to finish negotiation.  This changes phylib to provide
a notification function pointer which avoids the built-in
netif_carrier_on() and netif_carrier_off() functions.

Standard ->adjust_link functionality is provided by hooking a helper
into the new ->phy_link_change method.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
0ccb4fc65d net: phy: move phy_lookup_setting() and guts of phy_supported_speeds() to phy-core
phy_lookup_setting() provides useful functionality in ethtool code
outside phylib.  Move it to phy-core and allow it to be re-used (eg,
in phylink) rather than duplicated elsewhere.  Note that this supports
the larger linkmode space.

As we move the phy settings table, we also need to move the guts of
phy_supported_speeds() as well.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
da4625ac26 net: phy: split out PHY speed and duplex string generation
Other code would like to make use of this, so make the speed and duplex
string generation visible, and place it in a separate file.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Dan Williams
401c0a19c6 nfit, libnvdimm, region: export 'position' in mapping info
It is useful to be able to know the position of a DIMM in an
interleave-set. Consider the case where the order of the DIMMs changes
causing a namespace to be invalidated because the interleave-set cookie no
longer matches. If the before and after state of each DIMM position is
known this state debugged by the system owner.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-04 17:20:16 -07:00
Rafael J. Wysocki
e870c6c87c ACPI / PM: Prefer suspend-to-idle over S3 on some systems
Modify the ACPI system sleep support setup code to select
suspend-to-idle as the default system sleep state if
(1) the ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and
(2) the Low Power Idle S0 _DSM interface has been discovered and
(3) the default sleep state was not selected from the kernel command
line.

The main motivation for this change is that systems where the (1) and
(2) conditions are met typically ship with OSes that don't exercise
the S3 path in the platform firmware which remains untested and turns
out to be non-functional at least in some cases.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Mario Limonciello <mario.limonciello@dell.com>
2017-08-05 01:51:26 +02:00
Linus Torvalds
6999507416 KVM fixes for v4.13-rc4
ARM:
  - Yet another race with VM destruction plugged
  - A set of small vgic fixes
 
 x86:
  - Preserve pending INIT
  - RCU fixes in paravirtual async pf, VM teardown, and VMXOFF emulation
  - nVMX interrupt injection and dirty tracking fixes
  - initialize to make UBSAN happy
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJZhNZ2AAoJEED/6hsPKofoKDEH/iIw1pcgdEW2NP/kFtKXSMCK
 josdFwGPQMjBGzx6No4tfMCNDOjW2FKYXapN6CASAqMJo5H2krj8VHMVwm0h3lUl
 4RdbbkFTdfl/Znp8M39efFheWrjX+L37AltKV7xAgA7n8cO39KV4RReimzSc7aVq
 5dDt4k0dbF9/zXHxkGiKEhwaSSbZEEznQQ/09annSoOVe6om5esUrUtnUF5P99uz
 IhAsmJbZxE5VmowjT5MjaR1mXSLLNL55HWKvkf3B3ZGnyxQU+3Vz7IGf2Ma2j+jV
 IrdXA11NHDY1anDYgDhFlr3rTCPu9CBmTv4O8zsDRlX9TGpr8bBX2dvjRKl7uOo=
 =KM80
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "ARM:

   - Yet another race with VM destruction plugged

   - A set of small vgic fixes

  x86:

   - Preserve pending INIT

   - RCU fixes in paravirtual async pf, VM teardown, and VMXOFF
     emulation

   - nVMX interrupt injection and dirty tracking fixes

   - initialize to make UBSAN happy"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchg
  KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"
  KVM: nVMX: mark vmcs12 pages dirty on L2 exit
  kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
  KVM: nVMX: do not pin the VMCS12
  KVM: avoid using rcu_dereference_protected
  KVM: X86: init irq->level in kvm_pv_kick_cpu_op
  KVM: X86: Fix loss of pending INIT due to race
  KVM: async_pf: make rcu irq exit if not triggered from idle task
  KVM: nVMX: fixes to nested virt interrupt injection
  KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12
  KVM: arm/arm64: Handle hva aging while destroying the vm
  KVM: arm/arm64: PMU: Fix overflow interrupt injection
  KVM: arm/arm64: Fix bug in advertising KVM_CAP_MSI_DEVID capability
2017-08-04 15:18:27 -07:00
Nate Watterson
8eede5bc4e ata: ahci_platform: Add shutdown handler
The newly introduced ahci_platform_shutdown() method is called during
system shutdown to disable host controller DMA and interrupts in order
to avoid potentially corrupting or otherwise interfering with a new
kernel being started with kexec.

Signed-off-by: Nate Watterson <nwatters@codeaurora.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-08-04 13:34:14 -07:00
Leon Romanovsky
931b3c1a83 RDMA/mlx5: Fix existence check for extended address vector
The extended address vector is the highest bit in be32 variable,
but it was compared with the lowest. This patch fixes the endianness
of that check and removes already declared define.

Fixes: 17d2f88f92 ("IB/mlx5: Add ODP atomics support")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-04 14:24:05 -04:00
Linus Torvalds
c63716ab4d A bunch of fixes and follow-ups for -rc1 Luminous patches: issues with
->reencode_message() and last minute RADOS semantic changes in v12.1.2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZhI6QAAoJEEp/3jgCEfOLeQYH/0b92RFzmsqwPI+U7iXdD06O
 r0EXbT5dydMngJkWz/i3jBX8cMBvZyNhBh77VPDYXoFUp8//8uv5w73BkXe8JE08
 +gLZU4oP/k7kl/YBYXgCcJYj7eIBFzqNvsWurKKHY/X3xrvEZ0HT+oub92xOUgRM
 IBnZb1gZ4TJQT1MxqKOwb5aqcxaXlrOGfX7Di0aU3PFQXj5VnBI25NUQF1bgd9+A
 MbhHpob6cbWZWzVdf0fTl28q9pStq4qggevRSM/5ZH/bETO8C80XYTuaPoLcQ0pY
 VfpwgWIAPwotw9KU7W+ane13BURw76+pWMHaUZgiJKRyuRMBOT/gaER+AUeuR1o=
 =XH7K
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A bunch of fixes and follow-ups for -rc1 Luminous patches: issues with
  ->reencode_message() and last minute RADOS semantic changes in
  v12.1.2"

* tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-client:
  libceph: make RECOVERY_DELETES feature create a new interval
  libceph: upmap semantic changes
  crush: assume weight_set != null imples weight_set_size > 0
  libceph: fallback for when there isn't a pool-specific choose_arg
  libceph: don't call ->reencode_message() more than once per message
  libceph: make encode_request_*() work with r_mempool requests
2017-08-04 10:15:11 -07:00
Jerome Forissier
999616b853 tee: add forward declaration for struct device
tee_drv.h references struct device, but does not include device.h nor
platform_device.h. Therefore, if tee_drv.h is included by some file
that does not pull device.h nor platform_device.h beforehand, we have a
compile warning. Fix this by adding a forward declaration.

Signed-off-by: Jerome Forissier <jerome.forissier@linaro.org>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
2017-08-04 10:30:27 +02:00
Willem de Bruijn
a91dbff551 sock: ulimit on MSG_ZEROCOPY pages
Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
4ab6c99d99 sock: MSG_ZEROCOPY notification coalescing
In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. odify ubuf_info to store a inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.

Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].

Coalescing is limited to a few TSO frames worth of data to bound
notification latency.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
1f8b977ab3 sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
52267790ef sock: add MSG_ZEROCOPY
The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00
Willem de Bruijn
3ece782693 sock: skb_copy_ubufs support for compound pages
Refine skb_copy_ubufs to support compound pages. With upcoming TCP
zerocopy sendmsg, such fragments may appear.

The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS if data is not exactly page aligned.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages:
1. compute bytelength and minimum number of pages to store this.
2. allocate
3. copy, filling each page except the last to PAGE_SIZE bytes
4. update skb frag array

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:29 -07:00