Commit graph

60873 commits

Author SHA1 Message Date
Florian Fainelli
c59530d0d5 net: Move PHY statistics code into PHY library helpers
In order to make it possible for network device drivers that do not
necessarily have a phy_device attached, but still report PHY statistics,
have a preliminary refactoring consisting in creating helper functions
that encapsulate the PHY device driver knowledge within PHYLIB.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27 11:53:02 -04:00
Joel Pepper
ed76952072 usb: gadget: composite Allow for larger configuration descriptors
The composite framework allows us to create gadgets composed from many
different functions, which need to fit into a single configuration
descriptor.

Some functions (like uvc) can produce configuration descriptors upwards
of 2500 bytes on their own.

This patch increases the limit from 1024 bytes to 4096.

Signed-off-by: Joel Pepper <joel.pepper@rwth-aachen.de>
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com>
2018-04-27 10:17:10 +03:00
Linus Torvalds
0644f186fc virtio: fixups
Latest header update will break QEMU (if it's rebuilt with the new
 header) - and it seems that the code there is so fragile that any change
 in this header will break it.  Add a better interface so users do not
 need to change their code every time that header changes.
 
 Fix virtio console for spec compliance.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJa4L3jAAoJECgfDbjSjVRpJiAIAMLVjPeMTsES6BX4duG/jhhc
 QmAflHg73Qmgvanbpqit/B1TRRsOsVnUGQ/4SubfQdEFZld8u/1ZNur9LKDika7h
 qhCM1HN9KN3O7E4IIF45i8jmsXoqBWOIb3BqBdAyeqNDWH4q48524IvYizPMgkDd
 ZnEZ/2pRi2HRstlwBD/JTcsfWRp/nUjarxnj8ZhUEUDFbJfjr7sPTeDwPSDShuIQ
 PrC9U8gliNRuxuq1v5Afn9F6mQptgvMxMLmtUqvYydlYgwu7cJUQ+Qxp8i7rNfM8
 kCKkn/24UdUYHft4596bEEgDWR6nriMFCQAYKWlsCtwIvbZnURURl5TKT5ceI7Y=
 =N0il
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixups from Michael Tsirkin:

 - Latest header update will break QEMU (if it's rebuilt with the new
   header) - and it seems that the code there is so fragile that any
   change in this header will break it. Add a better interface so users
   do not need to change their code every time that header changes.

 - Fix virtio console for spec compliance.

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_console: reset on out of memory
  virtio_console: move removal code
  virtio_console: drop custom control queue cleanup
  virtio_console: free buffers after reset
  virtio: add ability to iterate over vqs
  virtio_console: don't tie bufs to a vq
  virtio_balloon: add array of stat names
2018-04-26 16:36:11 -07:00
Israel Rukshin
6082d9c9c9 net/mlx5: Fix mlx5_get_vector_affinity function
Adding the vector offset when calling to mlx5_vector2eqn() is wrong.
This is because mlx5_vector2eqn() checks if EQ index is equal to vector number
and the fact that the internal completion vectors that mlx5 allocates
don't get an EQ index.

The second problem here is that using effective_affinity_mask gives the same
CPU for different vectors.
This leads to unmapped queues when calling it from blk_mq_rdma_map_queues().
This doesn't happen when using affinity_hint mask.

Fixes: 2572cf57d7 ("mlx5: fix mlx5_get_vector_affinity to start from completion vector 0")
Fixes: 05e0cc84e0 ("net/mlx5: Fix get vector affinity helper function")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-04-26 12:43:19 -07:00
Willem de Bruijn
83aa025f53 udp: add gso support to virtual devices
Virtual devices such as tunnels and bonding can handle large packets.
Only segment packets when reaching a physical or loopback device.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:09:12 -04:00
Willem de Bruijn
bec1f6f697 udp: generate gso with UDP_SEGMENT
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.

To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.

A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.

Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.

The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.

Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
     3197 MB/s 54232 msg/s 54232 calls/s
         6,457,754,262      cycles

    tcp gso
     1765 MB/s 29939 msg/s 29939 calls/s
        11,203,021,806      cycles

    tcp without tso/gso *
      739 MB/s 12548 msg/s 12548 calls/s
        11,205,483,630      cycles

    udp
      876 MB/s 14873 msg/s 624666 calls/s
        11,205,777,429      cycles

    udp gso
     2139 MB/s 36282 msg/s 36282 calls/s
        11,204,374,561      cycles

   [*] after reverting commit 0a6b2a1dc2
       ("tcp: switch to GSO being always on")

Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:

  perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:08:04 -04:00
Willem de Bruijn
ee80d1ebe5 udp: add udp gso
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.

UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.

The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).

IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.

[ Export __udp_gso_segment for ipv6. -DaveM ]

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26 15:07:42 -04:00
Omar Sandoval
bf0ddaba65 blk-mq: fix sysfs inflight counter
When the blk-mq inflight implementation was added, /proc/diskstats was
converted to use it, but /sys/block/$dev/inflight was not. Fix it by
adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-26 09:02:01 -06:00
Arnd Bergmann
21f2db5c73 Two fixes for v4.17-rc cycle
Fix a build regression with split object directories reported by Russell
 and fix range sizes for omap4 cm2 and prm modules.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAlraJXYRHHRvbnlAYXRv
 bWlkZS5jb20ACgkQG9Q+yVyrpXNV5g//Y7bnLVOPGTu73EiB4erJr6OHlZjtzBE/
 O/QQ0UwHZvmugzztPAfEvJg+s2O9IT6nloxupJHtmGpE43b7Bz47z7PAqSaI10vT
 9CJ9xwmRyobkAPnYc9deQpQwmsg4pOYFjtsFTzWB/88AgadhqjRDzIjwGIM1SDvN
 EKxcS+LA33erebbpgiLAIf+4IGvu+meENEHxBYIA/5KLdcYUTw0dVXSkpR301iV6
 R4wW5a1nrqac8HORu+CBmehs0VI3YMJw9tMcIrWDm//ZsPVoXGP61kM6lZxlCB0S
 FbOMVGO7GmcdrdhY0BaAKa7/KqSXEVBjPtZjZdOlnCDq1YNoUvrpIGn+k5x2jt1d
 NI03+FaCVqAVGWQ11UywnM55aAmLDYMkY3kUG6HySJL8zKw8m0xGHVFN8JgLI1JU
 ag3JlCbd7WNkAffLgUO+fobta6P0ASaxBXQ+88aOh9Yp6evuHBLVd/maC1+qNp7I
 YEVw5HupVpCukPlNmSVpypH9+vfVdRcmrxGZiCoskmwoW+8JnmvPWjsvulFc1nqh
 89lnz0XAMzHOTOmaK93s+kiJlZDoKJgrDs9B20Jtunur6El7oChR+f5z/AVmNfMr
 zessucoRlQ2u4kYMqw/oDKoyE6bkWXhwFB5vjZaz8kXE5HGWF0HCVTnmBF79H/9B
 C8Nx3K9FNyA=
 =5SNP
 -----END PGP SIGNATURE-----

Merge tag 'omap-for-v4.17/fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes

Pull "Two fixes for v4.17-rc cycle" from Tony Lindgren:

Fix a build regression with split object directories reported by Russell
and fix range sizes for omap4 cm2 and prm modules.

* tag 'omap-for-v4.17/fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
  ARM: OMAP2+: Fix build when using split object directories
  ARM: dts: Fix cm2 and prm sizes for omap4
2018-04-26 16:54:12 +02:00
Thomas Gleixner
a3ed0e4393 Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME
Revert commits

92af4dcb4e ("tracing: Unify the "boot" and "mono" tracing clocks")
127bfa5f43 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
7250a4047a ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
d6c7270e91 ("timekeeping: Remove boot time specific code")
f2d6fdbfd2 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
d6ed449afd ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
72199320d4 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")

As stated in the pull request for the unification of CLOCK_MONOTONIC and
CLOCK_BOOTTIME, it was clear that we might have to revert the change.

As reported by several folks systemd and other applications rely on the
documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
changes. After resume daemons time out and other timeout related issues are
observed. Rafael compiled this list:

* systemd kills daemons on resume, after >WatchdogSec seconds
  of suspending (Genki Sky).  [Verified that that's because systemd uses
  CLOCK_MONOTONIC and expects it to not include the suspend time.]

* systemd-journald misbehaves after resume:
  systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
corrupted or uncleanly shut down, renaming and replacing.
  (Mike Galbraith).

* NetworkManager reports "networking disabled" and networking is broken
  after resume 50% of the time (Pavel).  [May be because of systemd.]

* MATE desktop dims the display and starts the screensaver right after
  system resume (Pavel).

* Full system hang during resume (me).  [May be due to systemd or NM or both.]

That happens on debian and open suse systems.

It's sad, that these problems were neither catched in -next nor by those
folks who expressed interest in this change.

Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Reported-by: Genki Sky <sky@genki.is>,
Reported-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kevin Easton <kevin@guarana.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
2018-04-26 14:53:32 +02:00
Linus Torvalds
69bfd470f4 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlrg99EACgkQnJ2qBz9k
 QNme6Qf/TxZve1OySo02ZEVz2MilmmYdEadZbL4muwwagI0FyfKXcmLmPBSXDbjN
 kwTubMaBEv3vCX5R6d9eAXa1knTjm7Wg7j/CyuYXJ46yn2LJRzvNix7/ZtC7rlnS
 vBDWvEUKjCtP/3gfSSOhz46vcs9GBC3O0733v84F9erFobcH8ccMLONoU7tG+GxP
 Zrl32w5xggbqF2zGOrt1uylpk4oqCy2mzZ5egTafPezIHo6HT1HiLku2YB5KBTXQ
 bbcEN/gH9z6hvjjUoY8MDfTZ/UcF2j5L4QLZa2PwRjuUVEBVIGRt3txu5d4mvyfi
 e/f5QeE+TcO92xZkR8qZeqafh4KWpg==
 =8BK7
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fix from Jan Kara:
 "A fix of a fsnotify race causing panics / softlockups"

* tag 'for_v4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Fix fsnotify_mark_connector race
2018-04-25 21:23:38 -07:00
Linus Torvalds
3442097b76 SCSI fixes on 20180425
8 bug fixes, one spelling update and one tracepoint addition.  The
 most serious is probably the mpt3sas write same fix because it means
 anyone using these controllers sees errors when modern filesystems try
 to issue discards.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCWuDoQyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishU0eAP0QvCH8
 NF2L35OadCr7I1Nvcb8h/OKsVtF6IIpFDD/0DAEA/FwV9wxTknA2OoSWhFzxPfMY
 EkQR56i7DQAvX3Agrno=
 =jMLe
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Eight bug fixes, one spelling update and one tracepoint addition.

  The most serious is probably the mptsas write same fix because it
  means anyone using these controllers sees errors when modern
  filesystems try to issue discards"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: target: fix crash with iscsi target and dvd
  scsi: sd_zbc: Avoid that resetting a zone fails sporadically
  scsi: sd: Defer spinning up drive while SANITIZE is in progress
  scsi: megaraid_sas: Do not log an error if FW successfully initializes.
  scsi: ufs: add trace event for ufs upiu
  scsi: core: remove reference to scsi_show_extd_sense()
  scsi: mptsas: Disable WRITE SAME
  scsi: fnic: fix spelling mistake in fnic stats "Abord" -> "Abort"
  scsi: scsi_debug: IMMED related delay adjustments
  scsi: iscsi: respond to netlink with unicast when appropriate
2018-04-25 21:13:40 -07:00
Linus Torvalds
8fba70b085 for-linus-20180425
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa4LCXAAoJEPfTWPspceCm4DkQAMT9oc4gVTooyRdjFkUwwYQp
 oPQr8cNwp1ieB9n85sCVvo8AqP+hcRrp99/sNtMUfdTRcSAumHoILfJF8NcBHWW1
 6t609oi73runWbKnMvv+cTNFjwkUT30HHlIZRG5Qtn4wsy6vZAQEiv7J0z7PWays
 Lz7Hc1dfu2EbUoE4Phbe/wAqkQmw5HMGCCBltA9YhaApHTudR0i0bLyEN39zNyeF
 GifGkqETJ847HD00x25Uko008K0Mg6C+bkBwXBXw/E0mzO+QBG14U1WcLCw7USP2
 0KwaO1/F5SiaAUFBbW6TGCAYr/0M8JH9PRqAnGQ0sFlrnsh4gYrYEhnXtzMAbJb5
 AfFZxGc12XzTHcrNF+Genx81NpVCCblgEPxkZky8FXUZS6p91X0kNZmtfsaFCl7e
 3rc3reORz+9FaM481kY52Acw1J3gZZwdryXS040911yWRvtdS1dk80Q0FnL1qq44
 WvoXawk78x59+tcGwyC57dPmTAaFMGHiFQvx4zM5EkoBqTSTtfHkHQMeyFk3Er6U
 eSl+2Cp1FWiWZo4sJsmQtmhOshvBeIENyU4H3HaovsWFcOacFyLQSks2FSPquM4G
 gUsg6yADbamotdiFpbALM97cEcN4se38WuMNbXsqk3gTBHsCML52m7f8IWpgU3ZW
 hWuEU++093bAHADmhG81
 =335E
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180425' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "I ended up sitting on this about a week longer than I wanted to, since
  we were hashing out details with a timeout change. I've now killed
  that patch, so we can flush the existing queue in due time.

  This contains:

   - Fix for an old regression, where entering the queue can be
     disturbed by a signal to the process. This can cause spurious EIO.
     Fix from Alan Jenkins.

   - cdrom information leak fix from Dan.

   - Trivial helper for testing queue FUA from Dave Chinner, part of his
     O_DIRECT FUA series.

   - Series of swim fixes from Finn that actually makes it work again.

   - Loop O_DIRECT corruption fix, which caused data corruption in
     production for us. From me.

   - BFQ crash fix from me.

   - bcache maintainer update. Michael no longer has the time to do it,
     Coly has stepped up to serve as the new maintainer.

   - blkcg locking fixes from Jiang Biao.

   - Revert of a change from this merge window from Ming, that causes an
     issue on some hardware.

   - Minor clarification doc addition from Linus Walleij"

* tag 'for-linus-20180425' of git://git.kernel.dk/linux-block: (22 commits)
  Revert "blk-mq: remove code for dealing with remapping queue"
  block: mq: Add some minor doc for core structs
  bcache: mark Coly Li as bcache maintainer
  MAINTAINERS: Remove me as maintainer of bcache
  blkcg: init root blkcg_gq under lock
  blkcg: small fix on comment in blkcg_init_queue
  blkcg: don't hold blkcg lock when deactivating policy
  block: add blk_queue_fua() helper function
  cdrom: information leak in cdrom_ioctl_media_changed()
  bfq-iosched: ensure to clear bic/bfqq pointers when preparing request
  blk-mq: start request gstate with gen 1
  block/swim: Select appropriate drive on device open
  block/swim: Fix IO error at end of medium
  block/swim: Check drive type
  block/swim: Rename macros to avoid inconsistent inverted logic
  block/swim: Don't log an error message for an invalid ioctl
  block/swim: Remove extra put_disk() call from error path
  block/swim: Fix array bounds check
  m68k/mac: Don't remap SWIM MMIO region
  loop: handle short DIO reads
  ...
2018-04-25 21:05:15 -07:00
David S. Miller
a9537c937c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merging net into net-next to help the bpf folks avoid
some really ugly merge conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 23:04:22 -04:00
David S. Miller
25eb0ea717 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-04-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix to clear the percpu metadata_dst that could otherwise carry
   stale ip_tunnel_info, from William.

2) Fix that reduces the number of passes in x64 JIT with regards to
   dead code sanitation to avoid risk of prog rejection, from Gianluca.

3) Several fixes of sockmap programs, besides others, fixing a double
   page_put() in error path, missing refcount hold for pinned sockmap,
   adding required -target bpf for clang in sample Makefile, from John.

4) Fix to disable preemption in __BPF_PROG_RUN_ARRAY() paths, from Roman.

5) Fix tools/bpf/ Makefile with regards to a lex/yacc build error
   seen on older gcc-5, from John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-25 22:55:33 -04:00
Arnaud Pouliquen
fcd58037f2 remoteproc: fix crashed parameter logic on stop call
Fix rproc_add_subdev parameter name and inverse the crashed logic.

Fixes: 880f5b3882 ("remoteproc: Pass type of shutdown to subdev remove")
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Arnaud Pouliquen <arnaud.pouliquen@st.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2018-04-25 16:43:55 -07:00
Michael S. Tsirkin
24a7e4d207 virtio: add ability to iterate over vqs
For cleanup it's helpful to be able to simply scan all vqs and discard
all data. Add an iterator to do that.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-04-25 20:33:19 +03:00
Alexander Duyck
53cd4d8e4d macvlan: Provide function for interfaces to release HW offload
This patch provides a basic function to allow a lower device to disable
macvlan offload if it was previously enabled on a given macvlan. The idea
here is to allow for recovery from failure should the lowerdev run out of
resources.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-04-25 08:26:19 -07:00
Alexander Duyck
6cb1937d4e macvlan: Add function to test for destination filtering support
This patch adds a function indicating if a given macvlan can fully supports
destination filtering, especially as it relates to unicast traffic. For
those macvlan interfaces that do not support destination filtering such
passthru or source mode filtering we should not be enabling offload
support.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-04-25 08:26:19 -07:00
Alexander Duyck
a222311283 macvlan: macvlan_count_rx shouldn't be static inline AND extern
It doesn't make sense to define macvlan_count_rx as a static inline and
then add a forward declaration after that as an extern. I am dropping the
extern declaration since it seems like it is something that likely got
missed when the function was made an inline.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-04-25 08:26:19 -07:00
Alexander Duyck
7d775f6347 macvlan: Rename fwd_priv to accel_priv and add accessor function
This change renames the fwd_priv member to accel_priv as this more
accurately reflects the actual purpose of this value. In addition I am
adding an accessor which will allow us to further abstract this in the
future if needed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-04-25 08:26:19 -07:00
Linus Walleij
fe644072df block: mq: Add some minor doc for core structs
As it came up in discussion on the mailing list that the semantic
meaning of 'blk_mq_ctx' and 'blk_mq_hw_ctx' isn't completely
obvious to everyone, let's add some minimal kerneldoc for a
starter.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-25 07:58:18 -06:00
David S. Miller
c749fa181b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-04-24 23:59:11 -04:00
Linus Torvalds
24cac7009c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix rtnl deadlock in ipvs, from Julian Anastasov.

 2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls,
    bad MAC address update sequence, request side races on command IO
    timeouts).

 3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.

 4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.

 5) Packet scheduler ife action fixes (malformed TLV lengths, etc.) from
    Alexander Aring.

 6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.

 7) Missing netlink attribute policies in rtm_ipv6_policy table, from
    Eric Dumazet.

 8) Missing socket address length checks in l2tp and pppoe connect, from
    Guillaume Nault.

 9) Fix netconsole over team and bonding, from Xin Long.

10) Fix race with AF_PACKET socket state bitfields, from Willem de
    Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (51 commits)
  ice: Fix insufficient memory issue in ice_aq_manage_mac_read
  sfc: ARFS filter IDs
  net: ethtool: Add missing kernel doc for FEC parameters
  packet: fix bitfield update race
  ice: Do not check INTEVENT bit for OICR interrupts
  ice: Fix incorrect comment for action type
  ice: Fix initialization for num_nodes_added
  igb: Fix the transmission mode of queue 0 for Qav mode
  ixgbevf: ensure xdp_ring resources are free'd on error exit
  team: fix netconsole setup over team
  amd-xgbe: Only use the SFP supported transceiver signals
  amd-xgbe: Improve KR auto-negotiation and training
  amd-xgbe: Add pre/post auto-negotiation phy hooks
  pppoe: check sockaddr length in pppoe_connect()
  l2tp: check sockaddr length in pppol2tp_connect()
  net: phy: marvell: clear wol event before setting it
  ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy
  bonding: do not set slave_dev npinfo before slave_enable_netpoll in bond_enslave
  tcp: don't read out-of-bounds opsize
  ibmvnic: Clean actual number of RX or TX pools
  ...
2018-04-24 14:16:40 -07:00
Florian Fainelli
d805c52093 net: ethtool: Add missing kernel doc for FEC parameters
While adding support for ethtool::get_fecparam and set_fecparam, kernel
doc for these functions was missed, add those.

Fixes: 1a5f3da20b ("net: ethtool: add support for forward error correction modes")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:38:42 -04:00
NeilBrown
82266e98dd rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()
Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, though
they do take a spinlock without irq protection.
So revise the comments to accurately state the contexts in which
these functions can be called.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:21:45 -04:00
NeilBrown
0c6f69a5e3 rhashtable: remove outdated comments about grow_decision etc
grow_decision and shink_decision no longer exist, so remove
the remaining references to them.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 13:21:45 -04:00
Joakim Tjernlund
6510bbc88e mtd: cfi: cmdset_0001: Do not allow read/write to suspend erase block.
Currently it is possible to read and/or write to suspend EB's.
Writing /dev/mtdX or /dev/mtdblockX from several processes may
break the flash state machine.

Signed-off-by: Joakim Tjernlund <joakim.tjernlund@infinera.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Boris Brezillon <boris.brezillon@bootlin.com>
2018-04-24 17:41:18 +02:00
Yafang Shao
a06ac0d67d Revert "net: init sk_cookie for inet socket"
This reverts commit <c6849a3ac1> ("net: init sk_cookie for inet socket")

Per discussion with Eric, when update sock_net(sk)->cookie_gen, the
whole cache cache line will be invalidated, as this cache line is shared
with all cpus, that may cause great performace hit.

Bellow is the data form Eric.
"Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on
my host" when running synflood test.

Have to revert it to prevent from cache line false sharing.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 11:15:32 -04:00
Tal Gilboa
623ad75522 net/dim: Support adaptive TX moderation
Interrupt moderation for TX traffic requires different profiles than RX
interrupt moderation. The main goal here is to reduce interrupt rate and
allow better payload aggregation by keeping SKBs in the TX queue a bit
longer. Ping-pong behavior would get a profile with a short timer, so
latency wouldn't increase for these scenarios. There might be a slight
degradation in bandwidth for single stream with large message sizes, since
net.ipv4.tcp_limit_output_bytes is limiting the allowed TX traffic, but
with many streams performance is always improved.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 10:15:07 -04:00
Tal Gilboa
026a807c2d net/dim: Rename *_get_profile() functions to *_get_rx_moderation()
Preparation for introducing adaptive TX to net DIM.

Signed-off-by: Tal Gilboa <talgi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24 10:15:07 -04:00
Taehee Yoo
cd9a5a1580 netfilter: ebtables: remove EBT_MATCH and EBT_NOMATCH
EBT_MATCH and EBT_NOMATCH are used to change return value.
match functions(ebt_xxx.c) return false when received frame is not matched
and returns true when received frame is matched.
but, EBT_MATCH_ITERATE understands oppositely.
so, to change return value, EBT_MATCH and EBT_NOMATCH are used.
but, we can use operation '!' simply.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24 10:29:16 +02:00
John Fastabend
ba6b8de423 bpf: sockmap, map_release does not hold refcnt for pinned maps
Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this BPF programs may not be in-use which is not
what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-24 00:49:45 +02:00
Roman Gushchin
6899b32b5b bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY
Running bpf programs requires disabled preemption,
however at least some* of the BPF_PROG_RUN_ARRAY users
do not follow this rule.

To fix this bug, and also to make it not happen in the future,
let's add explicit preemption disabling/re-enabling
to the __BPF_PROG_RUN_ARRAY code.

* for example:
 [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
 ...
 [   17.640890]  inet6_create+0x3eb/0x520
 [   17.641405]  __sock_create+0x242/0x340
 [   17.641939]  __sys_socket+0x57/0xe0
 [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
 [   17.642944]  SyS_socket+0xa/0x10
 [   17.643357]  do_syscall_64+0x79/0x220
 [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-04-23 23:20:11 +02:00
Denis Bolotin
1ac4329a1c qed: Add configuration information to register dump and debug data
Configuration information is added to the debug data collection, in
addition to register dump.
Added qed_dbg_nvm_image() that receives an image type, allocates a
buffer and reads the image. The images are saved in the buffers and the
dump size is updated.

Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 12:05:57 -04:00
Yafang Shao
c6849a3ac1 net: init sk_cookie for inet socket
With sk_cookie we can identify a socket, that is very helpful for
traceing and statistic, i.e. tcp tracepiont and ebpf.
So we'd better init it by default for inet socket.
When using it, we just need call atomic64_read(&sk->sk_cookie).

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23 11:56:44 -04:00
Hans de Goede
02cfde67df virt: vbox: Move declarations of vboxguest private functions to private header
Move the declarations of functions from vboxguest_utils.c which are only
meant for vboxguest internal use from include/linux/vbox_utils.h to
drivers/virt/vboxguest/vboxguest_core.h.

Cc: stable@vger.kernel.org
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-23 13:41:55 +02:00
Tetsuo Handa
903f9db10f tty: Don't call panic() at tty_ldisc_init()
syzbot is reporting kernel panic [1] triggered by memory allocation failure
at tty_ldisc_get() from tty_ldisc_init(). But since both tty_ldisc_get()
and caller of tty_ldisc_init() can cleanly handle errors, tty_ldisc_init()
does not need to call panic() when tty_ldisc_get() failed.

[1] https://syzkaller.appspot.com/bug?id=883431818e036ae6a9981156a64b821110f39187

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-23 11:05:52 +02:00
Daniel Kurtz
dd709e72cb earlycon: Use a pointer table to fix __earlycon_table stride
Commit 99492c39f3 ("earlycon: Fix __earlycon_table stride") tried to fix
__earlycon_table stride by forcing the earlycon_id struct alignment to 32
and asking the linker to 32-byte align the __earlycon_table symbol.  This
fix was based on commit 07fca0e57f ("tracing: Properly align linker
defined symbols") which tried a similar fix for the tracing subsystem.

However, this fix doesn't quite work because there is no guarantee that
gcc will place structures packed into an array format.  In fact, gcc 4.9
chooses to 64-byte align these structs by inserting additional padding
between the entries because it has no clue that they are supposed to be in
an array.  If we are unlucky, the linker will assign symbol
"__earlycon_table" to a 32-byte aligned address which does not correspond
to the 64-byte aligned contents of section "__earlycon_table".

To address this same problem, the fix to the tracing system was
subsequently re-implemented using a more robust table of pointers approach
by commits:
 3d56e331b6 ("tracing: Replace syscall_meta_data struct array with pointer array")
 6549864629 ("tracepoints: Fix section alignment using pointer array")
 e4a9ea5ee7 ("tracing: Replace trace_event struct array with pointer array")

Let's use this same "array of pointers to structs" approach for
EARLYCON_TABLE.

Fixes: 99492c39f3 ("earlycon: Fix __earlycon_table stride")
Signed-off-by: Daniel Kurtz <djkurtz@chromium.org>
Suggested-by: Aaron Durbin <adurbin@chromium.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-23 10:06:59 +02:00
David S. Miller
986e54cd68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-04-21

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when
   one task is detaching a BPF prog via perf_event_detach_bpf_prog()
   and another one dumping through bpf_prog_array_copy_info(). For
   the latter we move the copy_to_user() out of the bpf_event_mutex
   lock to fix it, from Yonghong.

2) Fix test_sock and test_sock_addr.sh failures. The former was
   hitting rlimit issues and the latter required ping to specify
   the address family, from Yonghong.

3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.

4) Add generated files to BPF kselftests gitignore that were previously
   missed, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-22 21:15:59 -04:00
Linus Torvalds
c1e9dae0a9 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A small set of timer fixes:

   - Evaluate the -ETIME condition correctly in the imx tpm driver

   - Fix the evaluation order of a condition in posix cpu timers

   - Use pr_cont() in the clockevents code to prevent ugly message
     splitting

   - Remove __current_kernel_time() which is now unused to prevent that
     new users show up.

   - Remove a stale forward declaration"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/imx-tpm: Correct -ETIME return condition check
  posix-cpu-timers: Ensure set_process_cpu_timer is always evaluated
  timekeeping: Remove __current_kernel_time()
  timers: Remove stale struct tvec_base forward declaration
  clockevents: Fix kernel messages split across multiple lines
2018-04-22 10:49:02 -07:00
Linus Torvalds
38f0b33e6d Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A larger set of updates for perf.

  Kernel:

   - Handle the SBOX uncore monitoring correctly on Broadwell CPUs which
     do not have SBOX.

   - Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The
     percentage of preempting and non-preempting context switches help
     understanding the nature of workloads (CPU or IO bound) that are
     running on a machine. This adds the kernel facility and userspace
     changes needed to show this information in 'perf script' and 'perf
     report -D' (Alexey Budankov)

   - Remove a WARN_ON() in the trace/kprobes code which is pointless
     because the return error code is already telling the caller what's
     wrong.

   - Revert a fugly workaround for clang BPF targets.

   - Fix sample_max_stack maximum check and do not proceed when an error
     has been detect, return them to avoid misidentifying errors (Jiri
     Olsa)

   - Add SPDX idenitifiers and get rid of GPL boilderplate.

  Tools:

   - Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar)

   - Support MAP_FIXED_NOREPLACE, noticed when updating the
     tools/include/ copies (Arnaldo Carvalho de Melo)

   - Add '\n' at the end of parse-options error messages (Ravi Bangoria)

   - Add s390 support for detailed/verbose PMU event description (Thomas
     Richter)

   - perf annotate fixes and improvements:

      * Allow showing offsets in more than just jump targets, use the
        new 'O' hotkey in the TUI, config ~/.perfconfig
        annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho
        de Melo)

      * Use the resolved variable names from objdump disassembled lines
        to make them more compact, just like was already done for some
        instructions, like "mov", this eventually will be done more
        generally, but lets now add some more to the existing mechanism
        (Arnaldo Carvalho de Melo)

   - perf record fixes:

      * Change warning for missing topology sysfs entry to debug, as not
        all architectures have those files, s390 being one of those
        (Thomas Richter)

      * Remove old error messages about things that unlikely to be the
        root cause in modern systems (Andi Kleen)

   - perf sched fixes:

      * Fix -g/--call-graph documentation (Takuya Yamamoto)

   - perf stat:

      * Enable 1ms interval for printing event counters values in
        (Alexey Budankov)

   - perf test fixes:

      * Run dwarf unwind on arm32 (Kim Phillips)

      * Remove unused ptrace.h include from LLVM test, sidesteping older
        clang's lack of support for some asm constructs (Arnaldo
        Carvalho de Melo)

      * Fixup BPF test using epoll_pwait syscall function probe, to cope
        with the syscall routines renames performed in this development
        cycle (Arnaldo Carvalho de Melo)

   - perf version fixes:

      * Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version
        --build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as
        libaudit won't be used in that case, print info about
        syscall_table support instead (Jin Yao)

   - Build system fixes:

      * Use HAVE_..._SUPPORT used consistently (Jin Yao)

      * Restore READ_ONCE() C++ compatibility in tools/include (Mark
        Rutland)

      * Give hints about package names needed to build jvmti (Arnaldo
        Carvalho de Melo)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs
  perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server"
  coresight: Move to SPDX identifier
  perf test BPF: Fixup BPF test using epoll_pwait syscall function probe
  perf tests mmap: Show which tracepoint is failing
  perf tools: Add '\n' at the end of parse-options error messages
  perf record: Remove suggestion to enable APIC
  perf record: Remove misleading error suggestion
  perf hists browser: Clarify top/report browser help
  perf mem: Allow all record/report options
  perf trace: Support MAP_FIXED_NOREPLACE
  perf: Remove superfluous allocation error check
  perf: Fix sample_max_stack maximum check
  perf: Return proper values for user stack errors
  perf list: Add s390 support for detailed/verbose PMU event description
  perf script: Extend misc field decoding with switch out event type
  perf report: Extend raw dump (-D) out with switch out event type
  perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]
  tools/headers: Synchronize kernel ABI headers, v4.17-rc1
  trace_kprobe: Remove warning message "Could not insert probe at..."
  ...
2018-04-22 10:17:01 -07:00
David S. Miller
e0ada51db9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were simple overlapping changes in microchip
driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 16:32:48 -04:00
David S. Miller
1b80f86ed6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-04-21

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21 15:56:15 -04:00
Andrey Konovalov
12c8f25a01 kasan: add no_sanitize attribute for clang builds
KASAN uses the __no_sanitize_address macro to disable instrumentation of
particular functions.  Right now it's defined only for GCC build, which
causes false positives when clang is used.

This patch adds a definition for clang.

Note, that clang's revision 329612 or higher is required.

[andreyknvl@google.com: remove redundant #ifdef CONFIG_KASAN check]
  Link: http://lkml.kernel.org/r/c79aa31a2a2790f6131ed607c58b0dd45dd62a6c.1523967959.git.andreyknvl@google.com
Link: http://lkml.kernel.org/r/4ad725cc903f8534f8c8a60f0daade5e3d674f8d.1523554166.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Paul Lawrence <paullawrence@google.com>
Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Greg Thelen
2e898e4c0a writeback: safer lock nesting
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
the page's memcg is undergoing move accounting, which occurs when a
process leaves its memcg for a new one that has
memory.move_charge_at_immigrate set.

unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
the given inode is switching writeback domains.  Switches occur when
enough writes are issued from a new domain.

This existing pattern is thus suspicious:
    lock_page_memcg(page);
    unlocked_inode_to_wb_begin(inode, &locked);
    ...
    unlocked_inode_to_wb_end(inode, locked);
    unlock_page_memcg(page);

If both inode switch and process memcg migration are both in-flight then
unlocked_inode_to_wb_end() will unconditionally enable interrupts while
still holding the lock_page_memcg() irq spinlock.  This suggests the
possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
    lock_page_memcg
    unlocked_inode_to_wb_begin
    unlocked_inode_to_wb_end
    <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                    test_clear_page_writeback
                                    lock_page_memcg
                                    <deadlock>
    unlock_page_memcg

Due to configuration limitations this deadlock is not currently possible
because we don't mix cgroup writeback (a cgroupv2 feature) and
memory.move_charge_at_immigrate (a cgroupv1 feature).

If the kernel is hacked to always claim inode switching and memcg
moving_account, then this script triggers lockup in less than a minute:

  cd /mnt/cgroup/memory
  mkdir a b
  echo 1 > a/memory.move_charge_at_immigrate
  echo 1 > b/memory.move_charge_at_immigrate
  (
    echo $BASHPID > a/cgroup.procs
    while true; do
      dd if=/dev/zero of=/mnt/big bs=1M count=256
    done
  ) &
  while true; do
    sync
  done &
  sleep 1h &
  SLEEP=$!
  while true; do
    echo $SLEEP > a/cgroup.procs
    echo $SLEEP > b/cgroup.procs
  done

The deadlock does not seem possible, so it's debatable if there's any
reason to modify the kernel.  I suggest we should to prevent future
surprises.  And Wang Long said "this deadlock occurs three times in our
environment", so there's more reason to apply this, even to stable.
Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
see "[PATCH for-4.4] writeback: safer lock nesting"
https://lkml.org/lkml/2018/4/11/146

Wang Long said "this deadlock occurs three times in our environment"

[gthelen@google.com: v4]
  Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
[akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
Fixes: 682aa8e1a6 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Wang Long <wanglong19@meituan.com>
Acked-by: Wang Long <wanglong19@meituan.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Kees Cook
e01e80634e fork: unconditionally clear stack on fork
One of the classes of kernel stack content leaks[1] is exposing the
contents of prior heap or stack contents when a new process stack is
allocated.  Normally, those stacks are not zeroed, and the old contents
remain in place.  In the face of stack content exposure flaws, those
contents can leak to userspace.

Fixing this will make the kernel no longer vulnerable to these flaws, as
the stack will be wiped each time a stack is assigned to a new process.
There's not a meaningful change in runtime performance; it almost looks
like it provides a benefit.

Performing back-to-back kernel builds before:
	Run times: 157.86 157.09 158.90 160.94 160.80
	Mean: 159.12
	Std Dev: 1.54

and after:
	Run times: 159.31 157.34 156.71 158.15 160.81
	Mean: 158.46
	Std Dev: 1.46

Instead of making this a build or runtime config, Andy Lutomirski
recommended this just be enabled by default.

[1] A noisy search for many kinds of stack content leaks can be seen here:
https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak

I did some more with perf and cycle counts on running 100,000 execs of
/bin/true.

before:
Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
Mean:  221015379122.60
Std Dev: 4662486552.47

after:
Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
Mean:  217745009865.40
Std Dev: 5935559279.99

It continues to look like it's faster, though the deviation is rather
wide, but I'm not sure what I could do that would be less noisy.  I'm
open to ideas!

Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20 17:18:35 -07:00
Linus Torvalds
a72db42cee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Unbalanced refcounting in TIPC, from Jon Maloy.

 2) Only allow TCP_MD5SIG to be set on sockets in close or listen state.
    Once the connection is established it makes no sense to change this.
    From Eric Dumazet.

 3) Missing attribute validation in neigh_dump_table(), also from Eric
    Dumazet.

 4) Fix address comparisons in SCTP, from Xin Long.

 5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller.

 6) Fix tunnel refcounting in l2tp, from Guillaume Nault.

 7) Fix double list insert in team driver, from Paolo Abeni.

 8) af_vsock.ko module was accidently made unremovable, from Stefan
    Hajnoczi.

 9) Fix reference to freed llc_sap object in llc stack, from Cong Wang.

10) Don't assume netdevice struct is DMA'able memory in virtio_net
    driver, from Michael S. Tsirkin.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
  net/smc: fix shutdown in state SMC_LISTEN
  bnxt_en: Fix memory fault in bnxt_ethtool_init()
  virtio_net: sparse annotation fix
  virtio_net: fix adding vids on big-endian
  virtio_net: split out ctrl buffer
  net: hns: Avoid action name truncation
  docs: ip-sysctl.txt: fix name of some ipv6 variables
  vmxnet3: fix incorrect dereference when rxvlan is disabled
  llc: hold llc_sap before release_sock()
  MAINTAINERS: Direct networking documentation changes to netdev
  atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit"
  net: qmi_wwan: add Wistron Neweb D19Q1
  net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN"
  net: stmmac: Disable ACS Feature for GMAC >= 4
  net: mvpp2: Fix DMA address mask size
  net: change the comment of dev_mc_init
  net: qualcomm: rmnet: Fix warning seen with fill_info
  tun: fix vlan packet truncation
  tipc: fix infinite loop when dumping link monitor summary
  tipc: fix use-after-free in tipc_nametbl_stop
  ...
2018-04-20 09:34:39 -07:00
Linus Torvalds
b9abdcfd10 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes.

  Some of that is only a matter with fault injection (broken handling of
  small allocation failure in various mount-related places), but the
  last one is a root-triggerable stack overflow, and combined with
  userns it gets really nasty ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  Don't leak MNT_INTERNAL away from internal mounts
  mm,vmscan: Allow preallocating memory for register_shrinker().
  rpc_pipefs: fix double-dput()
  orangefs_kill_sb(): deal with allocation failures
  jffs2_kill_sb(): deal with failed allocations
  hypfs_kill_super(): deal with failed allocations
2018-04-20 09:15:14 -07:00
Linus Torvalds
0d9cf33b4a \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlrXYi4ACgkQnJ2qBz9k
 QNmp5Af9HhwGDuTgY+OcVEeZzXQ30BbFQNP4cby5LZ3Wmffio8T97c2DnlDDjr2E
 l8Biq5HTm3Gg5x7WEuopyrS6L4on76RDyr1Ohf34mOoIcxdcWQ7XpiyA8jeP9uLn
 9IJSPYPbA2RsQk4yr8GBRlVQh/udJfSG0vQOQ5cPylgICLsfxuMIC294vFfcgDt1
 wLOy24INlJYbyECMIQKrleyoZQF37Cv7v12KPU0w3eLtuNX0rdI4gwpOS4eEb27G
 2KrAsRYCb5KyFl3mRXBeCPNytiJ3ffFgvFC1Z57GVWVmNnadj/Jctw3zFirsMC2d
 j7IUA4XI/T+1Gql5rCOXcSgeJ0LRZA==
 =KQzD
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

 - isofs memory leak fix

 - two fsnotify fixes of event mask handling

 - udf fix of UTF-16 handling

 - couple other smaller cleanups

* tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix leak of UTF-16 surrogates into encoded strings
  fs: ext2: Adding new return type vm_fault_t
  isofs: fix potential memory leak in mount option parsing
  MAINTAINERS: add an entry for FSNOTIFY infrastructure
  fsnotify: fix typo in a comment about mark->g_list
  fsnotify: fix ignore mask logic in send_to_group()
  isofs compress: Remove VLA usage
  fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init
  fanotify: fix logic of events on child
2018-04-20 09:01:26 -07:00