Commit graph

900375 commits

Author SHA1 Message Date
Byungchul Park
a35d16905e rcu: Add basic support for kfree_rcu() batching
Recently a discussion about stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1]
which led to another discussion how to prepare for this situation.

This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches and there are
different ideas being discussed in those regards. This is an effort to
start simple, and build up from there. In the future, an extension to
use kfree_bulk and possibly per-slab batching could be done to further
improve performance due to cache-locality and slab-specific bulk free
optimizations. By using an array of pointers, the worker thread
processing the work would need to read lesser data since it does not
need to deal with large rcu_head(s) any longer.

Torture tests follow in the next patch and show improvements of around
5x reduction in number of  grace periods on a 16 CPU system. More
details and test data are in that patch.

There is an implication with rcu_barrier() with this patch. Since the
kfree_rcu() calls can be batched, and may not be handed yet to the RCU
machinery in fact, the monitor may not have even run yet to do the
queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
to wait for those kfree_rcu()s that are already made. So this means a
kfree_rcu() followed by an rcu_barrier() does not imply that memory will
be freed once rcu_barrier() returns.

Another implication is higher active memory usage (although not
run-away..) until the kfree_rcu() flooding ends, in comparison to
without batching. More details about this are in the second patch which
adds an rcuperf test.

Finally, in the near future we will get rid of kfree_rcu() special casing
within RCU such as in rcu_do_batch and switch everything to just
batching. Currently we don't do that since timer subsystem is not yet up
and we cannot schedule the kfree_rcu() monitor as the timer subsystem's
lock are not initialized. That would also mean getting rid of
kfree_call_rcu_nobatch() entirely.

[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-mst@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824

Cc: kernel-team@android.com
Cc: kernel-team@lge.com
Co-developed-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
[ paulmck: Applied 0day and Paul Walmsley feedback on ->monitor_todo. ]
[ paulmck: Make it work during early boot. ]
[ paulmck: Add a crude early boot self-test. ]
[ paulmck: Style adjustments and experimental docbook structure header. ]
Link: https://lore.kernel.org/lkml/alpine.DEB.2.21.9999.1908161931110.32497@viisi.sifive.com/T/#me9956f66cb611b95d26ae92700e1d901f46e8c59
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-01-24 10:17:03 -08:00
Eric Biggers
80f2388afa f2fs: fix race conditions in ->d_compare() and ->d_hash()
Since ->d_compare() and ->d_hash() can be called in RCU-walk mode,
->d_parent and ->d_inode can be concurrently modified, and in
particular, ->d_inode may be changed to NULL.  For f2fs_d_hash() this
resulted in a reproducible NULL dereference if a lookup is done in a
directory being deleted, e.g. with:

	int main()
	{
		if (fork()) {
			for (;;) {
				mkdir("subdir", 0700);
				rmdir("subdir");
			}
		} else {
			for (;;)
				access("subdir/file", 0);
		}
	}

... or by running the 't_encrypted_d_revalidate' program from xfstests.
Both repros work in any directory on a filesystem with the encoding
feature, even if the directory doesn't actually have the casefold flag.

I couldn't reproduce a crash in f2fs_d_compare(), but it appears that a
similar crash is possible there.

Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and
falling back to the case sensitive behavior if the inode is NULL.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-24 10:04:09 -08:00
Linus Torvalds
6381b44283 IOMMU Fixes for Linux v5.5-rc7
Two Fixes:
 
 	- Fix NULL-ptr dereference bug in Intel IOMMU driver
 
 	- Properly safe and restore AMD IOMMU performance counter
 	  registers when testing if they are writable.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAl4rKJ0ACgkQK/BELZcB
 GuM8BxAAse7i4EAB0DZ26aDVU/ZZAGBTc5pH+7IOvuGFWpu3Ik7siltMZKTa56XD
 5reQl1hUHyBesdwNcd+ZQz/0QzgMu0zEcxJtzLD5nbk2Rp1pDYab84+wA0oqftmN
 7TBqPJjqGYfnfkobPW+4SaiCSxM6XKEXYaJvXqF6dsz5fLHHqTVjGyEXuu0pWey5
 1AG1nNWey5TobR39ZUNVIGUEf4ftN8tOGutjGKch29OmMg1UkHZjDyYpVJurjSUl
 1YDf0WA2X5cUWzRg519o9Uxs+5y/tBItS2qMJ/Uwh1fp/kOYrGoEPXRH9brU3owh
 j8kVKyz/cNlvhdJN/QPVq/A+GBgzAq+u9FUGH68LBVx05cuCRGWb6qP/blEZWUSA
 zQvbHPIPqPBK0pYtbVCgJFCpOS5C6X33RMgbRr3XncAItmfEp2FH7tcHIsDZUzmO
 EGBgSPrhRFJrgfa3hopc93L4OEHmGy+20o7oT4kk0SBcHVDSyDZVBu1lX+/gAZKd
 ZoV5FyF9KNB+qQQ6E+vpL5gLh7BWiBgOI41Bdw5ut09I0gYURPdFWUPHRFwHK1/1
 KTSWLyuLkS4VxdhcwhA30eVmSrbfocPPwuUxcGn2p1YnaycFkDNoJnhifvFiOS6w
 tFOJ3wEVL2OX7On6U3jviVy7MQapX3MvWrWKlgg/UoMv36dbqRA=
 =ViDp
 -----END PGP SIGNATURE-----

Merge tag 'iommu-fixes-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:
 "Two fixes:

   - Fix NULL-ptr dereference bug in Intel IOMMU driver

   - Properly save and restore AMD IOMMU performance counter registers
     when testing if they are writable"

* tag 'iommu-fixes-v5.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Fix IOMMU perf counter clobbering during init
  iommu/vt-d: Call __dmar_remove_one_dev_info with valid pointer
2020-01-24 09:56:59 -08:00
Eric Biggers
5515eae647 f2fs: fix dcache lookup of !casefolded directories
Do the name comparison for non-casefolded directories correctly.

This is analogous to ext4's commit 66883da1ee ("ext4: fix dcache
lookup of !casefolded directories").

Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-01-24 09:53:02 -08:00
Linus Torvalds
3c45d7510c powerpc fixes for 5.5 #6
Fix our hash MMU code to avoid having overlapping ids between user and kernel,
 which isn't as bad as it sounds but led to crashes on some machines.
 
 A fix for the Power9 XIVE interrupt code, which could return the wrong interrupt
 state in obscure error conditions.
 
 A minor Kconfig fix for the recently added CONFIG_PPC_UV code.
 
 Thanks to:
   Aneesh Kumar K.V, Bharata B Rao, Cédric Le Goater, Frederic Barrat.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl4q3iMTHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgM1fEAC7YefQZsuIstV2bNH9xFJe/oEOSLWt
 DqZYfWZRBRZRdRMAP7+hmXpsLfHAB3QSGv4yEc2Pgm5Y5vYuv8MWqEKkpJoivIPW
 /Fu7mNGI06IWinMyl59dLA5fXPNeOzSNcdPnBdIUC/62XU414IGsy273X3LN4YiM
 hfXSpPA3tpSjVBTeqfKTJowXWMOvbkwiHMs2ziAu4Y9GDauJsU0UiOxMkl2MYT7Z
 PULO/SrjYSbP8+MazD/YdJ5k7F4rNV3bVDPUCU305tKIkrqRmBs+8DXjeUAMWVjj
 y2huUZ/0XXsr8QZyyGfMhEC1sJKAD3colG4LHUysbWz1oInY93CvXELdPAMXhqrH
 xXTzx9ae74Xk7t7WaGgq5xJGHJCDSg1aA2HPkUg6Gbk6iJIztOCkKMdSPvFMo18S
 p+Kjis8db83RPXQm3ooW7s/xhp6AaDiXd8khQ7A2efgt4SDTSPFvgv2q4CMDC1lP
 njchgzgPJ635MSd3CC/u6i7eViuUylb6SfhFVYY0DqOOvNJVrj1W6M4N7tpWSrRS
 PXJI2/NwlL3WIajKO3KfqTRI7QaSLtccTrXGLiGBxx/ooAowP94/WL3CYJAeuO2F
 NmA1GXH6/WXvzMMNm5NB0kCOUOOhtJk/MFuO7Agly6YL9p36fqVz4WIJVAqtwJLO
 uPu0xASJCgp9Mg==
 =r/RM
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Some more powerpc fixes for 5.5:

   - Fix our hash MMU code to avoid having overlapping ids between user
     and kernel, which isn't as bad as it sounds but led to crashes on
     some machines.

   - A fix for the Power9 XIVE interrupt code, which could return the
     wrong interrupt state in obscure error conditions.

   - A minor Kconfig fix for the recently added CONFIG_PPC_UV code.

  Thanks to Aneesh Kumar K.V, Bharata B Rao, Cédric Le Goater, Frederic
  Barrat"

* tag 'powerpc-5.5-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/hash: Fix sharing context ids between kernel & userspace
  powerpc/xive: Discard ESB load value when interrupt is invalid
  powerpc: Ultravisor: Fix the dependencies for CONFIG_PPC_UV
2020-01-24 09:49:20 -08:00
Linus Torvalds
274adbff45 drm fixes for 5.5-rc8
core/mst:
 - Fix SST branch device handling
 
 amdgpu:
 - enable renoir outside experimental
 
 i915:
 - Avoid overflow with huge userptr objects
 - uAPI fix to correctly handle negative values in
   engine->uabi_class/instance (cc: stable)
 
 panfrost:
 - Fix mapping of globally visible BO's (Boris)
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJeKoezAAoJEAx081l5xIa+hRMP/jC45Om4YDtsoH+75v41mYfO
 yUt1jBsCDXHAvPjBbRcoM2kbZU+ZcaDR4EZS7lA7QTW5CAosqzxrGdkzGobjmj5v
 TN/0BWWyQYtQ+lbt/pUxM+kktaV9IkjZIAUlW3msCwKsYE+nPIWfXDnW2ioFpitk
 sXk3vZYC4Lh6IVJ5PG879nz5XPpo2eb4sYYQ4PTFj65u436NEd8AFqmcDzp64Jhb
 xh1jLSxeS/JROyb/AmTHt0PgrGB5B08U8V7CAJXhIgf7zM28iiA3ynanoO3PauYb
 3uPg8eVdP2tOqcpplV5CBkiw/in2sPgRprMEg66cKYzo9rieWu7nPL+1zmaxD16s
 NIXSmov4Sm7yOS4LAGAckS12ejniy7L6OcP5N2S2Gz0HpIMXluBfLw3d4ZiIW5Q5
 yKdpmrwTfl0IfTiblweaHi8uT4K19vSqoXUf4bFa4UxqeYOtipHnNP907+nDs0JN
 GiHjIvrQJ4KH/A/JAkjEyNzHRAqJhJke84E519UP+mI+8paqzftsbDvf0ADA7nzZ
 q6TbZpTbEPV+NqnaH6n8volfh/WougW2BpxozMO3UP/474AZq1bXJIXi45ksatRW
 Cq0KQP8iI//C0xhCDPs7LQ/XzgkcQtDPZ1Nx6vWjrn6O3DMjKCm0RxZgfKj5CfbJ
 PovYqZQXJ01989DfXfQv
 =K7TY
 -----END PGP SIGNATURE-----

Merge tag 'drm-fixes-2020-01-24' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "This one has a core mst fix and two i915 fixes. amdgpu just enables
  some hw outside experimental.

  The panfrost fix is a little bigger than I'd like at this stage but it
  fixes a fairly fundamental problem with global shared buffers in that
  driver, and since it's confined to that driver and I've taken a look
  at it, I think it's fine to get into the tree now, so it can get
  stable propagated as well.

  core/mst:
   - Fix SST branch device handling

  amdgpu:
   - enable renoir outside experimental

  i915:
   - Avoid overflow with huge userptr objects
   - uAPI fix to correctly handle negative values in
     engine->uabi_class/instance (cc: stable)

  panfrost:
   - Fix mapping of globally visible BO's (Boris)"

* tag 'drm-fixes-2020-01-24' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: remove the experimental flag for renoir
  drm/panfrost: Add the panfrost_gem_mapping concept
  drm/i915: Align engine->uabi_class/instance with i915_drm.h
  drm/i915/userptr: fix size calculation
  drm/dp_mst: Handle SST-only branch device case
2020-01-24 09:38:04 -08:00
Sibi Sankar
600c39b343 remoteproc: qcom: q6v5-mss: Improve readability of reset_assert
Define AXI_GATING_VALID_OVERRIDE and fixup comments to improve readability
of Q6 modem reset sequence on SC7180 SoCs.

Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Sibi Sankar <sibis@codeaurora.org>
Link: https://lore.kernel.org/r/20200123131236.1078-3-sibis@codeaurora.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2020-01-24 09:34:07 -08:00
Sibi Sankar
01bf3fec38 remoteproc: qcom: q6v5-mss: Use regmap_read_poll_timeout
Replace the loop for HALT_ACK detection with regmap_read_poll_timeout.

Reviewed-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Sibi Sankar <sibis@codeaurora.org>
Link: https://lore.kernel.org/r/20200123131236.1078-2-sibis@codeaurora.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2020-01-24 09:34:03 -08:00
Christophe Leroy
ab10ae1c3b lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
The range passed to user_access_begin() by strncpy_from_user() and
strnlen_user() starts at 'src' and goes up to the limit of userspace
although reads will be limited by the 'count' param.

On 32 bits powerpc (book3s/32) access has to be granted for each
256Mbytes segment and the cost increases with the number of segments to
unlock.

Limit the range with 'count' param.

Fixes: 594cc251fd ("make 'user_access_begin()' do 'access_ok()'")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-24 09:27:34 -08:00
Jiri Wiesner
ab658b9fa7 netfilter: conntrack: sctp: use distinct states for new SCTP connections
The netlink notifications triggered by the INIT and INIT_ACK chunks
for a tracked SCTP association do not include protocol information
for the corresponding connection - SCTP state and verification tags
for the original and reply direction are missing. Since the connection
tracking implementation allows user space programs to receive
notifications about a connection and then create a new connection
based on the values received in a notification, it makes sense that
INIT and INIT_ACK notifications should contain the SCTP state
and verification tags available at the time when a notification
is sent. The missing verification tags cause a newly created
netfilter connection to fail to verify the tags of SCTP packets
when this connection has been created from the values previously
received in an INIT or INIT_ACK notification.

A PROTOINFO event is cached in sctp_packet() when the state
of a connection changes. The CLOSED and COOKIE_WAIT state will
be used for connections that have seen an INIT and INIT_ACK chunk,
respectively. The distinct states will cause a connection state
change in sctp_packet().

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 18:26:53 +01:00
Lukas Bulwahn
06b9c26993 docs: nvdimm: use ReST notation for subsection
The ACPI Device Specific Methods (_DSM) paragraph is intended to be a
subsection of the Submit Checklist Addendum section. Dan Williams however
used Markdown notation for this subsection, which does not parse as
intended in a ReST documentation.

Change the markup to ReST notation, as described in the Specific
guidelines for the kernel documentation section in
Documentation/doc-guide/sphinx.rst.

Fixes: 47843401e3 ("libnvdimm, MAINTAINERS: Maintainer Entry Profile")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20200118153620.8276-1-lukas.bulwahn@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:54:42 -07:00
Nathan Chancellor
e91440ddfb
ASoC: rt1015: Remove unnecessary const
Clang warns:

../sound/soc/codecs/rt1015.c:392:14: warning: duplicate 'const'
declaration specifier [-Wduplicate-decl-specifier]
static const SOC_ENUM_SINGLE_DECL(rt1015_boost_mode_enum, 0, 0,
             ^
../include/sound/soc.h:355:2: note: expanded from macro
'SOC_ENUM_SINGLE_DECL'
        SOC_ENUM_DOUBLE_DECL(name, xreg, xshift, xshift, xtexts)
        ^
../include/sound/soc.h:352:2: note: expanded from macro
'SOC_ENUM_DOUBLE_DECL'
        const struct soc_enum name = SOC_ENUM_DOUBLE(xreg, xshift_l, xshift_r, \
        ^
1 warning generated.

Remove the const after static to fix it.

Fixes: df31007400 ("ASoC: rt1015: add rt1015 amplifier driver")
Link: https://github.com/ClangBuiltLinux/linux/issues/845
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Link: https://lore.kernel.org/r/20200124155750.33753-1-natechancellor@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2020-01-24 16:53:57 +00:00
Marek Szyprowski
3a6adf3263
ASoC: max98090: silence lockdep warning
Commit 08df0d9a00 ("ASoC: max98090: revert "ASoC: max98090: fix lockdep
warning"") provided a good rationale for removing separate lock for the
SHDN register access. However it restored the lockdep warning during the
system boot. To silence the lockdep warning, mark the mutex taken in the
max98090_shdn_save() function with the lockdep class dedicated for the
runtime DAPM operations: SND_SOC_DAPM_CLASS_RUNTIME. This finally fixes
the following lockdep warning observed on Exynos4412-based Odroid U3
board:

======================================================
WARNING: possible circular locking dependency detected
5.5.0-rc7-next-20200123 #7329 Not tainted
------------------------------------------------------
alsactl/1105 is trying to acquire lock:
ed4f7cf4 (&card->dapm_mutex){+.+.}, at: max98090_shdn_save+0x1c/0x28

but task is already holding lock:
edb8d49c (&card->controls_rwsem){++++}, at: snd_ctl_ioctl+0xcc/0xbb8

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&card->controls_rwsem){++++}:
       snd_ctl_add_replace+0x3c/0x84
       dapm_create_or_share_kcontrol+0x24c/0x2e0
       snd_soc_dapm_new_widgets+0x308/0x594
       snd_soc_bind_card+0x834/0xa94
       devm_snd_soc_register_card+0x34/0x6c
       odroid_audio_probe+0x288/0x34c
       platform_drv_probe+0x6c/0xa4
       really_probe+0x200/0x48c
       driver_probe_device+0x78/0x1f8
       bus_for_each_drv+0x74/0xb8
       __device_attach+0xd4/0x16c
       bus_probe_device+0x88/0x90
       deferred_probe_work_func+0x3c/0xd0
       process_one_work+0x230/0x7bc
       worker_thread+0x44/0x524
       kthread+0x130/0x164
       ret_from_fork+0x14/0x20
       0x0

-> #0 (&card->dapm_mutex){+.+.}:
       lock_acquire+0xe8/0x270
       __mutex_lock+0x9c/0xb18
       mutex_lock_nested+0x1c/0x24
       max98090_shdn_save+0x1c/0x28
       max98090_put_enum_double+0x20/0x40
       snd_ctl_ioctl+0x190/0xbb8
       ksys_ioctl+0x484/0xb10
       ret_fast_syscall+0x0/0x28
       0xbede0564

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&card->controls_rwsem);
                               lock(&card->dapm_mutex);
                               lock(&card->controls_rwsem);
  lock(&card->dapm_mutex);

 *** DEADLOCK ***

1 lock held by alsactl/1105:
 #0: edb8d49c (&card->controls_rwsem){++++}, at: snd_ctl_ioctl+0xcc/0xbb8

stack backtrace:
CPU: 2 PID: 1105 Comm: alsactl Not tainted 5.5.0-rc7-next-20200123 #7329
Hardware name: Samsung Exynos (Flattened Device Tree)
[<c01126f0>] (unwind_backtrace) from [<c010e1e8>] (show_stack+0x10/0x14)
[<c010e1e8>] (show_stack) from [<c0b5234c>] (dump_stack+0xb4/0xe0)
[<c0b5234c>] (dump_stack) from [<c018a610>] (check_noncircular+0x1ec/0x208)
[<c018a610>] (check_noncircular) from [<c018ca2c>] (__lock_acquire+0x1210/0x25ec)
[<c018ca2c>] (__lock_acquire) from [<c018e728>] (lock_acquire+0xe8/0x270)
[<c018e728>] (lock_acquire) from [<c0b71928>] (__mutex_lock+0x9c/0xb18)
[<c0b71928>] (__mutex_lock) from [<c0b723c0>] (mutex_lock_nested+0x1c/0x24)
[<c0b723c0>] (mutex_lock_nested) from [<c086097c>] (max98090_shdn_save+0x1c/0x28)
[<c086097c>] (max98090_shdn_save) from [<c08613f8>] (max98090_put_enum_double+0x20/0x40)
[<c08613f8>] (max98090_put_enum_double) from [<c0833f20>] (snd_ctl_ioctl+0x190/0xbb8)
[<c0833f20>] (snd_ctl_ioctl) from [<c02cae14>] (ksys_ioctl+0x484/0xb10)
[<c02cae14>] (ksys_ioctl) from [<c0101000>] (ret_fast_syscall+0x0/0x28)
Exception stack(0xed331fa8 to 0xed331ff0)
...

Fixes: 08df0d9a00 ("ASoC: max98090: revert "ASoC: max98090: fix lockdep warning"")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Tzung-Bi Shih <tzungbi@google.com>
Link: https://lore.kernel.org/r/20200123134046.9769-1-m.szyprowski@samsung.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2020-01-24 16:53:38 +00:00
Yue Hu
5871023c3a zram: correct documentation about sysfs node of huge page writeback
sysfs node for huge page writeback is writeback rather than write.

Signed-off-by: Yue Hu <huyue2@yulong.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20200120102949.12132-1-zbestahu@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:52:05 -07:00
Randy Dunlap
a3e1c56a0b Documentation: zram: various fixes in zram.rst
Fix various items in zram.rst:
- typos/spellos
- punctuation
- grammar
- shell syntax
- indentation
- sysfs file names

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/77000e12-677a-62f6-9f78-343be5bd6630@infradead.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:50:29 -07:00
Jonathan Corbet
53b7f3aa41 Add a maintainer entry profile for documentation
Documentation should lead by example, so here's a basic maintainer entry
profile for this subsystem.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:48:39 -07:00
Jonathan Corbet
d96574b0b4 Add a document on how to contribute to the documentation
This is mostly a collection of thoughts for how people who want to help out
can make the docs better.  Hopefully the world will respond with a flurry
of useful patches.

Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:48:14 -07:00
Jonathan Corbet
bcac386f3d docs: Keep up with the location of NoUri
Sphinx 2.1 moved sphinx.environment.NoUri into sphinx.errors; that produced
this warning in the docs build:

  /usr/lib/python3.7/site-packages/sphinx/registry.py:473:
    RemovedInSphinx30Warning: sphinx.environment.NoUri is deprecated.

Grab NoUri from the right place and make the warning go away.  That symbol
was only added to sphinx.errors in 2.1, so we must still import it from the
old location when running in older versions.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-01-24 09:47:05 -07:00
Murphy Zhou
1a980b8cbf ovl: add splice file read write helper
Now overlayfs falls back to use default file splice read
and write, which is not compatiple with overlayfs, returning
EFAULT. xfstests generic/591 can reproduce part of this.

Tested this patch with xfstests auto group tests.

Signed-off-by: Murphy Zhou <jencce.kernel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-01-24 16:28:15 +01:00
Joerg Roedel
e3b5ee0cfb Merge branches 'iommu/fixes', 'arm/smmu', 'x86/amd', 'x86/vt-d' and 'core' into next 2020-01-24 15:39:39 +01:00
Adrian Huang
154e3a65f4 iommu/amd: Remove the unnecessary assignment
The assignment of the global variable 'iommu_detected' has been
moved from amd_iommu_init_dma_ops() to amd_iommu_detect(), so
this patch removes the assignment in amd_iommu_init_dma_ops().

Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:38:31 +01:00
Lu Baolu
857f081426 iommu/vt-d: Remove unnecessary WARN_ON_ONCE()
Address field in device TLB invalidation descriptor is qualified
by the S field. If S field is zero, a single page at page address
specified by address [63:12] is requested to be invalidated. If S
field is set, the least significant bit in the address field with
value 0b (say bit N) indicates the invalidation address range. The
spec doesn't require the address [N - 1, 0] to be cleared, hence
remove the unnecessary WARN_ON_ONCE().

Otherwise, the caller might set "mask = MAX_AGAW_PFN_WIDTH" in order
to invalidating all the cached mappings on an endpoint, and below
overflow error will be triggered.

[...]
UBSAN: Undefined behaviour in drivers/iommu/dmar.c:1354:3
shift exponent 64 is too large for 64-bit type 'long long unsigned int'
[...]

Reported-and-tested-by: Frank <fgndev@posteo.de>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:36:27 +01:00
Lu Baolu
b89b6605b8 iommu/vt-d: Unnecessary to handle default identity domain
The iommu default domain framework has been designed to take
care of setting identity default domain type. It's unnecessary
to handle this again in the VT-d driver. Hence, remove it.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:32:54 +01:00
Lu Baolu
9235cb13d7 iommu/vt-d: Allow devices with RMRRs to use identity domain
Since commit ea2447f700 ("intel-iommu: Prevent devices with
RMRRs from being placed into SI Domain"), the Intel IOMMU driver
doesn't allow any devices with RMRR locked to use the identity
domain. This was added to to fix the issue where the RMRR info
for devices being placed in and out of the identity domain gets
lost. This identity maps all RMRRs when setting up the identity
domain, so that devices with RMRRs could also use it.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:32:54 +01:00
Barret Rhoden
ce4cc52b51 iommu/vt-d: Add RMRR base and end addresses sanity check
The VT-d spec specifies requirements for the RMRR entries base and
end (called 'Limit' in the docs) addresses.

This commit will cause the DMAR processing to mark the firmware as
tainted if any RMRR entries that do not meet these requirements.

Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:32:53 +01:00
Barret Rhoden
f5a68bb075 iommu/vt-d: Mark firmware tainted if RMRR fails sanity check
RMRR entries describe memory regions that are DMA targets for devices
outside the kernel's control.

RMRR entries that fail the sanity check are pointing to regions of
memory that the firmware did not tell the kernel are reserved or
otherwise should not be used.

Instead of aborting DMAR processing, this commit marks the firmware
as tainted. These RMRRs will still be identity mapped, otherwise,
some devices, e.x. graphic devices, will not work during boot.

Signed-off-by: Barret Rhoden <brho@google.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes: f036c7fa0a ("iommu/vt-d: Check VT-d RMRR region in BIOS is reported as reserved")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:32:15 +01:00
Shuah Khan
8c17bbf6c8 iommu/amd: Fix IOMMU perf counter clobbering during init
init_iommu_perf_ctr() clobbers the register when it checks write access
to IOMMU perf counters and fails to restore when they are writable.

Add save and restore to fix it.

Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Fixes: 30861ddc9c ("perf/x86/amd: Add IOMMU Performance Counter resource management")
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:28:40 +01:00
Jerry Snitselaar
bf708cfb2f iommu/vt-d: Call __dmar_remove_one_dev_info with valid pointer
It is possible for archdata.iommu to be set to
DEFER_DEVICE_DOMAIN_INFO or DUMMY_DEVICE_DOMAIN_INFO so check for
those values before calling __dmar_remove_one_dev_info. Without a
check it can result in a null pointer dereference. This has been seen
while booting a kdump kernel on an HP dl380 gen9.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@vger.kernel.org # 5.3+
Cc: linux-kernel@vger.kernel.org
Fixes: ae23bfb68f ("iommu/vt-d: Detach domain before using a private one")
Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2020-01-24 15:23:50 +01:00
Qu Wenruo
1bbb97b8ce btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG]
For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071,
looped runs can lead to random failure, where scrub finds csum error.

The possibility is not high, around 1/20 to 1/100, but it's causing data
corruption.

The bug is observable after commit b12de52896 ("btrfs: scrub: Don't
check free space before marking a block group RO")

[CAUSE]
Dev-replace has two source of writes:

- Write duplication
  All writes to source device will also be duplicated to target device.

  Content:	Not yet persisted data/meta

- Scrub copy
  Dev-replace reused scrub code to iterate through existing extents, and
  copy the verified data to target device.

  Content:	Previously persisted data and metadata

The difference in contents makes the following race possible:
	Regular Writer		|	Dev-replace
-----------------------------------------------------------------
  ^                             |
  | Preallocate one data extent |
  | at bytenr X, len 1M		|
  v				|
  ^ Commit transaction		|
  | Now extent [X, X+1M) is in  |
  v commit root			|
 ================== Dev replace starts =========================
  				| ^
				| | Scrub extent [X, X+1M)
				| | Read [X, X+1M)
				| | (The content are mostly garbage
				| |  since it's preallocated)
  ^				| v
  | Write back happens for	|
  | extent [X, X+512K)		|
  | New data writes to both	|
  | source and target dev.	|
  v				|
				| ^
				| | Scrub writes back extent [X, X+1M)
				| | to target device.
				| | This will over write the new data in
				| | [X, X+512K)
				| v

This race can only happen for nocow writes. Thus metadata and data cow
writes are safe, as COW will never overwrite extents of previous
transaction (in commit root).

This behavior can be confirmed by disabling all fallocate related calls
in fsstress (*), then all related tests can pass a 2000 run loop.

*: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \
		   -f collapse=0 -f punch=0 -f resvsp=0"
   I didn't expect resvsp ioctl will fallback to fallocate in VFS...

[FIX]
Make dev-replace to require mandatory block group RO, and wait for current
nocow writes before calling scrub_chunk().

This patch will mostly revert commit 76a8efa171 ("btrfs: Continue replace
when set_block_ro failed") for dev-replace path.

The side effect is, dev-replace can be more strict on avaialble space, but
definitely worth to avoid data corruption.

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 76a8efa171 ("btrfs: Continue replace when set_block_ro failed")
Fixes: b12de52896 ("btrfs: scrub: Don't check free space before marking a block group RO")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-24 14:35:56 +01:00
Rafael J. Wysocki
0552e05fdf PM: core: Fix handling of devices deleted during system-wide resume
If a device is deleted by one of its system-wide resume callbacks
(for example, because it does not appear to be present or accessible
any more) along with its children, the resume of the children may
continue leading to use-after-free errors and other issues
(potentially).

Namely, if the device's children are resumed asynchronously, their
resume may have been scheduled already before the device's callback
runs and so the device may be deleted while dpm_wait_for_superior()
is being executed for them.  The memory taken up by the parent device
object may be freed then while dpm_wait() is waiting for the parent's
resume callback to complete, which leads to a use-after-free.
Moreover, the resume of the children is really not expected to
continue after they have been unregistered, so it must be terminated
right away in that case.

To address this problem, modify dpm_wait_for_superior() to check
if the target device is still there in the system-wide PM list of
devices and if so, to increment its parent's reference counter, both
under dpm_list_mtx which prevents device_del() running for the child
from dropping the parent's reference counter prematurely.

If the device is not present in the system-wide PM list of devices
any more, the resume of it cannot continue, so check that again after
dpm_wait() returns, which means that the parent's callback has been
completed, and pass the result of that check to the caller of
dpm_wait_for_superior() to allow it to abort the device's resume
if it is not there any more.

Link: https://lore.kernel.org/linux-pm/1579568452-27253-1-git-send-email-chanho.min@lge.com
Reported-by: Chanho Min <chanho.min@lge.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-24 14:27:05 +01:00
David S. Miller
08a45c59f1 Merge branch 'mptcp-part-two'
Christoph Paasch says:

====================
Multipath TCP part 2: Single subflow & RFC8684 support

v2 -> v3: Added RFC8684-style handshake (see below fore more details) and some minor fixes
v1 -> v2: Rebased on latest "Multipath TCP: Prerequisites" v3 series

This set adds MPTCP connection establishment, writing & reading MPTCP
options on data packets, a sysctl to allow MPTCP per-namespace, and self
tests. This is sufficient to establish and maintain a connection with a
MPTCP peer, but will not yet allow or initiate establishment of
additional MPTCP subflows.

We also add the necessary code for the RFC8684-style handshake.
RFC8684 obsoletes the experimental RFC6824 and makes MPTCP move-on to
version 1.

Originally our plan was to submit single-subflow and RFC8684 support in
two patchsets, but to simplify the merging-process and ensure that a coherent
MPTCP-version lands in Linux we decided to merge the two sets into a single
one.

The MPTCP patchset exclusively supports RFC 8684. Although all MPTCP
deployments are currently based on RFC 6824, future deployments will be
migrating to MPTCP version 1. 3GPP's 5G standardization also solely supports
RFC 8684. In addition, we believe that this initial submission of MPTCP will be
cleaner by solely supporting RFC 8684. If later on support for the old
MPTCP-version is required it can always be added in the future.

The major difference between RFC 8684 and RFC 6824 is that it has a better
support for servers using TCP SYN-cookies by reliably retransmitting the
MP_CAPABLE option.

Before ending this cover letter with some refs, it is worth mentioning
that we promise David Miller that merging this series will be rewarded by
Twitter dopamine hits :-D

Clone/fetch:
https://github.com/multipath-tcp/mptcp_net-next.git (tag: netdev-v3-part2)

Browse:
https://github.com/multipath-tcp/mptcp_net-next/tree/netdev-v3-part2

Thank you for your review. You can find us at mptcp@lists.01.org and
https://is.gd/mptcp_upstream
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
8ab183deb2 mptcp: cope with later TCP fallback
With MPTCP v1, passive connections can fallback to TCP after the
subflow becomes established:

syn + MP_CAPABLE ->
               <- syn, ack + MP_CAPABLE

ack, seq = 3    ->
        // OoO packet is accepted because in-sequence
        // passive socket is created, is in ESTABLISHED
	// status and tentatively as MP_CAPABLE

ack, seq = 2     ->
        // no MP_CAPABLE opt, subflow should fallback to TCP

We can't use the 'subflow' socket fallback, as we don't have
it available for passive connection.

Instead, when the fallback is detected, replace the mptcp
socket with the underlying TCP subflow. Beyond covering
the above scenario, it makes a TCP fallback socket as efficient
as plain TCP ones.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
d22f4988ff mptcp: process MP_CAPABLE data option
This patch implements the handling of MP_CAPABLE + data option, as per
RFC 6824 bis / RFC 8684: MPTCP v1.

On the server side we can receive the remote key after that the connection
is established. We need to explicitly track the 'missing remote key'
status and avoid emitting a mptcp ack until we get such info.

When a late/retransmitted/OoO pkt carrying MP_CAPABLE[+data] option
is received, we have to propagate the mptcp seq number info to
the msk socket. To avoid ABBA locking issue, explicitly check for
that in recvmsg(), where we own msk and subflow sock locks.

The above also means that an established mp_capable subflow - still
waiting for the remote key - can be 'downgraded' to plain TCP.

Such change could potentially block a reader waiting for new data
forever - as they hook to msk, while later wake-up after the downgrade
will be on subflow only.

The above issue is not handled here, we likely have to get rid of
msk->fallback to handle that cleanly.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
cc7972ea19 mptcp: parse and emit MP_CAPABLE option according to v1 spec
This implements MP_CAPABLE options parsing and writing according
to RFC 6824 bis / RFC 8684: MPTCP v1.

Local key is sent on syn/ack, and both keys are sent on 3rd ack.
MP_CAPABLE messages len are updated accordingly. We need the skbuff to
correctly emit the above, so we push the skbuff struct as an argument
all the way from tcp code to the relevant mptcp callbacks.

When processing incoming MP_CAPABLE + data, build a full blown DSS-like
map info, to simplify later processing.  On child socket creation, we
need to record the remote key, if available.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
65492c5a6a mptcp: move from sha1 (v0) to sha256 (v1)
For simplicity's sake use directly sha256 primitives (and pull them
as a required build dep).
Add optional, boot-time self-tests for the hmac function.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Florian Westphal
048d19d444 mptcp: add basic kselftest for mptcp
Add mptcp_connect tool:
xmit two files back and forth between two processes, several net
namespaces including some adding delays, losses and reordering.
Wrapper script tests that data was transmitted without corruption.

The "-c" command line option for mptcp_connect.sh is there for debugging:

The script will use tcpdump to create one .pcap file per test case, named
according to the namespaces, protocols, and connect address in use.
For example, the first test case writes the capture to
ns1-ns1-MPTCP-MPTCP-10.0.1.1.pcap.

The stderr output from tcpdump is printed after the test completes to
show tcpdump's "packets dropped by kernel" information.

Also check that userspace can't create MPTCP sockets when mptcp.enabled
sysctl is off.

The "-b" option allows to tune/lower send buffer size.
"-m mmap" can be used to test blocking io.  Default is non-blocking
io using read/write/poll.

Will run automatically on "make kselftest".

Note that the default timeout of 45 seconds is used even if there is a
"settings" changing it to 450. 45 seconds should be enough in most cases
but this depends on the machine running the tests.

A fix to correctly read the "settings" file has been proposed upstream
but not applied yet. It is not blocking the execution of these new tests
but it would be nice to have it:

  https://patchwork.kernel.org/patch/11204935/

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Matthieu Baerts
784325e9f0 mptcp: new sysctl to control the activation per NS
New MPTCP sockets will return -ENOPROTOOPT if MPTCP support is disabled
for the current net namespace.

We are providing here a way to control access to the feature for those
that need to turn it on or off.

The value of this new sysctl can be different per namespace. We can then
restrict the usage of MPTCP to the selected NS. In case of serious
issues with MPTCP, administrators can now easily turn MPTCP off.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
57040755a3 mptcp: allow collapsing consecutive sendpages on the same substream
If the current sendmsg() lands on the same subflow we used last, we
can try to collapse the data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
7a6a6cbc3e mptcp: recvmsg() can drain data from multiple subflows
With the previous patch in place, the msk can detect which subflow
has the current map with a simple walk, let's update the main
loop to always select the 'current' subflow. The exit conditions now
closely mirror tcp_recvmsg() to get expected timeout and signal
behavior.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Florian Westphal
1891c4a076 mptcp: add subflow write space signalling and mptcp_poll
Add new SEND_SPACE flag to indicate that a subflow has enough space to
accept more data for transmission.

It gets cleared at the end of mptcp_sendmsg() in case ssk has run
below the free watermark.

It is (re-set) from the wspace callback.

This allows us to use msk->flags to determine the poll mask.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Mat Martineau
648ef4b886 mptcp: Implement MPTCP receive path
Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

The subflow implements its own data_ready() ops, which ensures that the
pending data is in sequence - according to MPTCP seq number - dropping
out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
This allows the MPTCP socket layer to determine if more data is
available without having to consult the individual subflows.

It additionally validates the current mapping and propagates EoF events
to the connection socket.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
6d0060f600 mptcp: Write MPTCP DSS headers to outgoing data packets
Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, which is copied to page fragments and mapped in to MPTCP
DSS segments with size determined by the available page fragments and
the maximum mapping length allowed by the MPTCP specification. If
do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
the later skbs can either have duplicated DSS mapping information or
none at all, and the receiver can handle that.

The current implementation uses the subflow frag cache and tcp
sendpages to avoid excessive code duplication. More work is required to
ensure that it works correctly under memory pressure and to support
MPTCP-level retransmissions.

The MPTCP DSS checksum is not yet implemented.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
717e79c867 mptcp: Add setsockopt()/getsockopt() socket operations
set/getsockopt behaviour with multiple subflows is undefined.
Therefore, for now, we return -EOPNOTSUPP unless we're in fallback mode.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
214984901a mptcp: Add shutdown() socket operation
Call shutdown on all subflows in use on the given socket, or on the
fallback socket.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
79c0949e9a mptcp: Add key generation and token tree
Generate the local keys, IDSN, and token when creating a new socket.
Introduce the token tree to track all tokens in use using a radix tree
with the MPTCP token itself as the index.

Override the rebuild_header callback in inet_connection_sock_af_ops for
creating the local key on a new outgoing connection.

Override the init_req callback of tcp_request_sock_ops for creating the
local key on a new incoming connection.

Will be used to obtain the MPTCP parent socket to handle incoming joins.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
cf7da0d66c mptcp: Create SUBFLOW socket for incoming connections
Add subflow_request_sock type that extends tcp_request_sock
and add an is_mptcp flag to tcp_request_sock distinguish them.

Override the listen() and accept() methods of the MPTCP
socket proto_ops so they may act on the subflow socket.

Override the conn_request() and syn_recv_sock() handlers
in the inet_connection_sock to handle incoming MPTCP
SYNs and the ACK to the response SYN.

Add handling in tcp_output.c to add MP_CAPABLE to an outgoing
SYN-ACK response for a subflow_request_sock.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
cec37a6e41 mptcp: Handle MP_CAPABLE options for outgoing connections
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
2303f994b3 mptcp: Associate MPTCP context with TCP socket
Use ULP to associate a subflow_context structure with each TCP subflow
socket. Creating these sockets requires new bind and connect functions
to make sure ULP is set up immediately when the subflow sockets are
created.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
eda7acddf8 mptcp: Handle MPTCP TCP options
Add hooks to parse and format the MP_CAPABLE option.

This option is handled according to MPTCP version 0 (RFC6824).
MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
coordination with related code changes.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
f870fa0b57 mptcp: Add MPTCP socket stubs
Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00