Change the "BUG" to "WARNING" and disable the message because it triggers
occasionally in spite of the check in flush_cache_page_if_present.
The pte value extracted for the "from" page in copy_user_highpage is racy and
occasionally the pte is cleared before the flush is complete. I assume that
the page is simultaneously flushed by flush_cache_mm before the pte is cleared
as nullifying the fdc doesn't seem to cause problems.
I investigated various locking scenarios but I wasn't able to find a way to
sequence the flushes. This code is called for every COW break and locks impact
performance.
This patch is related to the bigger cache flush patch because we need the pte
on PA8800/PA8900 to flush using the vma context.
I have also seen this from copy_to_user_page and copy_from_user_page.
The messages appear infrequently when enabled.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
It turns out that networking.pdf has exceeded 100 chapters and
titles of chapters >= 100 collide with their counts in its table
of contents (TOC).
Increase relevant params by 0.6em in the preamble to avoid such
ugly collisions.
While at it, fix a typo in comment (subsection).
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Link: https://lore.kernel.org/r/bdb60ba3-7813-47d0-74f9-7c31dd912d95@gmail.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
clk_generated_best_diff() helps in finding the parent and the divisor to
compute a rate closest to the required one. However, it doesn't take into
account the request's range for the new rate. Make sure the new rate
is within the required range.
Fixes: 8a8f4bf0c4 ("clk: at91: clk-generated: create function to find best_diff")
Signed-off-by: Codrin Ciubotariu <codrin.ciubotariu@microchip.com>
Link: https://lore.kernel.org/r/20220413071318.244912-1-codrin.ciubotariu@microchip.com
Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Pass updated i_size in fscache_unuse_cookie() when called
from nfs_fscache_release_file(), which ensures the size of
an fscache object gets written to the cache storage. Failing
to do so results in unnessary reads from the NFS server, even
when the data is cached, due to a cachefiles object coherency
check failing with a trace similar to the following:
cachefiles_coherency: o=0000000e BAD osiz B=afbb3 c=0
This problem can be reproduced as follows:
#!/bin/bash
v=4.2; NFS_SERVER=127.0.0.1
set -e; trap cleanup EXIT; rc=1
function cleanup {
umount /mnt/nfs > /dev/null 2>&1
RC_STR="TEST PASS"
[ $rc -eq 1 ] && RC_STR="TEST FAIL"
echo "$RC_STR on $(uname -r) with NFSv$v and server $NFS_SERVER"
}
mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
rm -f /mnt/nfs/file1.bin > /dev/null 2>&1
dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1 > /dev/null 2>&1
echo 3 > /proc/sys/vm/drop_caches
echo Read file 1st time from NFS server into fscache
dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
umount /mnt/nfs && mount -o vers=$v,fsc $NFS_SERVER:/export /mnt/nfs
echo 3 > /proc/sys/vm/drop_caches
echo Read file 2nd time from fscache
dd if=/mnt/nfs/file1.bin of=/dev/null > /dev/null 2>&1
echo Check mountstats for NFS read
grep -q "READ: 0" /proc/self/mountstats # (1st number) == 0
[ $? -eq 0 ] && rc=0
Fixes: a6b5a28eb5 "nfs: Convert to new fscache volume/cookie API"
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
cpufreq_offline() calls offline() and exit() under the policy rwsem
But they are called outside the rwsem in cpufreq_online().
Make cpufreq_online() call offline() and exit() as well as online() and
init() under the policy rwsem to achieve a clear lock relationship.
All of the init() and online() implementations in the tree only
initialize the policy object without attempting to acquire the policy
rwsem and they won't call cpufreq APIs attempting to acquire it.
Signed-off-by: Schspa Shi <schspa@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
If policy initialization fails after the sysfs files are created,
there is a possibility to end up running show()/store() callbacks
for half-initialized policies, which may have unpredictable
outcomes.
Abort show()/store() in such a case by making sure the policy is active.
Also dectivate the policy on such failures.
Signed-off-by: Schspa Shi <schspa@gmail.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To enable NFSv4 to work correctly, NFSv4 client identifiers have
to be globally unique and persistent over client reboots. We
believe that in many cases, a good default identifier can be
chosen and set when a client system is imaged.
Because there are many different ways a system can be imaged,
provide an explanation of how NFSv4 client identifiers and
principals can be set by install scripts and imaging tools.
Additional cases, such as NFSv4 clients running in containers, also
need unique and persistent identifiers. The Linux NFS community
sets forth this explanation to aid those who create and manage
container environments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The documentation for nfs4_unique_id is out-of-date. In particular it
claim that when nfs4_unique_id is set, the host name is not used. since
Commit 55b592933b ("NFSv4: Fix nfs4_init_uniform_client_string for net
namespaces") both the unique_id AND the host name are used.
Update the documentation to match the code.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
NFSv4 can lose locks if, for example there is a network partition for
longer than the lease period. When this happens a warning message
NFS: __nfs4_reclaim_open_state: Lock reclaim failed!
is generated, possibly once for each lock (though rate limited).
This is potentially misleading as is can be read as suggesting that lock
reclaim was attempted. However the default behaviour is to not attempt
to recover locks (except due to server report).
This patch changes the reporting to produce at most one message for each
attempt to recover all state from a given server. The message reports
the server name and the number of locks lost if that number is non-zero.
It reports that locks were lost and give no suggestion as to whether
there was an attempt or not.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Now that everything is fully locked there is no need for container_users
to remain as an atomic, change it to an unsigned int.
Use 'if (group->container)' as the test to determine if the container is
present or not instead of using container_users.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/6-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Once userspace opens a group FD it is prevented from opening another
instance of that same group FD until all the prior group FDs and users of
the container are done.
The first is done trivially by checking the group->opened during group FD
open.
However, things get a little weird if userspace creates a device FD and
then closes the group FD. The group FD still cannot be re-opened, but this
time it is because the group->container is still set and container_users
is elevated by the device FD.
Due to this mismatched lifecycle we have the
vfio_group_try_dissolve_container() which tries to auto-free a container
after the group FD is closed but the device FD remains open.
Instead have the device FD hold onto a reference to the single group
FD. This directly prevents vfio_group_fops_release() from being called
when any device FD exists and makes the lifecycle model more
understandable.
vfio_group_try_dissolve_container() is removed as the only place a
container is auto-deleted is during vfio_group_fops_release(). At this
point the container_users is either 1 or 0 since all device FDs must be
closed.
Change group->opened to group->opened_file which points to the single
struct file * that is open for the group. If the group->open_file is
NULL then group->container == NULL.
If all device FDs have closed then the group's notifier list must be
empty.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/5-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is necessary to avoid various user triggerable races, for instance
racing SET_CONTAINER/UNSET_CONTAINER:
ioctl(VFIO_GROUP_SET_CONTAINER)
ioctl(VFIO_GROUP_UNSET_CONTAINER)
vfio_group_unset_container
int users = atomic_cmpxchg(&group->container_users, 1, 0);
// users == 1 container_users == 0
__vfio_group_unset_container(group);
container = group->container;
vfio_group_set_container()
if (!atomic_read(&group->container_users))
down_write(&container->group_lock);
group->container = container;
up_write(&container->group_lock);
down_write(&container->group_lock);
group->container = NULL;
up_write(&container->group_lock);
vfio_container_put(container);
/* woops we lost/leaked the new container */
This can then go on to NULL pointer deref since container == 0 and
container_users == 1.
Wrap all touches of container, except those on a performance path with a
known open device, with the group_rwsem.
The only user of vfio_group_add_container_user() holds the user count for
a simple operation, change it to just hold the group_lock over the
operation and delete vfio_group_add_container_user(). Containers now only
gain a user when a device FD is opened.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/4-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The split follows the pairing with the destroy functions:
- vfio_group_get_device_fd() destroyed by close()
- vfio_device_open() destroyed by vfio_device_fops_release()
- vfio_device_assign_container() destroyed by
vfio_group_try_dissolve_container()
The next patch will put a lock around vfio_device_assign_container().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/3-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is not a performance path, just use the group_rwsem to protect the
value.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/2-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Without locking userspace can trigger a UAF by racing
KVM_DEV_VFIO_GROUP_DEL with VFIO_GROUP_GET_DEVICE_FD:
CPU1 CPU2
ioctl(KVM_DEV_VFIO_GROUP_DEL)
ioctl(VFIO_GROUP_GET_DEVICE_FD)
vfio_group_get_device_fd
open_device()
intel_vgpu_open_device()
vfio_register_notifier()
vfio_register_group_notifier()
blocking_notifier_call_chain(&group->notifier,
VFIO_GROUP_NOTIFY_SET_KVM, group->kvm);
set_kvm()
group->kvm = NULL
close()
kfree(kvm)
intel_vgpu_group_notifier()
vdev->kvm = data
[..]
kvm_get_kvm(vgpu->kvm);
// UAF!
Add a simple rwsem in the group to protect the kvm while the notifier is
using it.
Note this doesn't fix the race internal to i915 where userspace can
trigger two VFIO_GROUP_NOTIFY_SET_KVM's before we reach a consumer of
vgpu->kvm and trigger this same UAF, it just makes the notifier
self-consistent.
Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Link: https://lore.kernel.org/r/1-v2-d035a1842d81+1bf-vfio_group_locking_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fix following coccicheck warning:
./virt/kvm/vfio.c:258:1-7: preceding lock on line 236
If kvm_vfio_file_iommu_group() failed, code would goto err_fdput with
mutex_lock acquired and then return ret. It might cause potential
deadlock. Move mutex_unlock bellow err_fdput tag to fix it.
Fixes: d55d9e7a45 ("kvm/vfio: Store the struct file in the kvm_vfio_group")
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20220517023441.4258-1-wanjiabing@vivo.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Convert rockchip,rk3368-cru.txt to YAML.
Changes against original bindings:
- Add clocks and clock-names because the device has to have
at least one input clock.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220329180550.31043-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Convert rockchip,rk3228-cru.txt to YAML.
Changes against original bindings:
Add clocks and clock-names because the device has to have
at least one input clock.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220330121923.24240-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Convert rockchip,rk3036-cru.txt to YAML.
Changes against original bindings:
Add clocks and clock-names because the device has to have
at least one input clock.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220330114847.18633-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Convert rockchip,rk3308-cru.txt to YAML.
Changes against original bindings:
- Add clocks and clock-names because the device has to have
at least one input clock.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220329184339.1134-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Not calling the function for dummy contexts will cause the context to
not be reset. During the next syscall, this will cause an error in
__audit_syscall_entry:
WARN_ON(context->context != AUDIT_CTX_UNUSED);
WARN_ON(context->name_count);
if (context->context != AUDIT_CTX_UNUSED || context->name_count) {
audit_panic("unrecoverable error in audit_syscall_entry()");
return;
}
These problematic dummy contexts are created via the following call
chain:
exit_to_user_mode_prepare
-> arch_do_signal_or_restart
-> get_signal
-> task_work_run
-> tctx_task_work
-> io_req_task_submit
-> io_issue_sqe
-> audit_uring_entry
Cc: stable@vger.kernel.org
Fixes: 5bd2182d58 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Julian Orth <ju.orth@gmail.com>
[PM: subject line tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
If the device is a detachable (and therefore lacks full keyboard), we may
still want to load this driver because the device might have some other
buttons or switches (e.g. volume and power buttons or a tablet mode
switch). In such case we do not want to register the "main" keyboard device
to allow userspace detect when the detachable keyboard is disconnected and
adjust the system behavior for the tablet mode.
Originally it was suggested to simply skip keyboard registration if row and
columns properties didn't exist, but that approach did not convey the
intent strongly enough and also had a slight problem for migrating existing
DTBs without updating the kernel first, so it was decided to introduce new
google,cros-ec-keyb-switches to explicitly mark devices that only have
axillary buttons and switches.
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20220516183452.942008-3-swboyd@chromium.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
If the ChromeOS board is a detachable, this cros-ec-keyb device won't
have a matrix keyboard but it may have some button switches, e.g. volume
buttons and power buttons. The driver still registers a keyboard though
and that leads to userspace confusion around where the keyboard is.
We tried to work around this by only registering the keyboard device when
rows/columns properties were specified for the device, but that led to
another problem where removing the rows/columns properties breaks the
existing binding. Technically before that commit the rows/columns
properties were required, otherwise the driver would fail to probe.
Removing the properties from devicetrees makes the driver fail to probe
unless the corresponding driver patch is present. Furthermore, this makes
requiring matrix keyboard properties for devices that really have a
keyboard impossible because the compatible drives the schema and now the
properties are optional.
Add a more specific compatible for this type of device that indicates to
the OS that there are only switches and no matrix keyboard present.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20220516183452.942008-2-swboyd@chromium.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Convert rockchip,px30-cru.txt to YAML.
Changes against original bindings:
Use compatible string: "rockchip,px30-pmucru"
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220330103923.11063-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Current dts files with RK3188/RK3066 'cru' nodes are manually verified.
In order to automate this process rockchip,rk3188-cru.txt has to be
converted to YAML.
Changed:
Add properties to fix notifications by clocks.yaml for example:
clocks
clock-names
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220329111323.3569-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Current dts files with RK3288 'cru' nodes are manually verified.
In order to automate this process rockchip,rk3288-cru.txt has to be
converted to YAML.
Changed:
Add properties to fix notifications by clocks.yaml for example:
clocks
clock-names
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220329113657.4567-1-jbx6244@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
There is a race condition in smb2_compound_op:
after_close:
num_rqst++;
if (cfile) {
cifsFileInfo_put(cfile); // sends SMB2_CLOSE to the server
cfile = NULL;
This is triggered by smb2_query_path_info operation that happens during
revalidate_dentry. In smb2_query_path_info, get_readable_path is called to
load the cfile, increasing the reference counter. If in the meantime, this
reference becomes the very last, this call to cifsFileInfo_put(cfile) will
trigger a SMB2_CLOSE request sent to the server just before sending this compound
request – and so then the compound request fails either with EBADF/EIO depending
on the timing at the server, because the handle is already closed.
In the first scenario, the race seems to be happening between smb2_query_path_info
triggered by the rename operation, and between “cleanup” of asynchronous writes – while
fsync(fd) likely waits for the asynchronous writes to complete, releasing the writeback
structures can happen after the close(fd) call. So the EBADF/EIO errors will pop up if
the timing is such that:
1) There are still outstanding references after close(fd) in the writeback structures
2) smb2_query_path_info successfully fetches the cfile, increasing the refcounter by 1
3) All writeback structures release the same cfile, reducing refcounter to 1
4) smb2_compound_op is called with that cfile
In the second scenario, the race seems to be similar – here open triggers the
smb2_query_path_info operation, and if all other threads in the meantime decrease the
refcounter to 1 similarly to the first scenario, again SMB2_CLOSE will be sent to the
server just before issuing the compound request. This case is harder to reproduce.
See https://bugzilla.samba.org/show_bug.cgi?id=15051
Cc: stable@vger.kernel.org
Fixes: 8de9e86c67 ("cifs: create a helper to find a writeable handle by path name")
Signed-off-by: Ondrej Hubsch <ohubsch@purestorage.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
We gate whether to IOPOLL for a request on whether the opcode is allowed
on a ring setup for IOPOLL and if it's got a file assigned. MSG_RING
is the only one that allows a file yet isn't pollable, it's merely
supported to allow communication on an IOPOLL ring, not because we can
poll for completion of it.
Put the assigned file early and clear it, so we don't attempt to poll
for it.
Reported-by: syzbot+1a0a53300ce782f8b3ad@syzkaller.appspotmail.com
Fixes: 3f1d52abf0 ("io_uring: defer msg-ring file validity check until command issue")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():
- The directory is casefolded
- The directory is encrypted
- The directory's encryption key is not yet set up
- The parent directory is *not* encrypted
The problem is incorrect handling of the lookup of ".." to get the
parent reference to update. fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted. It's passed through as-is, and
setting up the directory's key is not attempted. As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison. That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.
We could just remove this WARN_ON_ONCE(). However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names. This results in the standard bytewise comparison.
Fixes: 7ad08a58bf ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The f2fs_gc uses a bitmap to indicate pinned sections, but when disabling
chckpoint, we call f2fs_gc() with NULL_SEGNO which selects the same dirty
segment as a victim all the time, resulting in checkpoint=disable failure,
for example. Let's pick another one, if we fail to collect it.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Hulk Robot reported a BUG_ON:
==================================================================
EXT4-fs error (device loop3): ext4_mb_generate_buddy:805: group 0,
block bitmap and bg descriptor inconsistent: 25 vs 31513 free clusters
kernel BUG at fs/ext4/ext4_jbd2.c:53!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 25371 Comm: syz-executor.3 Not tainted 5.10.0+ #1
RIP: 0010:ext4_put_nojournal fs/ext4/ext4_jbd2.c:53 [inline]
RIP: 0010:__ext4_journal_stop+0x10e/0x110 fs/ext4/ext4_jbd2.c:116
[...]
Call Trace:
ext4_write_inline_data_end+0x59a/0x730 fs/ext4/inline.c:795
generic_perform_write+0x279/0x3c0 mm/filemap.c:3344
ext4_buffered_write_iter+0x2e3/0x3d0 fs/ext4/file.c:270
ext4_file_write_iter+0x30a/0x11c0 fs/ext4/file.c:520
do_iter_readv_writev+0x339/0x3c0 fs/read_write.c:732
do_iter_write+0x107/0x430 fs/read_write.c:861
vfs_writev fs/read_write.c:934 [inline]
do_pwritev+0x1e5/0x380 fs/read_write.c:1031
[...]
==================================================================
Above issue may happen as follows:
cpu1 cpu2
__________________________|__________________________
do_pwritev
vfs_writev
do_iter_write
ext4_file_write_iter
ext4_buffered_write_iter
generic_perform_write
ext4_da_write_begin
vfs_fallocate
ext4_fallocate
ext4_convert_inline_data
ext4_convert_inline_data_nolock
ext4_destroy_inline_data_nolock
clear EXT4_STATE_MAY_INLINE_DATA
ext4_map_blocks
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_regular_allocator
ext4_mb_good_group_nolock
ext4_mb_init_group
ext4_mb_init_cache
ext4_mb_generate_buddy --> error
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_restore_inline_data
set EXT4_STATE_MAY_INLINE_DATA
ext4_block_write_begin
ext4_da_write_end
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)
ext4_write_inline_data_end
handle=NULL
ext4_journal_stop(handle)
__ext4_journal_stop
ext4_put_nojournal(handle)
ref_cnt = (unsigned long)handle
BUG_ON(ref_cnt == 0) ---> BUG_ON
The lock held by ext4_convert_inline_data is xattr_sem, but the lock
held by generic_perform_write is i_rwsem. Therefore, the two locks can
be concurrent.
To solve above issue, we add inode_lock() for ext4_convert_inline_data().
At the same time, move ext4_convert_inline_data() in front of
ext4_punch_hole(), remove similar handling from ext4_punch_hole().
Fixes: 0c8d414f16 ("ext4: let fallocate handle inline data correctly")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220428134031.4153381-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Symlink's external data block is one kind of metadata block, and now
that almost all ext4 metadata block's page cache (e.g. directory blocks,
quota blocks...) belongs to bdev backing inode except the symlink. It
is essentially worked in data=journal mode like other regular file's
data block because probably in order to make it simple for generic VFS
code handling symlinks or some other historical reasons, but the logic
of creating external data block in ext4_symlink() is complicated. and it
also make things confused if user do not want to let the filesystem
worked in data=journal mode. This patch convert the final exceptional
case and make things clean, move the mapping of the symlink's external
data block to bdev like any other metadata block does.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-3-yi.zhang@huawei.com
Current ext4_getblk() might sleep if some resources are not valid or
could be race with a concurrent extents modifing procedure. So we
cannot call ext4_getblk() and ext4_map_blocks() to get map blocks in
the atomic context in some fast path (e.g. the upcoming procedure of
getting symlink external block in the RCU context), even if the map
extents have already been check and cached.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20220424140936.1898920-2-yi.zhang@huawei.com
In __ext4_super() we always overwrote the user specified journal_ioprio
value with a default value, expecting parse_apply_sb_mount_options() to
later correctly set ctx->journal_ioprio to the user specified value.
However, if parse_apply_sb_mount_options() returned early because of
empty sbi->es_s->s_mount_opts, the correct journal_ioprio value was
never set.
This patch fixes __ext4_super() to only use the default value if the
user has not specified any value for journal_ioprio.
Similarly, the remount behavior was to either use journal_ioprio
value specified during initial mount, or use the default value
irrespective of the journal_ioprio value specified during remount.
This patch modifies this to first check if a new value for ioprio
has been passed during remount and apply it. If no new value is
passed, use the value specified during initial mount.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Link: https://lore.kernel.org/r/20220418083545.45778-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:
// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m
[ Update function documentation for ext4_trim_all_free -- TYT ]
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Link: https://lore.kernel.org/r/1650214995-860245-1-git-send-email-dmtrmonakhov@yandex-team.ru
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Zoned devices are expected to have zone sizes in the range of 1-2GB for
ZNS SSDs and SMR HDDs have zone sizes of 256MB, so there is no need to
allow arbitrarily small zone sizes on btrfs.
But for testing purposes with emulated devices it is sometimes desirable
to create devices with as small as 4MB zone size to uncover errors.
So use 4MB as the smallest possible zone size and reject mounts of devices
with a smaller zone size.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs defaults to max_inline=2K to make small writes inlined into
metadata.
The default value is always a win, as even DUP/RAID1/RAID10 doubles the
metadata usage, it should still cause less physical space used compared
to a 4K regular extents.
But since the introduction of RAID1C3 and RAID1C4 it's no longer the case,
users may find inlined extents causing too much space wasted, and want
to convert those inlined extents back to regular extents.
Unfortunately defrag will unconditionally skip all inline extents, no
matter if the user is trying to converting them back to regular extents.
So this patch will add a small exception for defrag_collect_targets() to
allow defragging inline extents, if and only if the inlined extents are
larger than max_inline, allowing users to convert them to regular ones.
This also allows us to defrag extents like the following:
item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
generation 7 type 0 (inline)
inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
generation 7 type 1 (regular)
extent data disk byte 13631488 nr 4096
extent data offset 0 nr 16384 ram 16384
extent compression 1 (zlib)
Previously we're unable to do any defrag, since the first extent is
inlined, and the second one has no extent to merge.
Now we can defrag it to just one single extent, saving 48 bytes metadata
space.
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 13635584 nr 4096
extent data offset 0 nr 20480 ram 20480
extent compression 1 (zlib)
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following error message lack the "0x" obviously:
cannot mount because of unsupported optional features (4000)
Add the prefix to make it less confusing. This can happen on older
kernels that try to mount a filesystem with newer features so it makes
sense to backport to older trees.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reserving metadata units for creating an inode, we don't need to
reserve one extra unit for the inode ref item because when creating the
inode, at btrfs_create_new_inode(), we always insert the inode item and
the inode ref item in a single batch (a single btree insert operation,
and both ending up in the same leaf).
As we have accounted already one unit for the inode item, the extra unit
for the inode ref item is superfluous, it only makes us reserve more
metadata than necessary and often adding more reclaim pressure if we are
low on available metadata space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block_group->alloc_offset is an offset from the start of the block
group. OTOH, the ->meta_write_pointer is an address in the logical
space. So, we should compare the alloc_offset shifted with the
block_group->start.
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.
However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).
Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).
So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.
Example scenario, on a box with 32G of RAM:
$ btrfs subvolume create /mnt/sv1
$ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
$ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
$ free -m
total used free shared buff/cache available
Mem: 31937 186 26866 0 4883 31297
Swap: 8188 0 8188
# After this we get less 4G of free memory.
$ btrfs send /mnt/snap1 >/dev/null
$ free -m
total used free shared buff/cache available
Mem: 31937 186 22814 0 8935 31297
Swap: 8188 0 8188
The same, obviously, applies to an incremental send.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This documentation provides wrong node name for the Ethernet controller.
It should be "ethernet" instead of "smsc" as required by Ethernet
controller devicetree schema:
Documentation/devicetree/bindings/net/ethernet-controller.yaml
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220517111505.929722-4-o.rempel@pengutronix.de
Create initial schema for Microchip/SMSC LAN95xx USB Ethernet controllers and
import some of currently supported USB IDs form drivers/net/usb/smsc95xx.c
These devices are already used in some of DTs. So, this schema makes it official.
NOTE: there was no previously documented txt based DT binding for this
controllers.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220517111505.929722-3-o.rempel@pengutronix.de
Create schema for ASIX USB Ethernet controllers and import some of
currently supported USB IDs form drivers/net/usb/asix_devices.c
These devices are already used in some of DTs. So, this schema makes it official.
NOTE: there was no previously documented txt based DT binding for this
controllers.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220517111505.929722-2-o.rempel@pengutronix.de