The majority of this is temperature sensor additions for various devices
and fixes to the trigger type of the thermal interrupts.
Other than that there are various minor fixes across the board.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEiOrDCAFJzPfAjcif3SOs138+s6EFAmEWmI8THHRyZWRpbmdA
bnZpZGlhLmNvbQAKCRDdI6zXfz6zoUMWD/9wdd5/zlOG3XHYGlo30TzoHLW6nO6E
VmZm5RlJ7JB7XRxFLrB22d0Kt9L6fm8UW5GOGCK4sG47I/+YyNUFDtn+TpQJeUvM
iQ7ImX9X4bFqVn7ul6xWhdCEN+p3ZTRcqTAtgEjJevs9i86EIZzYNtLaIHRqvxwT
9S7/6jtESZNRSVoH0/wQay0lOHoBoHhSzJTGSS61yGak/NAQDRdU0xNkvTarZbVb
VCojDx8uNuAuEUqvPxY9BRdhHy6PAPebSem60Akm+S76HUQ4CIea9A/SExsOfq0/
Zv07nMVKHjxAkKp+Mr8QWCpbJpdrf6snUU1nuGsOvnOwvgHQtatVlsYs81HlByGu
lI3mdajL8XSkxCIHnl2uXOXMuOnlh7BBuE6ie0/S3QN6yOMEI17kAJUnoB2eYitx
wfswrH3sVlJ5xFhb9oqhfz3VfjG69PPBLudm2YGLVs4pYVfvGS88Pqes//6jKfsT
1TgRmJEAwYaFCrhPJXjwnZniHc4tc6SowQ9XsPyhV9LepwbAQKF6VMzVOOqU9UW/
2fGlemi4aKWRmQRAlLT8kA2IS8fip7NOXSl/ffcGGnpqM5o38kikZsYRWgS+7kxT
M24mNefKwTy7FQjnfPPXfRo2xBTqXKNRmdUEHFdMwCTxNnVF8bf3I96IO8zsb408
vH6+nkJ/Ty7HaA==
=8RvP
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmEa0YAACgkQmmx57+YA
GNlbRA/8CvbB9v0CzVxIyEZQHEsTP3FZAu5kDUFxjG32yqNxmWTfGV197wQ7xLdI
R80L3LMf2kfWwwVi66dUiYpOVdUSg7Z+vZmieZhX4Z2rjtDdWzHYZYHF45Sc2UUr
yvJa41pCAx1GzI7vyfPVSRnIRpoc1GHumJzEQT8K4K7w9owImboZfgwufwkd7zB/
Y6XdAaFUiZjYsddAaiEE6NFAGAP/QSWQLRYmJB923P7kCKytS4zQp/bcUH+CrrrE
mr4ATkvrW5mclmaDhWjFsIHVUmpFUPiooCqy4eEg65muLm1MS30+6LO2U3F3+Vtt
hvAYIDtaI5WBrgwYVBnWMJuPtP4UZdybdmWJnHXKH+jVU3a7jjOo/59uI2ZKapYj
yOImKxJFMkJh/qPEDIyECtNXrbc+hqy5Pj3PryztEKx0lW8okl8eqDUN3JtUNkQc
AP5oMP36vdOhP2nZkfy8LlPz/1woQOlC2cPbP9mVU74kw2t2eybbinNLnU0pgAm/
bNh7In09ivbC9LUNoP63c1gS+ZF4yzdzomHNqzqIIy2i+RLrccPiIWg7DgMwX5ze
MtRHbuGNmMqoGgSzQx5u5PMPaw8aSRnNuoUiD7EjfydIqI90/0NLQ0lDrqMD/ctV
ftrlkNTVM6Y3OdGTyWFhmVAUjLeljY0yEaqBaIb+2YHS2icAYmI=
=zHxH
-----END PGP SIGNATURE-----
Merge tag 'tegra-for-5.15-arm-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into arm/dt
ARM: tegra: Device tree changes for v5.15-rc1
The majority of this is temperature sensor additions for various devices
and fixes to the trigger type of the thermal interrupts.
Other than that there are various minor fixes across the board.
* tag 'tegra-for-5.15-arm-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
ARM: tegra: tamonten: Fix UART pad setting
ARM: tegra: nexus7: Improve thermal zones
ARM: tegra: acer-a500: Improve thermal zones
ARM: tegra: acer-a500: Use verbose variant of atmel,wakeup-method value
ARM: tegra: acer-a500: Add power supplies to accelerometer
ARM: tegra: acer-a500: Remove bogus USB VBUS regulators
ARM: tegra: jetson-tk1: Correct interrupt trigger type of temperature sensor
ARM: tegra: dalmore: Correct interrupt trigger type of temperature sensor
ARM: tegra: cardhu: Correct interrupt trigger type of temperature sensor
ARM: tegra: apalis: Correct interrupt trigger type of temperature sensor
ARM: tegra: nyan: Correct interrupt trigger type of temperature sensor
ARM: tegra: acer-a500: Add interrupt to temperature sensor node
ARM: tegra: nexus7: Add interrupt to temperature sensor node
ARM: tegra: paz00: Add interrupt to temperature sensor node
ARM: tegra: ouya: Add interrupt to temperature sensor node
ARM: tegra: Add SoC thermal sensor to Tegra30 device-trees
Link: https://lore.kernel.org/r/20210813162157.2820913-4-thierry.reding@gmail.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Clang + -Wimplicit-fallthrough warns:
drivers/gpu/drm/radeon/radeon_fb.c:170:2: warning: unannotated
fall-through between switch labels [-Wimplicit-fallthrough]
default:
^
drivers/gpu/drm/radeon/radeon_fb.c:170:2: note: insert 'break;' to avoid
fall-through
default:
^
break;
1 warning generated.
Clang's version of this warning is a little bit more pedantic than
GCC's. Add the missing break to satisfy it to match what has been done
all over the kernel tree.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This version brings along following fixes:
- Ensure DCN save init registers after VM setup
- Fix multi-display support for idle opt workqueue
- Use vblank control events for PSR enable/disable
- Create default dc_sink when fail reading EDID under MST
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Aric Cyr <aric.cyr@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Anthony Koo <Anthony.Koo@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
Compilation of the workqueue fails if not building with the DCN config
option set.
[How]
Guard calls to the flush with the DCN config option to fix the build.
Reviewed-by: Roman Li <Roman.Li@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
DM initializes VM context after DMCUB initialization.
This results in loss of DCN_VM_CONTEXT registers after z10.
[How]
Notify DMCUB when VM setup is complete, and have DMCUB
save init registers.
v2: squash in CONFIG_DRM_AMD_DC_DCN3_1 fix
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Jake Wang <haonan.wang2@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
DM initializes VM context after DMCUB initialization.
This results in loss of DCN_VM_CONTEXT registers after z10.
[How]
Notify DMCUB when VM setup is complete, and have DMCUB
save init registers.
v2: squash in CONFIG_DRM_AMD_DC_DCN3_1 fix
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Jake Wang <haonan.wang2@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN ] KFDSVMRangeTest.BasicSystemMemTest
[ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN ] KFDSVMRangeTest.SetGetAttributesTest
[ ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[ ] Setting/Getting atrributes
[ FAILED ]
the root cause is that svm work queue has not finished when svm_range_get_attr is called, thus
some garbage svm interval tree data make svm_range_get_attr get wrong result. Flush work queue before
iterate svm interval tree.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
change the workload type for some cards as it is needed.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 0979d43259.
Revert this because it does not apply to all the cards.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
PSR can disable the HUBP along with the OTG when PSR is active.
We'll hit a pageflip timeout when the OTG is disable because we're no
longer updating the CRTC vblank counter and the pflip high IRQ will
not fire on the flip.
In order to flip the page flip timeout occur we should modify the
enter/exit conditions to match DRM requirements.
[How]
Use our deferred handlers for DRM vblank control to notify DMCU(B)
when it can enable or disable PSR based on whether vblank is disabled or
enabled respectively.
We'll need to pass along the stream with the notification now because
we want to access the CRTC state while the CRTC is locked to get the
stream state prior to the commit.
Retain a reference to the stream so it remains safe to continue to
access and release that reference once we're done with it.
Enable/disable logic follows what we were previously doing in
update_planes.
The workqueue has to be flushed before programming streams or planes
to ensure that we exit out of idle optimizations and PSR before
these events occur if necessary.
To keep the skip count logic the same to avoid FBCON PSR enablement
requires copying the allow condition onto the DM IRQ parameters - a
field that we can actually access from the worker.
Reviewed-by: Roman Li <Roman.Li@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
The current implementation for idle optimization support only has a
single work item that gets reshuffled into the system workqueue
whenever we receive an enable or disable event.
We can have mismatched events if the work hasn't been processed or if
we're getting control events from multiple displays at once.
This fixes this issue and also makes the implementation usable for
PSR control - which will be addressed in another patch.
[How]
We need to be able to flush remaining work out on demand for driver stop
and psr disable so create a driver specific workqueue instead of using
the system one. The workqueue will be single threaded to guarantee the
ordering of enable/disable events.
Refactor the queue to allocate the control work and deallocate it
after processing it.
Pass the acrtc directly to make it easier to handle psr enable/disable
in a later patch.
Rename things to indicate that it's not just MALL specific.
Reviewed-by: Roman Li <Roman.Li@amd.com>
Acked-by: Wayne Lin <wayne.lin@amd.com>
Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]
While reading remote EDID via Startech 1-to-4 hub, occasionally we
won't get response in time and won't light up corresponding monitor.
Ideally, we can still add generic modes for userspace to choose to try
to light up the monitor and which is done in
drm_helper_probe_single_connector_modes(). So the main problem here is
that we fail .mode_valid since we don't create remote dc_sink for this
case.
[How]
Also add default dc_sink if we can't get the EDID.
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Wayne Lin <Wayne.Lin@amd.com>
Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
These registers have different address from other SMU V11 ASICs.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As the fan control was guarded under manual mode before fan speed
RPM/PWM setting. Thus the extra check is totally redundant.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Currently, the readout of fan speed pwm is transited into percent-based
and then pwm-based. However, the transition into percent-based is totally
unnecessary and make the final output less accurate.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The relationship "PWM = RPM / smu->fan_max_rpm" between fan speed
PWM and RPM is not true for SMU11 ASICs. So, we need a new way to
retrieving the fan speed RPM.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The relationship "PWM = RPM / smu->fan_max_rpm" between fan speed
PWM and RPM is not true for SMU11 ASICs. So, we need a new way to
retrieving the fan speed PWM.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As the relationship "PWM = RPM / smu->fan_max_rpm" between fan speed
PWM and RPM is not true for SMU11 ASICs. So, both the RPM and PWM
settings need to be saved.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The relationship "PWM = RPM / smu->fan_max_rpm" between fan speed
PWM and RPM is not true for SMU11 ASICs. So, we need a new way to
perform the fan speed RPM setting.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Delete ras_if->name in the RAS ctx structure and remove related lines.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use DEFINE_RES_MEM_NAMED() to save a couple of lines of code, which makes
the code a bit shorter and easier to read.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Asmaa Mnebhi <asmaa@nvidia.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Simplify the platform_get_resource() and devm_ioremap_resource()
calls with devm_platform_ioremap_resource().
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Asmaa Mnebhi <asmaa@nvidia.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
ACPI_PTR() is more harmful than helpful. For example, in this case
if CONFIG_ACPI=n, the ID table left unused which is not what we want.
Instead of adding ifdeffery here and there, drop ACPI_PTR() and
replace acpi.h with mod_devicetable.h.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Asmaa Mnehi <asmaa@nvidia.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN ] KFDSVMRangeTest.BasicSystemMemTest
[ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN ] KFDSVMRangeTest.SetGetAttributesTest
[ ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[ ] Setting/Getting atrributes
[ FAILED ]
the root cause is that svm work queue has not finished when svm_range_get_attr is called, thus
some garbage svm interval tree data make svm_range_get_attr get wrong result. Flush work queue before
iterate svm interval tree.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
change the workload type for some cards as it is needed.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit 0979d43259.
Revert this because it does not apply to all the cards.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
When guest received FLR notification from host, it would
lock adapter into reset state. There will be no more
job submission and hardware access after that.
Then it should send a response to host that it has prepared
for host reset.
Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
Signed-off-by: Peng Ju Zhou <PengJu.Zhou@amd.com>
Reviewed-by: Emily.Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
the following clock is only support voltage DPM, change attribute to RO:
1. pp_dpm_sclk
2. pp_dpm_mclk
3. pp_dpm_fclk
Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
the following message is allowed in sriov mode:
1. GetEnabledSmuFeaturesLow
2. GetEnabledSmuFeaturesHigh
Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1:
1. change return value to avoid smu driver probe fails when FEATURE_PPT is
not enabled.
2. if FEATURE_PPT is not enabled, set power limit value to 0.
v2:
instead dev_err with dev_warn
Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v1:
1. skip to load smu firmware in sriov mode for aldebaran chip
2. using vbios pptable if in sriov mode.
v2:
clean up smu driver code in sriov code path
Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
the following feature is wrong, it will cause sysnode of pp_features show error:
1. DPM_XGMI
2. VCN_DPM
Signed-off-by: Kevin Wang <kevin1.wang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.
How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Signed-off-by: Jack Zhang <Jack.Zhang7@hotmail.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modern-day mapping_set_error has the ability to squash the usual
negative error code into something appropriate for long-term storage in
a struct address_space -- ENOSPC becomes AS_ENOSPC, and everything else
becomes EIO. iomap squashes /everything/ to EIO, just as XFS did before
that, but this doesn't make sense.
Fix this by making it so that we can pass ENOSPC to userspace when
writeback fails due to space problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
We only use the CIL workqueue in the CIL, so it makes no sense to
hang it off the xfs_mount and have to walk multiple pointers back up
to the mount when we have the CIL structures right there.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.
The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL, once the context has been switched the rest of the
context push can run asynchrnously with all other context pushes.
Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.
However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, we can run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU bound bottleneck but
not enough to create new contention points or unbound concurrency
issues.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from AIL when the
checkpoint that marks the items stale is committed to the journal.
This results is relatively few items in the AIL, but those that are
are often pinned as directories items are being removed from are
still being logged.
As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:
....
399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
400.099623: xfsaild: first lsn 0x11002f3600
400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
400.589349: xfsaild: first lsn 0x1100305000
400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
400.950343: xfsaild: first lsn 0x1100317c00
400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
401.142334: xfsaild: first lsn 0x110032e600
401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
401.154328: xfsaild: first lsn 0x1100353000
401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
401.451526: xfsaild: first lsn 0x1100353000
401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....
In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. Each pass, a
log force was issued, and delay between last/first is the sleep time
+ the sync log force time.
Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions that stale items that never reach the
AIL but consume log space. Hence we have a largely empty AIL, with
long term pins on items that pin the tail of the log that don't get
pushed frequently enough to keep log space available.
The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.
Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.
We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.
One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Because log recovery depends on strictly ordered start records as
well as strictly ordered commit records.
This is a zero day bug in the way XFS writes pipelined transactions
to the journal which is exposed by fixing the zero day bug that
prevents the CIL from pipelining checkpoints. This re-introduces
explicit concurrent commits back into the on-disk journal and hence
out of order start records.
The XFS journal commit code has never ordered start records and we
have relied on strict commit record ordering for correct recovery
ordering of concurrently written transactions. Unfortunately, root
cause analysis uncovered the fact that log recovery uses the LSN of
the start record for transaction commit processing. Hence, whilst
the commits are processed in strict order by recovery, the LSNs
associated with the commits can be out of order and so recovery may
stamp incorrect LSNs into objects and/or misorder intents in the AIL
for later processing. This can result in log recovery failures
and/or on disk corruption, sometimes silent.
Because this is a long standing log recovery issue, we can't just
fix log recovery and call it good. This still leaves older kernels
susceptible to recovery failures and corruption when replaying a log
from a kernel that pipelines checkpoints. There is also the issue
that in-memory ordering for AIL pushing and data integrity
operations are based on checkpoint start LSNs, and if the start LSN
is incorrect in the journal, it is also incorrect in memory.
Hence there's really only one choice for fixing this zero-day bug:
we need to strictly order checkpoint start records in ascending
sequence order in the log, the same way we already strictly order
commit records.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Now that we have a mechanism to guarantee that the callbacks
attached to an iclog are owned by the context that attaches them
until they drop their reference to the iclog via
xlog_state_release_iclog(), we can attach callbacks to the iclog at
any time we have an active reference to the iclog.
xlog_state_get_iclog_space() always guarantees that the commit
record will fit in the iclog it returns, so we can move this IO
callback setting to xlog_cil_set_ctx_write_state(), record the
commit iclog in the context and remove the need for the commit iclog
to be returned by xlog_write() altogether.
This, in turn, allows us to move the wakeup for ordered commit
record writes up into xlog_cil_set_ctx_write_state(), too, because
we have been guaranteed that this commit record will be physically
located in the iclog before any waiting commit record at a higher
sequence number will be granted iclog space.
This further cleans up the post commit record write processing in
the CIL push code, especially as xlog_state_release_iclog() will now
clean up the context when shutdown errors occur.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
So we can use it for start record ordering as well as commit record
ordering in future.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Pass the CIL context to xlog_write() rather than a pointer to a LSN
variable. Only the CIL checkpoint calls to xlog_write() need to know
about the start LSN of the writes, so rework xlog_write to directly
write the LSNs into the CIL context structure.
This removes the commit_lsn variable from xlog_cil_push_work(), so
now we only have to issue the commit record ordering wakeup from
there.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
It is only used by the CIL checkpoints, and is the counterpart to
start record formatting and writing that is already local to
xfs_log_cil.c.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>