History log of /linux/drivers/thermal/thermal_debugfs.c (Results 1 – 17 of 17)
Revision Date Author Comments
# 5a599e10 23-May-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Allow tze_seq_show() to print statistics for invalid trips

Commit a6258fde8de3 ("thermal/debugfs: Make tze_seq_show() skip invalid
trips and trips with no stats") modified tze_seq_show() to skip invalid
trips, but it overlooked the fact that a trip may become invalid during
a mitigation episode involving it, in which case its statistics should
still be reported.

For this reason, remove the invalid trip temperature check from the
main loop in tze_seq_show().

The trips that have never been valid will still be skipped after this
change because there are no statistics to report for them.

Fixes: a6258fde8de3 ("thermal/debugfs: Make tze_seq_show() skip invalid trips and trips with no stats")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



# 9e69acc1 23-May-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Print initial trip temperature and hysteresis in tze_seq_show()

The temperature and hysteresis of a trip point may change during a
mitigation episode it is involved in (it may even become invalid
altogether), so in order to avoid possible confusion related to that,
store the temperature and hysteresis of trip points at the time they
are crossed on the way up and print those values instead of their
current temperature and hysteresis.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
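
The idea of printing the values captured at crossing time can be sketched
in user-space C as follows. This is only an illustration under stated
assumptions: struct trip_sketch, struct crossing_record and
record_crossing() are hypothetical names, not the kernel's own.

```c
/* Hypothetical, simplified trip point: values may change at any time. */
struct trip_sketch {
	int temperature;	/* millidegrees Celsius */
	int hysteresis;		/* millidegrees Celsius */
};

/* Hypothetical per-episode record holding a snapshot of the trip. */
struct crossing_record {
	int trip_temp;
	int trip_hyst;
};

/*
 * Copy the trip parameters when the trip is crossed on the way up, so
 * later changes to the trip do not alter what the report prints.
 */
static void record_crossing(struct crossing_record *rec,
			    const struct trip_sketch *trip)
{
	rec->trip_temp = trip->temperature;
	rec->trip_hyst = trip->hysteresis;
}
```

A reporting function would then read rec->trip_temp and rec->trip_hyst
rather than the live trip fields.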



# bd700ba9 25-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Avoid printing zero duration for mitigation events in progress

If a thermal mitigation event is in progress, its duration value has
not been updated yet, so 0 will be printed as the event duration by
tze_seq_show() which is confusing.

Avoid doing that by marking the beginning of the event with the
KTIME_MIN duration value and making tze_seq_show() compute the current
event duration on the fly, in which case '>' will be printed instead of
'=' in the event duration value field.

Similarly, for trip points that have been crossed on the way down, mark
the end of mitigation with the KTIME_MAX timestamp value and make
tze_seq_show() compute the current duration on the fly for the trip
points still involved in the mitigation, in which case the duration
value printed by it will be prepended with a '>' character.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
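
The sentinel approach described above can be sketched in user-space C.
This is a minimal illustration, not the kernel code: struct episode,
DURATION_IN_PROGRESS and format_duration() are hypothetical names, and
INT64_MIN merely stands in for KTIME_MIN.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for KTIME_MIN: marks an episode that has not ended yet. */
#define DURATION_IN_PROGRESS INT64_MIN

struct episode {
	int64_t start_ms;	/* time at which the episode began */
	int64_t duration_ms;	/* DURATION_IN_PROGRESS until it ends */
};

/*
 * Format the duration field. A finished episode prints "duration=N",
 * an episode still in progress prints "duration>N", where N is
 * computed on the fly from the current time.
 */
static int format_duration(const struct episode *ep, int64_t now_ms,
			   char *buf, size_t len)
{
	int64_t duration = ep->duration_ms;
	char sep = '=';

	if (duration == DURATION_IN_PROGRESS) {
		duration = now_ms - ep->start_ms;
		sep = '>';
	}

	return snprintf(buf, len, "duration%c%" PRId64 "ms", sep, duration);
}
```

Using a sentinel that no real duration can take avoids a separate
"in progress" flag and keeps 0 available as a legitimate value.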



# 31a0fa00 25-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()

If cdev_dt_seq_show() runs before the first state transition of a cooling
device, it will not print any state residency information for it, even
though it might be reasonably expected to print residency information for
the initial state of the cooling device.

For this reason, rearrange the code to get the initial state of a cooling
device at the registration time and pass it to thermal_debug_cdev_add(),
so that the latter can create a duration record for that state which will
allow cdev_dt_seq_show() to print its residency information.

Fixes: 755113d76786 ("thermal/debugfs: Add thermal cooling device debugfs information")
Reported-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>



# f4ae18fc 25-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Create records for cdev states as they get used

Because thermal_debug_cdev_state_update() only creates a duration record
for the old state of a cooling device, if its new state is used for the
first time, there will be no record for it and cdev_dt_seq_show() will
not print the duration information for it even though it contains code
to compute the duration value in that case.

Address this by making thermal_debug_cdev_state_update() create a
duration record for the new state if there is none.

Fixes: 755113d76786 ("thermal/debugfs: Add thermal cooling device debugfs information")
Reported-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>



# d351eb0a 26-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Prevent use-after-free from occurring after cdev removal

Since thermal_debug_cdev_remove() does not run under cdev->lock, it can
run in parallel with thermal_debug_cdev_state_update() and it may free
the struct thermal_debugfs object used by the latter after it has been
checked against NULL.

If that happens, thermal_debug_cdev_state_update() will access memory
that has been freed already causing the kernel to crash.

Address this by using cdev->lock in thermal_debug_cdev_remove() around
the cdev->debugfs value check (in case the same cdev is removed at the
same time in two different threads) and its reset to NULL.

Fixes: 755113d76786 ("thermal/debugfs: Add thermal cooling device debugfs information")
Cc: 6.8+ <stable@vger.kernel.org> # 6.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
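
The locking pattern described above (check the pointer and reset it to
NULL under the same lock that updaters hold) can be sketched in
user-space C. This is an assumption-laden illustration: a pthread mutex
stands in for cdev->lock, and struct cdev_sketch, cdev_debug_update()
and cdev_debug_remove() are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical debug object attached to a cooling device. */
struct debug_data {
	int last_state;
};

/* Hypothetical, simplified cooling device. */
struct cdev_sketch {
	pthread_mutex_t lock;
	struct debug_data *debugfs;
};

/* Updater: only dereferences the object while holding the lock. */
static int cdev_debug_update(struct cdev_sketch *cdev, int state)
{
	int updated = 0;

	pthread_mutex_lock(&cdev->lock);
	if (cdev->debugfs) {
		cdev->debugfs->last_state = state;
		updated = 1;
	}
	pthread_mutex_unlock(&cdev->lock);

	return updated;
}

/*
 * Remover: detach the object under the lock, then free it outside.
 * Returns 1 if this call freed the object, 0 if it was already gone,
 * so two concurrent removals cannot both free it.
 */
static int cdev_debug_remove(struct cdev_sketch *cdev)
{
	struct debug_data *dbg;

	pthread_mutex_lock(&cdev->lock);
	dbg = cdev->debugfs;
	cdev->debugfs = NULL;
	pthread_mutex_unlock(&cdev->lock);

	if (!dbg)
		return 0;

	free(dbg);
	return 1;
}
```

Because the updater and the remover serialize on the same lock, an
updater either sees a valid pointer for the whole critical section or
sees NULL and skips the update.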



# c7f7c372 25-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Fix two locking issues with thermal zone debug

With the current thermal zone locking arrangement in the debugfs code,
user space can open the "mitigations" file for a thermal zone before
the zone's debugfs pointer is set which will result in a NULL pointer
dereference in tze_seq_start().

Moreover, thermal_debug_tz_remove() is not called under the thermal
zone lock, so it can run in parallel with the other functions accessing
the thermal zone's struct thermal_debugfs object. Then, it may clear
tz->debugfs after one of those functions has checked it and the
struct thermal_debugfs object may be freed prematurely.

To address the first problem, pass a pointer to the thermal zone's
struct thermal_debugfs object to debugfs_create_file() in
thermal_debug_tz_add() and make tze_seq_start(), tze_seq_next(),
tze_seq_stop(), and tze_seq_show() retrieve it from s->private
instead of a pointer to the thermal zone object. This will ensure
that tz_debugfs will be valid across the "mitigations" file accesses
until thermal_debugfs_remove_id() called by thermal_debug_tz_remove()
removes that file.

To address the second problem, use tz->lock in thermal_debug_tz_remove()
around the tz->debugfs value check (in case the same thermal zone is
removed at the same time in two different threads) and its reset to NULL.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Cc: 6.8+ <stable@vger.kernel.org> # 6.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>



# 72c1afff 25-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Free all thermal zone debug memory on zone removal

Because thermal_debug_tz_remove() does not free all memory allocated for
thermal zone diagnostics, some of that memory becomes unreachable after
freeing the thermal zone's struct thermal_debugfs object.

Address this by making thermal_debug_tz_remove() free all of the memory
in question.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Cc: 6.8+ <stable@vger.kernel.org> # 6.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>



# a6258fde 17-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Make tze_seq_show() skip invalid trips and trips with no stats

Currently, tze_seq_show() output includes all of the trips in the zone
except for critical ones, including invalid trips and trips with no stats,
which is confusing.

Make it skip the trips for which there is no mitigation information.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>



# 8dff6e84 17-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Rename thermal_debug_update_temp() to thermal_debug_update_trip_stats()

Rename thermal_debug_update_temp() to thermal_debug_update_trip_stats()
which is a better match for the purpose of the function.

No functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>



# e271f997 17-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Clean up thermal_debug_update_temp()

Notice that it is not necessary to compute tze in every iteration of the
for () loop in thermal_debug_update_temp() because it is the same for all
trips, so compute it once before the loop starts.

Also use a trip_stats local variable to make the code in that loop easier
to follow and move the trip_id variable definition into that loop because
it is not used elsewhere in the function.

While at it, change the order of local variable definitions in the function
to follow the reverse-xmas-tree pattern.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>



# 0a293c77 17-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Avoid excessive updates of trip point statistics

Since thermal_debug_update_temp() is called before invoking
thermal_debug_tz_trip_down() for the trips that were crossed by the
zone temperature on the way up, it updates the statistics for them
as though the current zone temperature was above the low temperature
of each of them. However, if a given trip has just been crossed on the
way down, the zone temperature is in fact below its low temperature,
but this is handled by thermal_debug_tz_trip_down() running after the
update of the trip statistics.

The remedy is to call thermal_debug_update_temp() after
thermal_debug_tz_trip_down() has been invoked for all of the
trips in question, but then thermal_debug_tz_trip_up() needs to
be adjusted, so it does not update the statistics for the trips
that have just been crossed on the way up, as that will be taken
care of by thermal_debug_update_temp() down the road.

Modify the code accordingly.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>



# b552f63c 15-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal/debugfs: Add missing count increment to thermal_debug_tz_trip_up()

The count field in struct trip_stats, representing the number of times
the zone temperature was above the trip point, needs to be incremented
in thermal_debug_tz_trip_up(), for two reasons.

First, if a trip point is crossed on the way up for the first time,
thermal_debug_update_temp() called from update_temperature() does
not see it because it has not been added to trips_crossed[] array
in the thermal zone's struct tz_debugfs object yet. Therefore, when
thermal_debug_tz_trip_up() is called after that, the trip point's
count value is 0, and the attempt to divide by it during the average
temperature computation leads to a divide error which causes the kernel
to crash. Setting the count to 1 before the division by incrementing it
fixes this problem.

Second, if a trip point is crossed on the way up, but it has been
crossed on the way up already before, its count value needs to be
incremented to make a record of the fact that the zone temperature is
above the trip now. Without doing that, if the mitigations applied
after crossing the trip cause the zone temperature to drop below its
threshold, the count will not be updated for this episode at all and
the average temperature in the trip statistics record will be somewhat
higher than it should be.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Cc: 6.8+ <stable@vger.kernel.org> # 6.8+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
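
The first failure mode comes down to dividing by a count that is still
zero. A minimal user-space sketch of an average update that increments
the count before dividing is shown here. The incremental-mean formula,
struct trip_stats_sketch and trip_stats_add_sample() are assumptions
made for illustration and are not quoted from the kernel source.

```c
/* Hypothetical, simplified per-trip statistics. */
struct trip_stats_sketch {
	int count;	/* number of samples taken above the trip */
	int avg;	/* running average, millidegrees Celsius */
};

/*
 * Fold one temperature sample into the running average. The count is
 * incremented first, so the divisor is at least 1 even for the very
 * first sample.
 */
static void trip_stats_add_sample(struct trip_stats_sketch *ts, int temp)
{
	ts->count++;
	ts->avg += (temp - ts->avg) / ts->count;
}
```

With the increment omitted, the first call on a zeroed record would
divide by zero, which is the crash described above.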



# daeeb032 02-Apr-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

thermal: core: Move threshold out of struct thermal_trip

The threshold field in struct thermal_trip is only used internally by
the thermal core and it is better to prevent drivers from misusing it.
It also takes some space unnecessarily in the trip tables passed by
drivers to the core during thermal zone registration.

For this reason, introduce struct thermal_trip_desc as a wrapper around
struct thermal_trip, move the threshold field directly into it and make
the thermal core store struct thermal_trip_desc objects in the internal
thermal zone trip tables. Adjust all of the code using trip tables in
the thermal core accordingly.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
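
The wrapper arrangement can be sketched in user-space C as follows. The
struct thermal_trip shown here is heavily simplified, the member name
"trip" is an assumption, and trip_to_desc() is a hypothetical helper
added only to show how core code could get from a trip back to its
descriptor.

```c
#include <stddef.h>

/* Simplified stand-in for the trip description that drivers provide. */
struct thermal_trip {
	int temperature;
	int hysteresis;
};

/* Core-internal wrapper: the threshold is no longer visible to drivers. */
struct thermal_trip_desc {
	struct thermal_trip trip;
	int threshold;
};

/* Hypothetical helper: recover the wrapper from the embedded trip. */
static struct thermal_trip_desc *trip_to_desc(struct thermal_trip *trip)
{
	return (struct thermal_trip_desc *)
		((char *)trip - offsetof(struct thermal_trip_desc, trip));
}
```

Drivers keep passing plain struct thermal_trip tables, while the core
stores the larger descriptor objects and keeps the threshold private.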



# 6dcb3508 12-Jan-2024 Dan Carpenter <dan.carpenter@linaro.org>

thermal/debugfs: Unlock on error path in thermal_debug_tz_trip_up()

Add a missing mutex_unlock(&thermal_dbg->lock) to this error path.

Fixes: 7ef01f228c9f ("thermal/debugfs: Add thermal debugfs information for mitigation episodes")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
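
The class of bug fixed here, returning from an error path with a lock
still held, is commonly avoided with a single unlock label. The sketch
below is a user-space illustration with a pthread mutex; struct
dbg_sketch, dbg_trip_up() and the fail_alloc parameter are hypothetical
and exist only to exercise the error path.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical debug object protected by its own lock. */
struct dbg_sketch {
	pthread_mutex_t lock;
	int *record;
};

/*
 * Allocate a new record under the lock. Every exit after the lock is
 * taken goes through the "unlock" label, so the error path cannot
 * leave the mutex held.
 */
static int dbg_trip_up(struct dbg_sketch *d, int fail_alloc)
{
	int ret = 0;
	int *rec;

	pthread_mutex_lock(&d->lock);

	/* fail_alloc simulates an allocation failure for illustration. */
	rec = fail_alloc ? NULL : malloc(sizeof(*rec));
	if (!rec) {
		ret = -ENOMEM;
		goto unlock;
	}

	*rec = 1;
	free(d->record);
	d->record = rec;

unlock:
	pthread_mutex_unlock(&d->lock);
	return ret;
}
```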



# 7ef01f22 09-Jan-2024 Daniel Lezcano <daniel.lezcano@linaro.org>

thermal/debugfs: Add thermal debugfs information for mitigation episodes

The mitigation episodes are recorded. A mitigation episode happens
when the first trip point is crossed on the way up and then on the way
down. During this episode other trip points can be crossed as well and
are accounted for in this mitigation episode. The interesting information
is the average temperature at the trip point, the undershoot and the
overshoot. The standard deviation of the mitigated temperature will be
added later.

The thermal debugfs directory structure tries to stay consistent with
the sysfs one but in a very simplified way:

thermal/
 `-- thermal_zones
     |-- 0
     |   `-- mitigations
     `-- 1
         `-- mitigations

The content of the mitigations file has the following format:

,-Mitigation at 349988258us, duration=130136ms
| trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) |
| 0 | passive | 65000 | 2000 | 130136 | 68227 | 62500 | 75625 |
| 1 | passive | 75000 | 2000 | 104209 | 74857 | 71666 | 77500 |
,-Mitigation at 272451637us, duration=75000ms
| trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) |
| 0 | passive | 65000 | 2000 | 75000 | 68561 | 62500 | 75000 |
| 1 | passive | 75000 | 2000 | 60714 | 74820 | 70555 | 77500 |
,-Mitigation at 238184119us, duration=27316ms
| trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) |
| 0 | passive | 65000 | 2000 | 27316 | 73377 | 62500 | 75000 |
| 1 | passive | 75000 | 2000 | 19468 | 75284 | 69444 | 77500 |
,-Mitigation at 39863713us, duration=136196ms
| trip | type | temp(°mC) | hyst(°mC) | duration | avg(°mC) | min(°mC) | max(°mC) |
| 0 | passive | 65000 | 2000 | 136196 | 73922 | 62500 | 75000 |
| 1 | passive | 75000 | 2000 | 91721 | 74386 | 69444 | 78125 |
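
A user-space reader of this file could pick out the per-episode header
lines with sscanf(). The sketch below assumes exactly the header layout
shown above ("Mitigation at <N>us, duration=<M>ms"); struct
mitigation_header and parse_mitigation_header() are hypothetical names.

```c
#include <stdio.h>

/* Hypothetical holder for the values found in one header line. */
struct mitigation_header {
	unsigned long long start_us;	/* episode start timestamp */
	long long duration_ms;		/* episode duration */
};

/*
 * Parse one line of the mitigations file. Returns 0 if the line is an
 * episode header in the layout shown above, -1 otherwise (for example
 * for the column header and per-trip rows).
 */
static int parse_mitigation_header(const char *line,
				   struct mitigation_header *hdr)
{
	int n = sscanf(line, " ,-Mitigation at %lluus, duration=%lldms",
		       &hdr->start_us, &hdr->duration_ms);

	return n == 2 ? 0 : -1;
}
```

For the first header in the sample output this yields start_us equal to
349988258 and duration_ms equal to 130136.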

More information for a better understanding of the thermal behavior
will be added later. The idea is to give detailed statistics
about the undershoots and overshoots, the speed of temperature change,
etc. As all of that information in a single file would be too much, the
idea would be to create a directory named after the mitigation timestamp
where all of the data could be added.

Please note that this code is robust against trip ordering but not
against a trip temperature change while a mitigation is in progress.
However, this situation should be extremely rare, if it happens at all,
and it is worth asking whether something should be done in the core
framework for other components first.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: White space fixups, rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



# 755113d7 09-Jan-2024 Daniel Lezcano <daniel.lezcano@linaro.org>

thermal/debugfs: Add thermal cooling device debugfs information

The thermal framework does not have any debug information except a
sysfs stat which is a bit controversial. This one allocates big chunks
of memory for every cooling device with a high number of states and
could represent, on some systems in production, several megabytes of
memory for just a portion of it. As sysfs output is limited to a page
size, the output is not usable with large data arrays and gets
truncated.

The patch provides the same information as sysfs except that the
transitions are dynamically allocated, thus they won't show more
events than the ones which actually occurred. There is no longer a
size limitation and it opens the field for more debugging information,
which is what debugfs is designed for, not sysfs.

The thermal debugfs directory structure tries to stay consistent with
the sysfs one but in a very simplified way:

thermal/
 `-- cooling_devices
     |-- 0
     |   |-- clear
     |   |-- time_in_state_ms
     |   |-- total_trans
     |   `-- trans_table
     |-- 1
     |   |-- clear
     |   |-- time_in_state_ms
     |   |-- total_trans
     |   `-- trans_table
     |-- 2
     |   |-- clear
     |   |-- time_in_state_ms
     |   |-- total_trans
     |   `-- trans_table
     |-- 3
     |   |-- clear
     |   |-- time_in_state_ms
     |   |-- total_trans
     |   `-- trans_table
     `-- 4
         |-- clear
         |-- time_in_state_ms
         |-- total_trans
         `-- trans_table

The content of the files in the cooling devices directory is the same
as the sysfs one except for the trans_table which has the following
format:

Transition Hits
1->0 246
0->1 246
2->1 632
1->2 632
3->2 98
2->3 98
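
The point about allocating transition records dynamically, so that only
transitions that actually occurred take up memory, can be sketched in
user-space C with a singly linked list. struct transition,
transition_hit() and transitions_free() are hypothetical names and the
list layout is an assumption, not the kernel's data structure.

```c
#include <stdlib.h>

/* Hypothetical record for one observed "from->to" state transition. */
struct transition {
	int from;
	int to;
	unsigned long hits;
	struct transition *next;
};

/*
 * Count one from->to transition, allocating its record on first use.
 * Returns the record, or NULL if the allocation failed.
 */
static struct transition *transition_hit(struct transition **head,
					 int from, int to)
{
	struct transition *t;

	for (t = *head; t; t = t->next)
		if (t->from == from && t->to == to)
			break;

	if (!t) {
		t = calloc(1, sizeof(*t));
		if (!t)
			return NULL;
		t->from = from;
		t->to = to;
		t->next = *head;
		*head = t;
	}

	t->hits++;
	return t;
}

/* Release every record in the list. */
static void transitions_free(struct transition *head)
{
	while (head) {
		struct transition *next = head->next;

		free(head);
		head = next;
	}
}
```

Memory use then scales with the number of distinct transitions seen
rather than with the square of the number of states, which is the
contrast with a preallocated table that the message draws.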

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ rjw: White space fixups, rebase ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
