#
312be9fc |
| 01-Apr-2024 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
The MSR_PEBS_DATA_CFG MSR register is used to configure which data groups should be generated into a PEBS record, and it's shar
perf/x86/intel/ds: Don't clear ->pebs_data_cfg for the last PEBS event
The MSR_PEBS_DATA_CFG MSR register is used to configure which data groups should be generated into a PEBS record, and it's shared among all counters.
If there are different configurations among counters, perf combines all the configurations.
The first perf command as below requires a complete PEBS record (including memory info, GPRs, XMMs, and LBRs). The second perf command only requires a basic group. However, after the second perf command is running, the MSR_PEBS_DATA_CFG register is cleared. Only a basic group is generated in a PEBS record, which is wrong. The required information for the first perf command is missed.
$ perf record --intr-regs=AX,SP,XMM0 -a -C 8 -b -W -d -c 100000003 -o /dev/null -e cpu/event=0xd0,umask=0x81/upp & $ sleep 5 $ perf record --per-thread -c 1 -e cycles:pp --no-timestamp --no-tid taskset -c 8 ./noploop 1000
The first PEBS event is a system-wide PEBS event. The second PEBS event is a per-thread event. When the thread is scheduled out, the intel_pmu_pebs_del() function is invoked to update the PEBS state. Since the system-wide event is still available, the cpuc->n_pebs is 1. The cpuc->pebs_data_cfg is cleared. The data configuration for the system-wide PEBS event is lost.
The (cpuc->n_pebs == 1) check was introduced in commit:
b6a32f023fcc ("perf/x86: Fix PEBS threshold initialization")
At that time, it indeed didn't hurt whether the state was updated during the removal, because only the threshold is updated.
The calculation of the threshold takes the last PEBS event into account.
However, since commit:
b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
we delay the threshold update, and clear the PEBS data config, which triggers the bug.
The PEBS data config update scope should not be shrunk during removal.
[ mingo: Improved the changelog & comments. ]
Fixes: b752ea0c28e3 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240401133320.703971-1-kan.liang@linux.intel.com
show more ...
|
#
154fcf3a |
| 04-Mar-2024 |
Thomas Gleixner <tglx@linutronix.de> |
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
To clean up the per CPU insanity of UP which causes sparse to be rightfully unhappy and prevents the usage of the generic per CPU acc
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
To clean up the per CPU insanity of UP which causes sparse to be rightfully unhappy and prevents the usage of the generic per CPU accessors on cpu_info it is necessary to include <linux/percpu.h> into <asm/msr.h>.
Including <linux/percpu.h> into <asm/msr.h> is impossible because it ends up in header dependency hell. The problem is that <asm/processor.h> includes <asm/msr.h>. The inclusion of <linux/percpu.h> results in a compile fail where the compiler cannot longer handle an include in <asm/cpufeature.h> which references boot_cpu_data which is defined in <asm/processor.h>.
The only reason why <asm/msr.h> is included in <asm/processor.h> are the set/get_debugctlmsr() inlines. They are defined there because <asm/processor.h> is such a nice dump ground for everything. In fact they belong obviously into <asm/debugreg.h>.
Move them to <asm/debugreg.h> and fix up the resulting damage which is just exposing the reliance on random include chains.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240304005104.454678686@linutronix.de
show more ...
|
#
33744916 |
| 25-Oct-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Support branch counters logging
The branch counters logging (A.K.A LBR event logging) introduces a per-counter indication of precise event occurrences in LBRs. It can provide a means
perf/x86/intel: Support branch counters logging
The branch counters logging (A.K.A LBR event logging) introduces a per-counter indication of precise event occurrences in LBRs. It can provide a means to attribute exposed retirement latency to combinations of events across a block of instructions. It also provides a means of attributing Timed LBR latencies to events.
The feature is first introduced on SRF/GRR. It is an enhancement of the ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences of events on the GP counters. The information is displayed by the order of counters.
The design proposed in this patch requires that the events which are logged must be in a group with the event that has LBR. If there are more than one LBR group, the counters logging information only from the current group (overflowed) are stored for the perf tool, otherwise the perf tool cannot know which and when other groups are scheduled especially when multiplexing is triggered. The user can ensure it uses the maximum number of counters that support LBR info (4 by now) by making the group large enough.
The HW only logs events by the order of counters. The order may be different from the order of enabling which the perf tool can understand. When parsing the information of each branch entry, convert the counter order to the enabled order, and store the enabled order in the extension space.
Unconditionally reset LBRs for an LBR event group when it's deleted. The logged counter information is only valid for the current LBR group. If another LBR group is scheduled later, the information from the stale LBRs would be otherwise wrongly interpreted.
Add a sanity check in intel_pmu_hw_config(). Disable the feature if other counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode is enabled. (For the LBR call stack mode, we cannot simply flush the LBR, since it will break the call stack. Also, there is no obvious usage with the call stack mode for now.)
Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch stack setup.
Expose the maximum number of supported counters and the width of the counters into the sysfs. The perf tool can use the information to parse the logged counters in each branch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231025201626.3000228-5-kan.liang@linux.intel.com
show more ...
|
#
571d91dc |
| 25-Oct-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf: Add branch stack counters
Currently, the additional information of a branch entry is stored in a u64 space. With more and more information added, the space is running out. For example, the inf
perf: Add branch stack counters
Currently, the additional information of a branch entry is stored in a u64 space. With more and more information added, the space is running out. For example, the information of occurrences of events will be added for each branch.
Two places were suggested to append the counters. https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/ One place is right after the flags of each branch entry. It changes the existing struct perf_branch_entry. The later ARCH specific implementation has to be really careful to consistently pick the right struct. The other place is right after the entire struct perf_branch_stack. The disadvantage is that the pointer of the extra space has to be recorded. The common interface perf_sample_save_brstack() has to be updated.
The latter is much straightforward, and should be easily understood and maintained. It is implemented in the patch.
Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate the event which is recorded in the branch info.
The "u64 counters" may store the occurrences of several events. The information regarding the number of events/counters and the width of each counter should be exposed via sysfs as a reference for the perf tool. Define the branch_counter_nr and branch_counter_width ABI here. The support will be implemented later in the Intel-specific patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
show more ...
|
#
b0560bfd |
| 29-Aug-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Clean up the hybrid CPU type handling code
There is a fairly long list of grievances about the current code. The main beefs:
1. hybrid_big_small assumes that the *HARDWARE* (CPUI
perf/x86/intel: Clean up the hybrid CPU type handling code
There is a fairly long list of grievances about the current code. The main beefs:
1. hybrid_big_small assumes that the *HARDWARE* (CPUID) provided core types are a bitmap. They are not. If Intel happened to make a core type of 0xff, hilarity would ensue. 2. adl_get_hybrid_cpu_type() utterly inscrutable. There are precisely zero comments and zero changelog about what it is attempting to do.
According to Kan, the adl_get_hybrid_cpu_type() is there because some Alder Lake (ADL) CPUs can do some silly things. Some ADL models are *supposed* to be hybrid CPUs with big and little cores, but there are some SKUs that only have big cores. CPUID(0x1a) on those CPUs does not say that the CPUs are big cores. It apparently just returns 0x0. It confuses perf because it expects to see either 0x40 (Core) or 0x20 (Atom).
The perf workaround for this is to watch for a CPU core saying it is type 0x0. If that happens on an Alder Lake, it calls x86_pmu.get_hybrid_cpu_type() and just assumes that the core is a Core (0x40) CPU.
To fix up the mess, separate out the CPU types and the 'pmu' types. This allows 'hybrid_pmu_type' bitmaps without worrying that some future CPU type will set multiple bits.
Since the types are now separate, add a function to glue them back together again. Actual comment on the situation in the glue function (find_hybrid_pmu_for_cpu()).
Also, give ->get_hybrid_cpu_type() a real return type and make it clear that it is overriding the *CPU* type, not the PMU type.
Rename cpu_type to pmu_type in the struct x86_hybrid_pmu to reflect the change.
Originally-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230829125806.3016082-6-kan.liang@linux.intel.com
show more ...
|
#
d4b5694c |
| 29-Aug-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Use the common uarch name for the shared functions
From PMU's perspective, the SPR/GNR server has a similar uarch to the ADL/MTL client p-core. Many functions are shared. However, th
perf/x86/intel: Use the common uarch name for the shared functions
From PMU's perspective, the SPR/GNR server has a similar uarch to the ADL/MTL client p-core. Many functions are shared. However, the shared function name uses the abbreviation of the server product code name, rather than the common uarch code name.
Rename these internal shared functions by the common uarch name.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230829125806.3016082-2-kan.liang@linux.intel.com
show more ...
|
#
a430021f |
| 22-May-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Add Crestmont PMU
The Grand Ridge and Sierra Forest are successors to Snow Ridge. They both have Crestmont core. From the core PMU's perspective, they are similar to the e-core of MT
perf/x86/intel: Add Crestmont PMU
The Grand Ridge and Sierra Forest are successors to Snow Ridge. They both have Crestmont core. From the core PMU's perspective, they are similar to the e-core of MTL. The only difference is the LBR event logging feature, which will be implemented in the following patches.
Create a non-hybrid PMU setup for Grand Ridge and Sierra Forest.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/20230522113040.2329924-1-kan.liang@linux.intel.com
show more ...
|
#
b752ea0c |
| 21-Apr-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
Several similar kernel warnings can be triggered,
[56605.607840] CPU0 PEBS record size 0, expected 32, config 0 cpuc->record_size=208
perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG
Several similar kernel warnings can be triggered,
[56605.607840] CPU0 PEBS record size 0, expected 32, config 0 cpuc->record_size=208
when the below commands are running in parallel for a while on SPR.
while true; do perf record --no-buildid -a --intr-regs=AX \ -e cpu/event=0xd0,umask=0x81/pp \ -c 10003 -o /dev/null ./triad; done &
while true; do perf record -o /tmp/out -W -d \ -e '{ld_blocks.store_forward:period=1000000, \ MEM_TRANS_RETIRED.LOAD_LATENCY:u:precise=2:ldlat=4}' \ -c 1037 ./triad; done
The triad program is just the generation of loads/stores.
The warnings are triggered when an unexpected PEBS record (with a different config and size) is found.
A system-wide PEBS event with the large PEBS config may be enabled during a context switch. Some PEBS records for the system-wide PEBS may be generated while the old task is sched out but the new one hasn't been sched in yet. When the new task is sched in, the cpuc->pebs_record_size may be updated for the per-task PEBS events. So the existing system-wide PEBS records have a different size from the later PEBS records.
The PEBS buffer should be flushed right before the hardware is reprogrammed. The new size and threshold should be updated after the old buffer has been flushed.
Reported-by: Stephane Eranian <eranian@google.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230421184529.3320912-1-kan.liang@linux.intel.com
show more ...
|
#
89e97eb8 |
| 25-Jan-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel/ds: Fix the conversion from TSC to perf time
The time order is incorrect when the TSC in a PEBS record is used.
$perf record -e cycles:upp dd if=/dev/zero of=/dev/null count=10000
perf/x86/intel/ds: Fix the conversion from TSC to perf time
The time order is incorrect when the TSC in a PEBS record is used.
$perf record -e cycles:upp dd if=/dev/zero of=/dev/null count=10000 $ perf script --show-task-events perf-exec 0 0.000000: PERF_RECORD_COMM: perf-exec:915/915 dd 915 106.479872: PERF_RECORD_COMM exec: dd:915/915 dd 915 106.483270: PERF_RECORD_EXIT(915:915):(914:914) dd 915 106.512429: 1 cycles:upp: ffffffff96c011b7 [unknown] ([unknown]) ... ...
The perf time is from sched_clock_cpu(). The current PEBS code unconditionally convert the TSC to native_sched_clock(). There is a shift between the two clocks. If the TSC is stable, the shift is consistent, __sched_clock_offset. If the TSC is unstable, the shift has to be calculated at runtime.
This patch doesn't support the conversion when the TSC is unstable. The TSC unstable case is a corner case and very unlikely to happen. If it happens, the TSC in a PEBS record will be dropped and fall back to perf_event_clock().
Fixes: 47a3aeb39e8d ("perf/x86/intel/pebs: Fix PEBS timestamps overwritten") Reported-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/CAM9d7cgWDVAq8-11RbJ2uGfwkKD6fA-OMwOKDrNUrU_=8MgEjg@mail.gmail.com/
show more ...
|
#
13738a36 |
| 09-Nov-2022 |
Like Xu <likexu@tencent.com> |
perf/x86/intel: Expose EPT-friendly PEBS for SPR and future models
According to Intel SDM, the EPT-friendly PEBS is supported by all the platforms after ICX, ADL and the future platforms with PEBS f
perf/x86/intel: Expose EPT-friendly PEBS for SPR and future models
According to Intel SDM, the EPT-friendly PEBS is supported by all the platforms after ICX, ADL and the future platforms with PEBS format 5.
Currently the only in-kernel user of this capability is KVM, which has very limited support for hybrid core pmu, so ADL and its successors do not currently expose this capability. When both hybrid core and PEBS format 5 are present, KVM will decide on its own merits.
Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-perf-users@vger.kernel.org Suggested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221109082802.27543-4-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
show more ...
|
#
f6e70715 |
| 18-Jan-2023 |
Namhyung Kim <namhyung@kernel.org> |
perf/core: Introduce perf_prepare_header()
Factor out perf_prepare_header() so that it can call perf_prepare_sample() without a header if not needed.
Also it checks the filtered_sample_type to avoi
perf/core: Introduce perf_prepare_header()
Factor out perf_prepare_header() so that it can call perf_prepare_sample() without a header if not needed.
Also it checks the filtered_sample_type to avoid duplicate work when perf_prepare_sample() is called twice (or more).
Suggested-by: Peter Zijlstr <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
show more ...
|
#
eb55b455 |
| 18-Jan-2023 |
Namhyung Kim <namhyung@kernel.org> |
perf/core: Add perf_sample_save_brstack() helper
When we saves the branch stack to the perf sample data, we needs to update the sample flags and the dynamic size. To make sure this is done consiste
perf/core: Add perf_sample_save_brstack() helper
When we saves the branch stack to the perf sample data, we needs to update the sample flags and the dynamic size. To make sure this is done consistently, add the perf_sample_save_brstack() helper and convert all call sites.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230118060559.615653-5-namhyung@kernel.org
show more ...
|
#
31046500 |
| 18-Jan-2023 |
Namhyung Kim <namhyung@kernel.org> |
perf/core: Add perf_sample_save_callchain() helper
When we save the callchain to the perf sample data, we need to update the sample flags and the dynamic size. To ensure this is done consistently,
perf/core: Add perf_sample_save_callchain() helper
When we save the callchain to the perf sample data, we need to update the sample flags and the dynamic size. To ensure this is done consistently, add the perf_sample_save_callchain() helper and convert all call sites.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230118060559.615653-3-namhyung@kernel.org
show more ...
|
#
c87a3109 |
| 04-Jan-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86: Support Retire Latency
Retire Latency reports the number of elapsed core clocks between the retirement of the instruction indicated by the Instruction Pointer field of the PEBS record and
perf/x86: Support Retire Latency
Retire Latency reports the number of elapsed core clocks between the retirement of the instruction indicated by the Instruction Pointer field of the PEBS record and the retirement of the prior instruction. It's enumerated by the IA32_PERF_CAPABILITIES.PEBS_TIMING_INFO[17].
Add flag PMU_FL_RETIRE_LATENCY to indicate the availability of the feature.
The Retire Latency is not supported by the fixed counter 0 on p-core of MTL.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-3-kan.liang@linux.intel.com
show more ...
|
#
38aaf921 |
| 04-Jan-2023 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86: Add Meteor Lake support
From PMU's perspective, Meteor Lake is similar to Alder Lake. Both are hybrid platforms, with e-core and p-core.
The key differences include: - The e-core supports
perf/x86: Add Meteor Lake support
From PMU's perspective, Meteor Lake is similar to Alder Lake. Both are hybrid platforms, with e-core and p-core.
The key differences include: - The e-core supports 2 PDIST GP counters (GP0 & GP1) - New MSRs for the Module Snoop Response Events on the e-core. - New Data Source fields are introduced for the e-core. - There are 8 GP counters for the e-core. - The load latency AUX event is not required for the p-core anymore. - Retire Latency (Support in a separate patch) for both cores.
Since most of the code in the intel_pmu_init() should be the same as Alder Lake, to avoid code duplication, share the path with Alder Lake.
Add new specific functions of extra_regs, and get_event_constraints to support the OCR events, Module Snoop Response Events and 2 PDIST GP counters on e-core.
Add new MTL specific mem_attrs which drops the load latency AUX event.
The Data Source field is extended to 4:0, which can contains max 32 sources.
The Retire Latency is implemented with a separate patch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-2-kan.liang@linux.intel.com
show more ...
|
#
0916886b |
| 31-Oct-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Fix pebs event constraints for SPR
According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility for SPR.
Fixes: 61b985e3e775 ("perf/x86/i
perf/x86/intel: Fix pebs event constraints for SPR
According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility for SPR.
Fixes: 61b985e3e775 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221031154119.571386-2-kan.liang@linux.intel.com
show more ...
|
#
acc5568b |
| 31-Oct-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel: Fix pebs event constraints for ICL
According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility.
Fixes: 6017608936c1 ("perf/x86/intel: Ad
perf/x86/intel: Fix pebs event constraints for ICL
According to the latest event list, update the MEM_INST_RETIRED events which support the DataLA facility.
Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Reported-by: Jannis Klinkenberg <jannis.klinkenberg@rwth-aachen.de> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20221031154119.571386-1-kan.liang@linux.intel.com
show more ...
|
#
bd275681 |
| 08-Oct-2022 |
Peter Zijlstra <peterz@infradead.org> |
perf: Rewrite core context handling
There have been various issues and limitations with the way perf uses (task) contexts to track events. Most notable is the single hardware PMU task context, which
perf: Rewrite core context handling
There have been various issues and limitations with the way perf uses (task) contexts to track events. Most notable is the single hardware PMU task context, which has resulted in a number of yucky things (both proposed and merged).
Notably: - HW breakpoint PMU - ARM big.little PMU / Intel ADL PMU - Intel Branch Monitoring PMU - AMD IBS PMU - S390 cpum_cf PMU - PowerPC trace_imc PMU
*Current design:*
Currently we have a per task and per cpu perf_event_contexts:
task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context ^ | ^ | ^ `---------------------------------' | `--> pmu ---' v ^ perf_event ------'
Each task has an array of pointers to a perf_event_context. Each perf_event_context has a direct relation to a PMU and a group of events for that PMU. The task related perf_event_context's have a pointer back to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which includes a perf_event_context, which again has a direct relation to that PMU, and a group of events for that PMU.
The perf_cpu_context also tracks which task context is currently associated with that CPU and includes a few other things like the hrtimer for rotation etc.
Each perf_event is then associated with its PMU and one perf_event_context.
*Proposed design:*
New design proposed by this patch reduce to a single task context and a single CPU context but adds some intermediate data-structures:
task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context ^ | ^ ^ `---------------------------' | | | | perf_cpu_pmu_context <--. | `----. ^ | | | | | | v v | | ,--> perf_event_pmu_context | | | | | | | v v | perf_event ---> pmu ----------------'
With the new design, perf_event_context will hold all events for all pmus in the (respective pinned/flexible) rbtrees. This can be achieved by adding pmu to rbtree key:
{cpu, pmu, cgroup, group_index}
Each perf_event_context carries a list of perf_event_pmu_context which is used to hold per-pmu-per-context state. For example, it keeps track of currently active events for that pmu, a pmu specific task_ctx_data, a flag to tell whether rotation is required or not etc.
Additionally, perf_cpu_pmu_context is used to hold per-pmu-per-cpu state like hrtimer details to drive the event rotation, a pointer to perf_event_pmu_context of currently running task and some other ancillary information.
Each perf_event is associated to it's pmu, perf_event_context and perf_event_pmu_context.
Further optimizations to current implementation are possible. For example, ctx_resched() can be optimized to reschedule only single pmu events.
Much thanks to Ravi for picking this up and pushing it towards completion.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221008062424.313-1-ravi.bangoria@amd.com
show more ...
|
#
7b084630 |
| 21-Sep-2022 |
Namhyung Kim <namhyung@kernel.org> |
perf: Use sample_flags for addr
Use the new sample_flags to indicate whether the addr field is filled by the PMU driver. As most PMU drivers pass 0, it can set the flag only if it has a non-zero va
perf: Use sample_flags for addr
Use the new sample_flags to indicate whether the addr field is filled by the PMU driver. As most PMU drivers pass 0, it can set the flag only if it has a non-zero value. And use 0 in perf_sample_output() if it's not filled already.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220921220032.2858517-1-namhyung@kernel.org
show more ...
|
#
3749d33e |
| 08-Sep-2022 |
Namhyung Kim <namhyung@kernel.org> |
perf: Use sample_flags for callchain
So that it can call perf_callchain() only if needed. Historically it used __PERF_SAMPLE_CALLCHAIN_EARLY but we can do that with sample_flags in the struct perf_
perf: Use sample_flags for callchain
So that it can call perf_callchain() only if needed. Historically it used __PERF_SAMPLE_CALLCHAIN_EARLY but we can do that with sample_flags in the struct perf_sample_data.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220908214104.3851807-1-namhyung@kernel.org
show more ...
|
#
ee9db0e1 |
| 01-Sep-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf: Use sample_flags for txn
Use the new sample_flags to indicate whether the txn field is filled by the PMU driver.
Remove the txn field from the perf_sample_data_init() to minimize the number o
perf: Use sample_flags for txn
Use the new sample_flags to indicate whether the txn field is filled by the PMU driver.
Remove the txn field from the perf_sample_data_init() to minimize the number of cache lines touched.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-7-kan.liang@linux.intel.com
show more ...
|
#
e16fd7f2 |
| 01-Sep-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf: Use sample_flags for data_src
Use the new sample_flags to indicate whether the data_src field is filled by the PMU driver.
Remove the data_src field from the perf_sample_data_init() to minimi
perf: Use sample_flags for data_src
Use the new sample_flags to indicate whether the data_src field is filled by the PMU driver.
Remove the data_src field from the perf_sample_data_init() to minimize the number of cache lines touched.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-6-kan.liang@linux.intel.com
show more ...
|
#
2abe681d |
| 01-Sep-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf: Use sample_flags for weight
Use the new sample_flags to indicate whether the weight field is filled by the PMU driver.
Remove the weight field from the perf_sample_data_init() to minimize the
perf: Use sample_flags for weight
Use the new sample_flags to indicate whether the weight field is filled by the PMU driver.
Remove the weight field from the perf_sample_data_init() to minimize the number of cache lines touched.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-5-kan.liang@linux.intel.com
show more ...
|
#
a9a931e2 |
| 01-Sep-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf: Use sample_flags for branch stack
Use the new sample_flags to indicate whether the branch stack is filled by the PMU driver.
Remove the br_stack from the perf_sample_data_init() to minimize t
perf: Use sample_flags for branch stack
Use the new sample_flags to indicate whether the branch stack is filled by the PMU driver.
Remove the br_stack from the perf_sample_data_init() to minimize the number of cache lines touched.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-4-kan.liang@linux.intel.com
show more ...
|
#
47a3aeb3 |
| 01-Sep-2022 |
Kan Liang <kan.liang@linux.intel.com> |
perf/x86/intel/pebs: Fix PEBS timestamps overwritten
The PEBS TSC-based timestamps do not appear correctly in the final perf.data output file from perf record.
The data->time field setup by PEBS in
perf/x86/intel/pebs: Fix PEBS timestamps overwritten
The PEBS TSC-based timestamps do not appear correctly in the final perf.data output file from perf record.
The data->time field setup by PEBS in the setup_pebs_fixed_sample_data() is later overwritten by perf_events generic code in perf_prepare_sample(). There is an ordering problem.
Set the sample flags when the data->time is updated by PEBS. The data->time field will not be overwritten anymore.
Reported-by: Andreas Kogler <andreas.kogler.0x@gmail.com> Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220901130959.1285717-3-kan.liang@linux.intel.com
show more ...
|