author | Francisco Jerez <[email protected]> | 2020-03-26 14:59:02 -0700 |
---|---|---|
committer | Francisco Jerez <[email protected]> | 2020-04-28 23:01:03 -0700 |
commit | 188a3659aea6dec9acf1c2fd15fcaecffe4f7d4e (patch) | |
tree | edbd8d2f8da9c64a390bf5eb61531e6cf963dc19 /src/intel/compiler/brw_vec4.h | |
parent | c8ce1cfc9c115032aaaede691c5fe6f92c0e6168 (diff) |
intel/ir: Import shader performance analysis pass.
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling. It has more
comprehensive instruction performance information than the current
scheduling pass for all platforms from Gen4 to Gen11, and works on
both the FS and VEC4 back-ends.
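
To make the idea of static modeling concrete, here is a deliberately simplified sketch of an in-order timing model. It is not the pass introduced by this commit: `toy_inst`, `toy_stats` and `estimate` are hypothetical names invented for the example, and the single issue port and the definition of throughput as dispatch width divided by total cycles are assumptions of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical instruction record used by this toy model only.
struct toy_inst {
   int dst;              // destination register index, or -1 for none
   std::vector<int> src; // source register indices
   unsigned issue;       // cycles the instruction occupies the issue port
   unsigned latency;     // cycles from issue until dst becomes readable
};

struct toy_stats {
   unsigned latency;  // estimated cycles until the last result is ready
   double throughput; // estimated invocations per cycle (assumed unit)
};

toy_stats
estimate(const std::vector<toy_inst> &prog, unsigned dispatch_width,
         std::size_t num_regs)
{
   std::vector<unsigned> ready(num_regs, 0); // cycle each register is readable
   unsigned clock = 0;  // cycle at which the issue port is next free
   unsigned finish = 0; // cycle at which the last result becomes ready

   for (const toy_inst &inst : prog) {
      // In-order issue: stall until every source operand is ready.
      unsigned start = clock;
      for (int r : inst.src)
         start = std::max(start, ready[r]);

      clock = start + inst.issue;
      const unsigned done = start + inst.latency;
      if (inst.dst >= 0)
         ready[inst.dst] = done;
      finish = std::max({finish, clock, done});
   }

   return { finish, finish ? double(dispatch_width) / finish : 0.0 };
}
```

In this toy model a dependent instruction cannot start before its sources are ready, so a chain of long-latency instructions raises the latency estimate and lowers the throughput estimate accordingly.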
The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts. In addition, this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.
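
As a rough illustration of the kind of SIMD32 heuristic described above, the sketch below compares the estimated throughput of a SIMD16 and a SIMD32 variant of the same shader. The `perf_estimate` struct and `simd32_is_worthwhile` function are hypothetical names made up for this example, not the interface added by this commit, and the throughput unit (invocations per cycle) is an assumption.

```cpp
#include <cassert>

// Hypothetical summary of a shader variant's static performance estimate.
// Neither the type nor its fields are taken from the actual patch.
struct perf_estimate {
   unsigned dispatch_width; // SIMD width of the variant (16 or 32 here)
   double throughput;       // assumed unit: shader invocations per cycle
};

// Returns true if the SIMD32 variant is estimated to process invocations
// at least as fast as the SIMD16 variant.
inline bool
simd32_is_worthwhile(const perf_estimate &simd16, const perf_estimate &simd32)
{
   assert(simd16.dispatch_width == 16 && simd32.dispatch_width == 32);
   return simd32.throughput >= simd16.throughput;
}
```

A caller would build both variants, estimate each one, and drop the SIMD32 variant when this predicate returns false.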
But that isn't the end of the story. Other potential applications of
this pass (not part of this MR) that I've been playing around with are:
- Implement a similar SIMD16 heuristic allowing the identification of
inefficient SIMD16 fragment shaders.
- Implement similar SIMD16 and SIMD32 heuristics for the compute
shader stage -- Currently compute shader builds always use the
SIMD16 shader if available and never use the SIMD32 shader unless
strictly necessary, which is suboptimal under certain conditions.
- Hook up to the instruction scheduler in order to improve the
accuracy of its timing information.
- Use as a heuristic in order to drive the selection of scheduling
modes (Matt was experimenting with that).
- Plug into the TGL software scoreboard pass in order to implement a
more effective SBID token allocation algorithm, since in general
the optimal token allocation depends on the timings of all
instructions in the program.
- Use its bottleneck detection functionality in order to implement a
heuristic computing a more optimal bound for the number of fragment
shader threads executed in parallel (by adjusting the
MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).
As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12 like SWSB
handling is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters since they cannot
currently be obtained with the documentation and hardware that are
publicly available. The timing parameters for any previous Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.
Reviewed-by: Kenneth Graunke <[email protected]>
Diffstat (limited to 'src/intel/compiler/brw_vec4.h')
-rw-r--r-- | src/intel/compiler/brw_vec4.h | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/src/intel/compiler/brw_vec4.h b/src/intel/compiler/brw_vec4.h
index 1f2d922b186..aa93b05d5af 100644
--- a/src/intel/compiler/brw_vec4.h
+++ b/src/intel/compiler/brw_vec4.h
@@ -28,6 +28,7 @@
 #ifdef __cplusplus
 #include "brw_ir_vec4.h"
+#include "brw_ir_performance.h"
 #include "brw_vec4_builder.h"
 #include "brw_vec4_live_variables.h"
 #endif
@@ -107,6 +108,8 @@ public:
    unsigned int max_grf;
    BRW_ANALYSIS(live_analysis, brw::vec4_live_variables, backend_shader *) live_analysis;
+   BRW_ANALYSIS(performance_analysis, brw::performance,
+                vec4_visitor *) performance_analysis;
    bool need_all_constants_in_pull_buffer;
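
The hunk above adds a `performance_analysis` member next to the existing `live_analysis` one. A common way to implement members of this kind is a wrapper that computes a result lazily and caches it until it is invalidated. The sketch below shows that general pattern only: `lazy_analysis`, `require()` and `invalidate()` as written here are illustrative names for this example and should not be read as the definition of `BRW_ANALYSIS`.

```cpp
#include <memory>

// Hypothetical lazily evaluated analysis holder. Analysis must be
// constructible from a single Context argument.
template <class Analysis, class Context>
class lazy_analysis {
public:
   explicit lazy_analysis(Context ctx) : ctx(ctx) {}

   // Compute the analysis on first use and reuse the cached result after.
   const Analysis &
   require()
   {
      if (!cached)
         cached.reset(new Analysis(ctx));
      return *cached;
   }

   // Drop the cached result, e.g. after the program has been modified.
   void
   invalidate()
   {
      cached.reset();
   }

private:
   Context ctx;
   std::unique_ptr<Analysis> cached;
};
```

With this pattern, passes that only read the program can call `require()` repeatedly at no extra cost, while passes that modify it call `invalidate()` so the next reader triggers a recomputation.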