path: root/src/intel/compiler/brw_vec4.h
author: Francisco Jerez <[email protected]> 2020-03-26 14:59:02 -0700
committer: Francisco Jerez <[email protected]> 2020-04-28 23:01:03 -0700
commit: 188a3659aea6dec9acf1c2fd15fcaecffe4f7d4e
tree: edbd8d2f8da9c64a390bf5eb61531e6cf963dc19 /src/intel/compiler/brw_vec4.h
parent: c8ce1cfc9c115032aaaede691c5fe6f92c0e6168
intel/ir: Import shader performance analysis pass.
This introduces an analysis pass intended to estimate several
performance statistics of the shader, including cycle count latency
and throughput values, based on static modeling. It has instruction
performance information more comprehensive than the current scheduling
pass for all platforms between Gen4-11, and works on both the FS and
VEC4 back-ends.

The most immediate purpose of this pass is to implement a heuristic
meant to determine whether using SIMD32 dispatch for a fragment shader
can be expected to help more than it hurts. In addition, this will
allow the effect of passes run after scheduling (e.g. the TGL software
scoreboard pass and the VEC4 dependency control pass) to be visible in
shader-db statistics.

But that isn't the end of the story. Other potential applications of
this pass (not part of this MR) I've been playing around with are:

 - Implement a similar SIMD16 heuristic allowing the identification of
   inefficient SIMD16 fragment shaders.

 - Implement similar SIMD16 and SIMD32 heuristics for the compute
   shader stage -- Currently compute shader builds always use the
   SIMD16 shader if available and never use the SIMD32 shader unless
   strictly necessary, which is suboptimal under certain conditions.

 - Hook up to the instruction scheduler in order to improve the
   accuracy of its timing information.

 - Use it as a heuristic to drive the selection of scheduling modes
   (Matt was experimenting with that).

 - Plug into the TGL software scoreboard pass in order to implement a
   more effective SBID token allocation algorithm, since in general
   the optimal token allocation depends on the timings of all
   instructions in the program.

 - Use its bottleneck detection functionality in order to implement a
   heuristic computing a tighter bound for the number of fragment
   shader threads executed in parallel (by adjusting the
   MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).

As a follow-up I'm planning to submit updated timing information for
Gen12 platforms -- Everything else required to support Gen12, like SWSB
handling, is already included in this patch, but there were some IP
concerns regarding the TGL timing parameters, since they cannot
currently be obtained with the documentation and hardware which is
publicly available. The timing parameters for any previous Gen7-11
platforms can be obtained by anyone by sampling the timestamp register
using e.g. shader_time, though I have some more convenient
instrumentation coming up.

Reviewed-by: Kenneth Graunke <[email protected]>
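
The SIMD32 decision described above can be pictured as a comparison of the throughput estimates for the two dispatch widths. The following is only a minimal sketch under stated assumptions: the perf_estimate struct, its field names, and simd32_expected_to_help() are hypothetical illustrations and are not taken from brw_ir_performance.h or from this patch, and the actual heuristic may weigh further factors.

```cpp
#include <cassert>

// Hypothetical stand-in for the statistics the analysis pass estimates for
// one compiled shader variant. The names are illustrative assumptions only.
struct perf_estimate {
   unsigned latency;   // Estimated cycles for one thread to run to completion.
   float throughput;   // Estimated invocations retired per cycle.
};

// Assumed decision rule: SIMD32 is worth dispatching only if its estimated
// throughput is strictly higher than that of the SIMD16 variant. A wider
// variant may have higher latency and still win, since it covers twice as
// many invocations per thread.
inline bool
simd32_expected_to_help(const perf_estimate &simd16,
                        const perf_estimate &simd32)
{
   assert(simd16.throughput >= 0.0f && simd32.throughput >= 0.0f);
   return simd32.throughput > simd16.throughput;
}
```

Comparing throughput rather than latency reflects the wording of the message above, which asks whether SIMD32 helps more than it hurts: a SIMD32 shader that spills or stalls can end up retiring fewer invocations per cycle than its SIMD16 counterpart despite the wider dispatch.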
Diffstat (limited to 'src/intel/compiler/brw_vec4.h')
-rw-r--r-- src/intel/compiler/brw_vec4.h | 3
1 files changed, 3 insertions, 0 deletions
diff --git a/src/intel/compiler/brw_vec4.h b/src/intel/compiler/brw_vec4.h
index 1f2d922b186..aa93b05d5af 100644
--- a/src/intel/compiler/brw_vec4.h
+++ b/src/intel/compiler/brw_vec4.h
@@ -28,6 +28,7 @@
#ifdef __cplusplus
#include "brw_ir_vec4.h"
+#include "brw_ir_performance.h"
#include "brw_vec4_builder.h"
#include "brw_vec4_live_variables.h"
#endif
@@ -107,6 +108,8 @@ public:
unsigned int max_grf;
BRW_ANALYSIS(live_analysis, brw::vec4_live_variables,
backend_shader *) live_analysis;
+ BRW_ANALYSIS(performance_analysis, brw::performance,
+ vec4_visitor *) performance_analysis;
bool need_all_constants_in_pull_buffer;