diff options
author | Caio Marcelo de Oliveira Filho <[email protected]> | 2020-01-14 12:03:22 -0800 |
---|---|---|
committer | Marge Bot <[email protected]> | 2020-01-21 23:41:35 +0000 |
commit | 45164fc8c58f07b75b62057d097e85c45bc157fe (patch) | |
tree | 92b971d8e852e6b5ba623e20e5ff7a3a693302ef /src | |
parent | 4f431e870c7239879bc7c7f54f65184d5b391815 (diff) |
intel/fs: Don't emit control barrier if only one thread is used
When there's only one hardware thread (i.e. the dispatch width greater
or equal to the workgroup size), there's no need to use a barrier to
ensure all the invocations reach the same point in the shader, because
they are already running lock-step.
Results for SKL running Iris for shader-db tests with compute shaders
total sends in shared programs: 18361 -> 18339 (-0.12%)
sends in affected programs: 904 -> 882 (-2.43%)
helped: 9
HURT: 0
helped stats (abs) min: 1 max: 5 x̄: 2.44 x̃: 2
helped stats (rel) min: 0.84% max: 21.43% x̄: 7.82% x̃: 2.67%
95% mean confidence interval for sends value: -3.31 -1.58
95% mean confidence interval for sends %-change: -14.67% -0.97%
Sends are helped.
Shaders from Aztec Ruins, Car Chase, Manhattan and DeusEx are helped.
Results for ICL and TGL are similar to SKL.
Results for BDW are similar to SKL except for DeusEx shader that has a
workgroup size 16 but in BDW picks the SIMD8.
Reviewed-by: Francisco Jerez <[email protected]>
Tested-by: Marge Bot <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/merge_requests/3226>
Diffstat (limited to 'src')
-rw-r--r-- | src/intel/compiler/brw_fs_nir.cpp | 9 |
1 files changed, 9 insertions, 0 deletions
diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp index 0b4d50c56e3..a861a1d938d 100644 --- a/src/intel/compiler/brw_fs_nir.cpp +++ b/src/intel/compiler/brw_fs_nir.cpp @@ -3720,6 +3720,15 @@ fs_visitor::nir_emit_cs_intrinsic(const fs_builder &bld, switch (instr->intrinsic) { case nir_intrinsic_control_barrier: + /* The whole workgroup fits in a single HW thread, so all the + * invocations are already executed lock-step. Instead of an actual + * barrier just emit a scheduling fence, that will generate no code. + */ + if (workgroup_size() <= dispatch_width) { + bld.exec_all().group(1, 0).emit(FS_OPCODE_SCHEDULING_FENCE); + break; + } + emit_barrier(); cs_prog_data->uses_barrier = true; break; |