aboutsummaryrefslogtreecommitdiffstats
path: root/src/intel/compiler/brw_fs_generator.cpp
diff options
context:
space:
mode:
authorCaio Marcelo de Oliveira Filho <[email protected]>2020-01-17 15:07:44 -0800
committerMarge Bot <[email protected]>2020-04-29 07:17:27 +0000
commitf858fa26b4cca8834c8687f01d2ba431fcc8e006 (patch)
tree3a338d70fc1482426d62c766e4e9e58c65877e82 /src/intel/compiler/brw_fs_generator.cpp
parent0e96b0d6dd99e80c1ccbc13629ad22a946a74828 (diff)
intel/fs,vec4: Pull stall logic for memory fences up into the IR
Instead of emitting the stall MOV "inside" the SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when creating the IR. For IvyBridge, every (data cache) fence is accompained by a render cache fence, that now is explicit in the IR, two SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs). Because Begin and End interlock intrinsics are effectively memory barriers, move its handling alongside the other memory barrier intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish if we are going to use a SENDC (for Begin) or regular SEND (for End). This change is a preparation to allow emitting both SENDs in Gen11+ before we can stall on them. Shader-db results for IVB (i965): total instructions in shared programs: 11971190 -> 11971200 (<.01%) instructions in affected programs: 11482 -> 11492 (0.09%) helped: 0 HURT: 8 HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1 HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10% 95% mean confidence interval for instructions value: 0.66 1.84 95% mean confidence interval for instructions %-change: 0.01% 0.27% Instructions are HURT. Unlike the previous code, that used the `mov g1 g2` trick to force both `g1` and `g2` to stall, the scheduling fence will generate `mov null g1` and `mov null g2`. During review it was decided it was not worth keeping the special codepath for the small effect will have. Shader-db results for HSW (i965), BDW and SKL don't have a change on instruction count, but do report changes in cycles count, showing SKL results below total cycles in shared programs: 341738444 -> 341710570 (<.01%) cycles in affected programs: 7240002 -> 7212128 (-0.38%) helped: 46 HURT: 5 helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154 helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95% HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362 HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08% 95% mean confidence interval for cycles value: -777.71 -315.38 95% mean confidence interval for cycles %-change: -1.42% -0.83% Cycles are helped. This seems to be the effect of allocating two registers separatedly instead of a single one with size 2, which causes different register allocation, affecting the cycle estimates. while ICL also has not change on instruction count but report changes negative changes in cycles total cycles in shared programs: 352665369 -> 352707484 (0.01%) cycles in affected programs: 9608288 -> 9650403 (0.44%) helped: 4 HURT: 104 helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101 helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49% HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48 HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45% 95% mean confidence interval for cycles value: 256.67 523.24 95% mean confidence interval for cycles %-change: 0.63% 1.03% Cycles are HURT. AFAICT this is the result of the case above. Shader-db results for TGL have similar cycles result as ICL, but also affect instructions total instructions in shared programs: 17690586 -> 17690597 (<.01%) instructions in affected programs: 64617 -> 64628 (0.02%) helped: 55 HURT: 32 helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3 helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74% HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2 HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69% 95% mean confidence interval for instructions value: -2.03 2.28 95% mean confidence interval for instructions %-change: -0.41% 0.15% Inconclusive result (value mean confidence interval includes 0). Now that more is done in the IR, more dependencies are visible and more SWSB annotations are emitted. Mixed with different register allocation decisions like above, some shaders will see more `sync nops` while others able to avoid them. Most of the new `sync nops` are also redundant and could be dropped, which will be fixed in a separate change. Reviewed-by: Francisco Jerez <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
Diffstat (limited to 'src/intel/compiler/brw_fs_generator.cpp')
-rw-r--r--src/intel/compiler/brw_fs_generator.cpp20
1 files changed, 10 insertions, 10 deletions
diff --git a/src/intel/compiler/brw_fs_generator.cpp b/src/intel/compiler/brw_fs_generator.cpp
index b055110ddf8..5825e0770d4 100644
--- a/src/intel/compiler/brw_fs_generator.cpp
+++ b/src/intel/compiler/brw_fs_generator.cpp
@@ -2217,13 +2217,19 @@ fs_generator::generate_code(const cfg_t *cfg, int dispatch_width,
generate_shader_time_add(inst, src[0], src[1], src[2]);
break;
+ case SHADER_OPCODE_INTERLOCK:
case SHADER_OPCODE_MEMORY_FENCE: {
assert(src[1].file == BRW_IMMEDIATE_VALUE);
assert(src[2].file == BRW_IMMEDIATE_VALUE);
- const unsigned sends =
- brw_memory_fence(p, dst, src[0], BRW_OPCODE_SEND, src[1].ud,
- src[2].ud);
- send_count += sends;
+
+ const enum opcode send_op = inst->opcode == SHADER_OPCODE_INTERLOCK ?
+ BRW_OPCODE_SENDC : BRW_OPCODE_SEND;
+
+ brw_memory_fence(p, dst, src[0], send_op,
+ brw_message_target(inst->sfid),
+ /* commit_enable */ src[1].ud,
+ /* bti */ src[2].ud);
+ send_count++;
break;
}
@@ -2257,12 +2263,6 @@ fs_generator::generate_code(const cfg_t *cfg, int dispatch_width,
break;
- case SHADER_OPCODE_INTERLOCK:
- assert(devinfo->gen >= 9);
- /* The interlock is basically a memory fence issued via sendc */
- brw_memory_fence(p, dst, src[0], BRW_OPCODE_SENDC, false, /* bti */ 0);
- break;
-
case SHADER_OPCODE_FIND_LIVE_CHANNEL: {
const struct brw_reg mask =
brw_stage_has_packed_dispatch(devinfo, stage,