i965/fs/gen7: split instructions that run into exec masking bugs

In fp64 we can produce code like this: mov(16) vgrf2<2>:UD, vgrf3<2>:UD That our simd lowering pass would typically split in instructions with a width of 8, writing to two consecutive registers each. Unfortunately, gen7 hardware has a bug affecting execution masking and as a result, the second GRF register write won't work properly. Curro verified this: "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is the 8-bit quarter of the execution mask signals specified in the instruction control fields) for the second compressed half of any single-precision instruction (for double-precision instructions it's hardwired to use NibCtrl+1, at least on HSW), which means that the EU will apply the wrong execution controls for the second sequential GRF write if the number of channels per GRF is not exactly eight in single-precision mode (or four in double-float mode)." In practice, this means that we cannot write more than one consecutive GRF in a single instruction if the number of channels per GRF is not exactly eight in single-precision mode (or four in double-float mode). This patch makes our SIMD lowering pass split this kind of instructions so that the split versions only write to a single register. In the example above this means that we split the write in 4 instructions, each one writing 4 UD elements (width = 4) to a single register. v2 (Curro): - Make explicit that the thing about hardwiring NibCtrl+1 for the second compressed half is known to happen in Haswell and the issue with IVB might not be exactly the same. - Assign max_width instead of returning early so that we can handle multiple restrictions affecting to the same instruction. - Avoid division by 0 if the instruction does not write any registers. - Ignore instructions what have WE_all set. - Use the instruction execution type size instead of the dst type size. v3 (Curro): - Move the implementation down so it is not placed in the middle of another workaround. - Declare channels_per_grf as const. - Don't break the loop early if we find a BAD_FILE source. - Fix the number of channels that the hardware shifts for the second half of a compressed instruction to be 8 in single precision and 4 in double precision. Reviewed-by: Francisco Jerez <[email protected]>
author: Iago Toral Quiroga <[email protected]> 2016-03-30 14:00:31 +0200
committer: Samuel Iglesias Gonsálvez <[email protected]> 2016-07-13 07:09:41 +0200
commit: aa4796ae815f38ff44283476f3553edc06114e80 (patch)
tree: f19336171a259a797eb84e24d057ff180f232bf1 /src
parent: 87a13f598b1ecd50bc209088cf1dc60fd90df015 (diff)
1 files changed, 29 insertions, 0 deletions
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 2f473ccda93..6ed98f53f79 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -4755,6 +4755,35 @@ get_fpu_lowered_simd_width(const struct brw_device_info *devinfo,
    if (inst->is_3src(devinfo) && !devinfo->supports_simd16_3src)
       max_width = MIN2(max_width, inst->exec_size / reg_count);
 
+   /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is
+    * the 8-bit quarter of the execution mask signals specified in the
+    * instruction control fields) for the second compressed half of any
+    * single-precision instruction (for double-precision instructions
+    * it's hardwired to use NibCtrl+1, at least on HSW), which means that
+    * the EU will apply the wrong execution controls for the second
+    * sequential GRF write if the number of channels per GRF is not exactly
+    * eight in single-precision mode (or four in double-float mode).
+    *
+    * In this situation we calculate the maximum size of the split
+    * instructions so they only ever write to a single register.
+    */
+   if (devinfo->gen < 8 && inst->regs_written > 1 &&
+       !inst->force_writemask_all) {
+      const unsigned channels_per_grf = inst->exec_size / inst->regs_written;
+      unsigned exec_type_size = 0;
+      for (int i = 0; i < inst->sources; i++) {
+         if (inst->src[i].file != BAD_FILE)
+            exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type));
+      }
+      assert(exec_type_size);
+
+      /* The hardware shifts exactly 8 channels per compressed half of the
+       * instruction in single-precision mode and exactly 4 in double-precision.
+       */
+      if (channels_per_grf != (exec_type_size == 8 ? 4 : 8))
+         max_width = MIN2(max_width, channels_per_grf);
+   }
+
    /* Only power-of-two execution sizes are representable in the instruction
     * control fields.
     */
author	Iago Toral Quiroga <[email protected]>	2016-03-30 14:00:31 +0200
committer	Samuel Iglesias Gonsálvez <[email protected]>	2016-07-13 07:09:41 +0200
commit	aa4796ae815f38ff44283476f3553edc06114e80 (patch)
tree	f19336171a259a797eb84e24d057ff180f232bf1 /src
parent	87a13f598b1ecd50bc209088cf1dc60fd90df015 (diff)