path: root/src/intel/compiler
Commit message (Collapse) | Author | Age | Files | Lines
* intel/nir: Call nir_metadata_preserve on !progressJason Ekstrand2020-06-115-6/+20
| | | | | Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5171>
* nir: Call nir_metadata_preserve on !progressJason Ekstrand2020-06-111-0/+2
| | | | | | Reviewed-by: Alyssa Rosenzweig <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5171>
* intel/fs: Add Fall-through commentCaio Marcelo de Oliveira Filho2020-06-081-0/+1
| | | | | | | | | Just to clarify the missing break is intentional. Reviewed-by: Jordan Justen <[email protected]> Reviewed-by: Boris Brezillon <[email protected]> Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5365>
* intel/compiler: Extract control barriers from scoped barriersBoris Brezillon2020-06-034-0/+87
Add a lowering pass that extracts the control barriers embedded in scoped barriers into proper control barriers, so we can get rid of the logic that inserts control barriers when an SpvOpControlBarrier with Workgroup scope is parsed in spirv_to_nir(). Signed-off-by: Boris Brezillon <[email protected]> Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4900>
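A minimal sketch of what such an extraction pass can look like in NIR, assuming the generated execution-scope accessors for the scoped-barrier intrinsic; the in-tree pass may be structured differently:

    static bool
    extract_control_barrier(nir_builder *b, nir_intrinsic_instr *intrin)
    {
       if (intrin->intrinsic != nir_intrinsic_scoped_barrier ||
           nir_intrinsic_execution_scope(intrin) == NIR_SCOPE_NONE)
          return false;

       /* Emit a plain control barrier right before the scoped barrier... */
       b->cursor = nir_before_instr(&intrin->instr);
       nir_intrinsic_instr *ctrl =
          nir_intrinsic_instr_create(b->shader, nir_intrinsic_control_barrier);
       nir_builder_instr_insert(b, &ctrl->instr);

       /* ...and strip the execution-scope half from the original intrinsic,
        * leaving only its memory-barrier semantics behind.
        */
       nir_intrinsic_set_execution_scope(intrin, NIR_SCOPE_NONE);
       return true;
    }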
* nir: Replace the scoped_memory barrier by a scoped_barrierBoris Brezillon2020-06-033-5/+8
SPIR-V's OpControlBarrier can carry both a memory and a control barrier, which some hardware can handle with a single instruction. Let's turn the scoped_memory_barrier into a scoped barrier which can embed both barrier types. Note that control-only or memory-only barriers can be supported through this new intrinsic by passing NIR_SCOPE_NONE for the unused barrier type. Signed-off-by: Boris Brezillon <[email protected]> Suggested-by: Caio Marcelo de Oliveira Filho <[email protected]> Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4900>
* nir: add callback to nir_remove_dead_variables()Timothy Arceri2020-06-031-3/+3
This allows us to do API-specific checks before removing a variable, without filling nir_remove_dead_variables() with API-specific code. In the following patches we will use this to support the removal of dead uniforms in GLSL. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Eric Anholt <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4797>
* meson: use gnu_symbol_visibility argumentDylan Baker2020-06-011-2/+2
This uses a Meson builtin to handle -fvisibility=hidden. This is nice because we don't need to track which languages are used; if C++ is suddenly added, Meson just does the right thing. Acked-by: Matt Turner <[email protected]> Reviewed-by: Eric Engestrom <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4740>
* intel/fs: Emit HALT for discard on Gen4-5Jason Ekstrand2020-05-306-30/+107
Using HALT to immediately jump to the end of the shader is required to implement GL_EXT_gpu_shader4 and OpenGL 3.0. However, vanilla OpenGL 1.2 doesn't forbid it and it likely makes something somewhere faster. We should be consistent and implement the same discard behavior on all hardware if we can. The rules for HALT on Gen4-5 are a bit different from Gen6+. On the older hardware, there is no stack for HALT; instead it's up to software to save and restore mask registers. However, there's no real saving needed since we only use HALT to jump to the end of the program where we're about to do our FB writes. All we need to do is reset AMask to DMask, the value it was initialized to at the start of the thread. Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5244>
* intel/fs: Fix unused texture coordinate zeroing on Gen4-5Jason Ekstrand2020-05-301-1/+2
| | | | | | | | | | We were inserting the right number of MOVs but, thanks to the way we advanced msg_end earlier in the function, were often writing the zeros past the end of where we actually read in the register file. Cc: [email protected] Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5243>
* intel/vec4: Stomp the return type of RESINFO to UINT32Jason Ekstrand2020-05-301-0/+11
| | | | | | | | | We already do this in the FS back-end; we just weren't doing it in vec4 so RESINFO messages weren't returning the right data. Cc: [email protected] Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5243>
* intel: Add helper to calculate GPGPU_WALKER::RightExecutionMaskCaio Marcelo de Oliveira Filho2020-05-271-0/+13
| | | | | | | Suggested by Jason. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
* intel/fs: Generate multiple CS SIMD variants for variable group sizeCaio Marcelo de Oliveira Filho2020-05-272-62/+163
This will make the GL drivers pick the right SIMD variant for the group size set at dispatch time. The heuristic implemented in brw_cs_simd_size_for_group_size() is the same as in brw_compile_cs(). The cs_prog_data::simd_size field was removed. The generated SIMD sizes are marked in a bitmask, which is already used by the drivers via brw_cs_simd_size_for_group_size(). With a variable group size, it is OK if the larger SIMD shaders spill, since we'd need them for the cases where the smaller ones can't hold all the invocations. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
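The dispatch-time selection can be pictured as choosing the smallest generated width whose thread count still fits the hardware limit. The sketch below is illustrative only; the bit layout, thread limit and tie-breaking rules of the real brw_cs_simd_size_for_group_size() may differ:

    /* 'simd_mask' has bit 0 set for SIMD8, bit 1 for SIMD16, bit 2 for SIMD32. */
    static unsigned
    pick_cs_simd_size(unsigned group_size, unsigned simd_mask)
    {
       const unsigned max_threads = 64; /* illustrative per-workgroup thread limit */
       for (unsigned simd = 8; simd <= 32; simd *= 2) {
          if (!(simd_mask & (simd / 8)))
             continue;   /* this width was not generated */
          const unsigned threads = (group_size + simd - 1) / simd;
          if (threads <= max_threads)
             return simd;
       }
       return 0;   /* no generated variant covers this group size */
    }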
* intel/fs: Add helper to get prog_offset and simd_sizeCaio Marcelo de Oliveira Filho2020-05-272-0/+22
| | | | | | | | This indirection will be used by the variable group size case in a later change. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
* intel/fs: Support INTEL_DEBUG=no8,no32 in compute shadersCaio Marcelo de Oliveira Filho2020-05-271-2/+17
The "no32" flag will take precedence over "do32", as is done for FS. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
* intel/fs: Remove min_dispatch_width spilling decision from RACaio Marcelo de Oliveira Filho2020-05-272-38/+30
Move the decision one level up: let the brw_compile_*() functions use the spilling information to decide whether or not a compilation at a certain width may spill (passed via the run_*() functions). The min_dispatch_width was compared with the dispatch_width to decide whether "a previous shader is already available, so don't accept spill". This is replaced by:
- not calling the run_*() functions if it is known beforehand that a smaller width already spilled -- since the larger width will then spill and fail;
- explicitly passing whether or not a shader is allowed to spill.
For the cases where the smaller width is available and hasn't spilled, the larger width will still be compiled but is only useful if it won't spill. Moving the decision to this level will be useful later for variable group size, which is a case where we want all the widths to be allowed to spill. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5142>
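The resulting policy boils down to a small predicate; this is a simplified sketch under assumed names, not the actual brw_compile_*() code:

    struct simd_result { bool compiled; bool spilled; };

    /* Decide whether a wider SIMD variant may spill, given the narrower one. */
    static bool
    allow_spilling(const struct simd_result *narrower, bool variable_group_size)
    {
       /* With a variable group size every width may be needed at dispatch
        * time, so all of them are allowed to spill. */
       if (variable_group_size)
          return true;
       /* Otherwise only allow spilling when there is no non-spilling
        * narrower variant to fall back on. */
       return !(narrower && narrower->compiled && !narrower->spilled);
    }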
* intel/fs: Work around dual-source blending hangs in combination with SIMD16Danylo Piliaiev2020-05-271-2/+6
It was found that dual-source blending hangs with SIMD16 dispatch in some specific but unknown situations, which in the wild happen when RGBA anti-aliasing is enabled for fonts. Cc: <[email protected]> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/2183 Signed-off-by: Danylo Piliaiev <[email protected]> Reviewed-by: Lionel Landwerlin <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5037>
* intel: Use SATURATEAlyssa Rosenzweig2020-05-261-2/+2
| | | | | | Signed-off-by: Alyssa Rosenzweig <[email protected]> Reviewed-by: Eric Engestrom <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5100>
* intel/fs: Remove redundant assert()Caio Marcelo de Oliveira Filho2020-05-261-3/+0
| | | | | | | | This is covered by the two previous similar asserts. Each time `v` is assigned this is asserted. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
* intel/fs: Early return when can't satisfy explicit group sizeCaio Marcelo de Oliveira Filho2020-05-261-8/+11
| | | | | Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
* intel/fs: Remove unused state from brw_nir_lower_cs_intrinsicsCaio Marcelo de Oliveira Filho2020-05-261-16/+11
After 2663759af0e ("intel/fs: Add and use a new load_simd_width_intel intrinsic") the local_workgroup_size is not used anymore except for assertions at the pass' start, so drop it from the state struct. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
* intel/fs: Remove unused emission of load_simd_width_intelCaio Marcelo de Oliveira Filho2020-05-261-5/+0
| | | | | | | | | | The nir_intrinsic_load_simd_width_intel is always lowered by the brw_nir_lower_simd() pass before the emission happens. This is likely a "leftover" from patch rewriting/squashing that happened when this intrinsic was added. Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5213>
* tree-wide: fix deprecated GitLab URLsEric Engestrom2020-05-231-1/+1
| | | | | | | | | | | | | They will stop working in the next GitLab release, so let's update them ASAP to make sure things are propagated to everyone by then. See: https://about.gitlab.com/releases/2020/05/06/gitlab-com-13-0-breaking-changes/#removal-of-deprecated-project-paths Cc: [email protected] Signed-off-by: Eric Engestrom <[email protected]> Acked-by: Alyssa Rosenzweig <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5111>
* i965/vec4: Ignore swizzle of VGRF for use by var_range_end()Andrii Simiklit2020-05-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Issue description from Matt's commit e7c376ad: "var_range_end(v, n) loops over the n components of variable number v and finds the maximum value, giving the last use of any component of v. Therefore it expects v to correspond to the variable associated with the .x channel of the VGRF. var_from_reg() however returns the variable for the first channel of the VGRF, post-swizzle. So, if the last register had a swizzle with y, z, or w in the swizzle component, we would read out of bounds. For any other register, we would read liveness information from the next register. The fix is to convert the src_reg to a dst_reg in order to call the dst_reg version of var_from_reg() that doesn't consider the swizzle." Closes: #3003 Fixes: 48dfb30f ('intel/compiler: Move all live interval analysis results into vec4_live_variables') Reviewed-by: Matt Turner <[email protected]> Signed-off-by: Andrii Simiklit <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4941>
* intel/fs: Use writes_memory from shader_infoCaio Marcelo de Oliveira Filho2020-05-182-25/+3
| | | | | Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4815>
* intel/compiler: fix alignment assert in nir_emit_intrinsicArcady Goldmints-Orlov2020-05-121-1/+1
| | | | | | | | Fixes: c643979228 (intel/fs: Choose memory message type based on bit size) Fixes: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_i8vec2 Reviewed-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5000>
* nir: do not vectorize load/store if offset can overflow and robustness enabledSamuel Pitoiset2020-05-111-1/+2
This prevents vectorization of loads/stores that can overflow when the low offset is negative and the range is greater than or equal to 0. The caller can pass the list of variable modes that matter for robust access. Signed-off-by: Samuel Pitoiset <[email protected]> Reviewed-by: Rhys Perry <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4881>
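As a hedged illustration of the general hazard (not the pass's exact reasoning), model a robust 4-byte load with an explicit bounds check:

    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    /* Out-of-bounds robust accesses return zero instead of faulting. */
    static uint32_t robust_load32(const uint8_t *buf, size_t buf_size, int64_t byte_off)
    {
       if (byte_off < 0 || byte_off + 4 > (int64_t)buf_size)
          return 0;
       uint32_t v;
       memcpy(&v, buf + byte_off, 4);
       return v;
    }

    /* Two scalar loads at byte_off and byte_off + 4 are bounds-checked
     * independently, while a single vectorized 8-byte load at byte_off is
     * checked as one range.  Once byte_off can go negative or sit near the
     * end of the buffer, the two forms no longer return the same values, so
     * the vectorizer must not combine such accesses for modes where
     * robustness is enabled.
     */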
* nir/algebraic: Split ibfe and ubfe with two constant sourcesIan Romanick2020-05-071-0/+1
I also tried splitting ubfe instructions with one or zero constants, and zero shaders in shader-db were affected. The "lost" shader is a compute shader that was promoted from SIMD8 to SIMD16, so is also counted as the gained shader.
v2: Further restrict bfe splitting. bfe with multiple constants is better on at least some Radeon GPUs. Use -x instead of 32-x in shift counts.
v3: Fix the outer shift count for ibfe lowering. Add c=0 optimizations to prevent bad lowering. Both suggested by Rhys. Add shift by -32 optimizations.

Tiger Lake
total instructions in shared programs: 17608764 -> 17596316 (-0.07%) instructions in affected programs: 303765 -> 291317 (-4.10%) helped: 113 HURT: 46 helped stats (abs) min: 1 max: 458 x̄: 120.67 x̃: 21 helped stats (rel) min: 0.09% max: 11.23% x̄: 3.47% x̃: 1.39% HURT stats (abs) min: 1 max: 201 x̄: 25.83 x̃: 6 HURT stats (rel) min: 0.23% max: 5.18% x̄: 1.53% x̃: 1.11% 95% mean confidence interval for instructions value: -101.13 -55.45 95% mean confidence interval for instructions %-change: -2.61% -1.44% Instructions are helped.
total cycles in shared programs: 338390770 -> 333530868 (-1.44%) cycles in affected programs: 79438330 -> 74578428 (-6.12%) helped: 112 HURT: 64 helped stats (abs) min: 2 max: 268955 x̄: 44261.93 x̃: 1452 helped stats (rel) min: <.01% max: 29.51% x̄: 4.72% x̃: 2.23% HURT stats (abs) min: 2 max: 17618 x̄: 1522.41 x̃: 84 HURT stats (rel) min: <.01% max: 7.34% x̄: 1.35% x̃: 0.34% 95% mean confidence interval for cycles value: -37232.47 -17993.69 95% mean confidence interval for cycles %-change: -3.37% -1.65% Cycles are helped.
total spills in shared programs: 8944 -> 8138 (-9.01%) spills in affected programs: 3240 -> 2434 (-24.88%) helped: 67 HURT: 0
total fills in shared programs: 9373 -> 7842 (-16.33%) fills in affected programs: 4736 -> 3205 (-32.33%) helped: 67 HURT: 0
LOST: 1 GAINED: 2

Ice Lake and Skylake had similar results. (Ice Lake shown)
total instructions in shared programs: 16123288 -> 16116876 (-0.04%) instructions in affected programs: 241155 -> 234743 (-2.66%) helped: 126 HURT: 2 helped stats (abs) min: 1 max: 209 x̄: 50.90 x̃: 7 helped stats (rel) min: 0.07% max: 5.94% x̄: 1.76% x̃: 0.65% HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1 HURT stats (rel) min: 0.05% max: 0.24% x̄: 0.15% x̃: 0.15% 95% mean confidence interval for instructions value: -61.29 -38.89 95% mean confidence interval for instructions %-change: -2.05% -1.42% Instructions are helped.
total cycles in shared programs: 335419163 -> 330438819 (-1.48%) cycles in affected programs: 77515502 -> 72535158 (-6.42%) helped: 139 HURT: 37 helped stats (abs) min: 2 max: 269140 x̄: 36374.19 x̃: 597 helped stats (rel) min: <.01% max: 28.60% x̄: 3.67% x̃: 1.31% HURT stats (abs) min: 4 max: 17618 x̄: 2045.08 x̃: 174 HURT stats (rel) min: 0.02% max: 8.32% x̄: 2.61% x̃: 0.62% 95% mean confidence interval for cycles value: -37799.30 -18795.51 95% mean confidence interval for cycles %-change: -3.13% -1.57% Cycles are helped.
total spills in shared programs: 8065 -> 7306 (-9.41%) spills in affected programs: 3153 -> 2394 (-24.07%) helped: 67 HURT: 0
total fills in shared programs: 8710 -> 7412 (-14.90%) fills in affected programs: 4466 -> 3168 (-29.06%) helped: 67 HURT: 0
LOST: 1 GAINED: 1

Broadwell
total instructions in shared programs: 14970538 -> 14965967 (-0.03%) instructions in affected programs: 227040 -> 222469 (-2.01%) helped: 126 HURT: 2 helped stats (abs) min: 1 max: 136 x̄: 36.29 x̃: 8 helped stats (rel) min: 0.07% max: 6.02% x̄: 1.47% x̃: 0.89% HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1 HURT stats (rel) min: 0.05% max: 0.24% x̄: 0.14% x̃: 0.14% 95% mean confidence interval for instructions value: -43.05 -28.37 95% mean confidence interval for instructions %-change: -1.69% -1.19% Instructions are helped.
total cycles in shared programs: 336237662 -> 333035960 (-0.95%) cycles in affected programs: 72066394 -> 68864692 (-4.44%) helped: 134 HURT: 42 helped stats (abs) min: 4 max: 122640 x̄: 24344.54 x̃: 1833 helped stats (rel) min: <.01% max: 26.93% x̄: 4.02% x̃: 2.38% HURT stats (abs) min: 1 max: 17205 x̄: 1439.69 x̃: 92 HURT stats (rel) min: <.01% max: 7.12% x̄: 1.34% x̃: 0.62% 95% mean confidence interval for cycles value: -23753.58 -12629.40 95% mean confidence interval for cycles %-change: -3.50% -1.98% Cycles are helped.
total spills in shared programs: 21122 -> 20204 (-4.35%) spills in affected programs: 3644 -> 2726 (-25.19%) helped: 67 HURT: 0
total fills in shared programs: 24879 -> 23460 (-5.70%) fills in affected programs: 4883 -> 3464 (-29.06%) helped: 67 HURT: 0

Haswell
total instructions in shared programs: 13148269 -> 13145444 (-0.02%) instructions in affected programs: 137046 -> 134221 (-2.06%) helped: 97 HURT: 3 helped stats (abs) min: 1 max: 137 x̄: 30.58 x̃: 3 helped stats (rel) min: 0.14% max: 4.38% x̄: 1.38% x̃: 0.44% HURT stats (abs) min: 1 max: 70 x̄: 47.00 x̃: 70 HURT stats (rel) min: 0.05% max: 5.82% x̄: 3.90% x̃: 5.82% 95% mean confidence interval for instructions value: -37.15 -19.35 95% mean confidence interval for instructions %-change: -1.56% -0.89% Instructions are helped.
total cycles in shared programs: 321221834 -> 318333159 (-0.90%) cycles in affected programs: 54932349 -> 52043674 (-5.26%) helped: 95 HURT: 53 helped stats (abs) min: 4 max: 123390 x̄: 30648.39 x̃: 702 helped stats (rel) min: <.01% max: 28.87% x̄: 4.27% x̃: 2.87% HURT stats (abs) min: 4 max: 2357 x̄: 432.49 x̃: 113 HURT stats (rel) min: <.01% max: 3.44% x̄: 1.03% x̃: 0.54% 95% mean confidence interval for cycles value: -26154.16 -12881.99 95% mean confidence interval for cycles %-change: -3.20% -1.55% Cycles are helped.
total spills in shared programs: 19878 -> 19293 (-2.94%) spills in affected programs: 3020 -> 2435 (-19.37%) helped: 41 HURT: 2
total fills in shared programs: 20918 -> 19875 (-4.99%) fills in affected programs: 3968 -> 2925 (-26.29%) helped: 41 HURT: 2
LOST: 0 GAINED: 1

Ivy Bridge
total instructions in shared programs: 11875585 -> 11873641 (-0.02%) instructions in affected programs: 78065 -> 76121 (-2.49%) helped: 27 HURT: 0 helped stats (abs) min: 8 max: 134 x̄: 72.00 x̃: 72 helped stats (rel) min: 0.36% max: 4.23% x̄: 2.42% x̃: 2.42% 95% mean confidence interval for instructions value: -83.68 -60.32 95% mean confidence interval for instructions %-change: -2.78% -2.07% Instructions are helped.
total cycles in shared programs: 178232734 -> 175769085 (-1.38%) cycles in affected programs: 50018707 -> 47555058 (-4.93%) helped: 27 HURT: 0 helped stats (abs) min: 82035 max: 99953 x̄: 91246.26 x̃: 92278 helped stats (rel) min: 4.40% max: 5.69% x̄: 4.93% x̃: 4.95% 95% mean confidence interval for cycles value: -93674.20 -88818.32 95% mean confidence interval for cycles %-change: -5.09% -4.78% Cycles are helped.
total spills in shared programs: 4182 -> 3739 (-10.59%) spills in affected programs: 1089 -> 646 (-40.68%) helped: 27 HURT: 0
total fills in shared programs: 5216 -> 4345 (-16.70%) fills in affected programs: 1874 -> 1003 (-46.48%) helped: 27 HURT: 0

No changes on any earlier Intel platforms. Reviewed-by: Matt Turner <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4156>
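For reference, these are the bitfield-extract semantics being split into shifts when both offset and bits are constant. This is a generic reference implementation (assuming offset + bits <= 32), not the exact NIR lowering, which also has to respect hardware shift-count behavior, hence the "-x instead of 32-x" note above:

    #include <cstdint>

    /* ubfe(value, offset, bits): extract 'bits' bits starting at bit
     * 'offset', zero-extended.  With constant offset/bits this is just a
     * pair of shifts. */
    static uint32_t ubfe(uint32_t value, unsigned offset, unsigned bits)
    {
       if (bits == 0)
          return 0;
       return (value << (32 - bits - offset)) >> (32 - bits);
    }

    /* ibfe is identical except the final shift is arithmetic, so the
     * extracted field is sign-extended. */
    static int32_t ibfe(int32_t value, unsigned offset, unsigned bits)
    {
       if (bits == 0)
          return 0;
       return (int32_t)((uint32_t)value << (32 - bits - offset)) >> (32 - bits);
    }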
* intel: Let drivers call brw_nir_lower_cs_intrinsics()Caio Marcelo de Oliveira Filho2020-05-011-2/+0
The motivating factor is: this lowering may cause nir_intrinsic_load_local_group_size intrinsics to be added to the shader, and by moving this around we make it possible for the drivers to lower that intrinsic by themselves. Iris will do just that in a later patch for implementing variable group size. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
* intel/fs: Add and use a new load_simd_width_intel intrinsicCaio Marcelo de Oliveira Filho2020-05-014-34/+73
Add an intrinsic to get the SIMD width, which is not always the same as the subgroup size. Starting with a small scope (Intel), but we can rename it later to generalize it if this turns out useful for other drivers. Change brw_nir_lower_cs_intrinsics() to use this intrinsic instead of a width passed as argument. The pass also used to optimize load_subgroup_id for the case where the workgroup fits into a single thread (it becomes constant zero); that optimization moved together with the lowering of the SIMD width. This is a preparation for letting the drivers call it before the brw_compile_cs() step. No shader-db changes in BDW, SKL, ICL and TGL. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
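The arithmetic being lowered is simple; as a sketch (not the literal pass code), the subgroup ID is the invocation index divided by the SIMD width, which constant-folds to zero whenever the whole workgroup fits in a single hardware thread:

    /* Each HW thread runs one subgroup of 'simd_width' invocations, so: */
    static unsigned subgroup_id(unsigned local_invocation_index, unsigned simd_width)
    {
       return local_invocation_index / simd_width;   /* 0 if group_size <= simd_width */
    }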
* intel/fs: Add an option to lower variable group size in backendCaio Marcelo de Oliveira Filho2020-05-012-1/+10
| | | | | | | | | Adding this since Iris will handle variable group size parameters by itself. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
* intel/fs: Clean up variable group size handling in backendCaio Marcelo de Oliveira Filho2020-05-013-8/+4
| | | | | | | | Just use the information from NIR shader_info. Reviewed-by: Kenneth Graunke <[email protected]> Reviewed-by: Jordan Justen <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4794>
* anv,iris: Fix input vertex max for tcs on gen12D Scott Phillips2020-05-011-1/+1
| | | | | | | | | | | | | gen12 does away with the single patch dispatch mode for tcs, and increases some limits so that 8_patch mode can always work. Make the necessary changes so we don't try to fall back to single patch mode. Fixes KHR-GL46.tessellation_shader.single.max_patch_vertices and others Fixes: 44754279ace7 ("intel/fs/gen12: Use TCS 8_PATCH mode.") Reviewed-by: Kenneth Graunke <[email protected]> Acked-by: Jason Ekstrand <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4843>
* intel/fs: Update location of Render Target Array Index for gen12D Scott Phillips2020-05-011-1/+9
| | | | | | | | | | | | Render Target Array Index has moved from R0.0[26:16] to R1.1[26:16] on gen12. Fixes dEQP-VK.multiview.input_attachments.* Cc: <[email protected]> Reviewed-by: Francisco Jerez <[email protected]> Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4836>
* intel/eu: Use non-coherent mode (BTI=253) for stateless A64 messagesJason Ekstrand2020-04-302-10/+39
| | | | | | | | | | We don't care about full IA coherency since we always have the opportunity in GL or Vulkan to flush the data cache. Using IA-coherent mode is likely just making A64 access slower than it needs to be. Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Reviewed-by: Kenneth Graunke <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4819>
* intel/ir: Update performance analysis parameters for memory fence codegen changes.Francisco Jerez2020-04-291-4/+18
The SFID field of the SHADER_OPCODE_MEMORY_FENCE and SHADER_OPCODE_INTERLOCK instructions now indicates the target function of the memory fence. Account the cycle-count cost to the right shared unit. Fixes: f858fa26b4cca8834c8687f01d2ba431fcc8e006 ("intel/fs,vec4: Pull stall logic for memory fences up into the IR") Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4817>
* intel/fs: Don't delete coalesced MOVs if they have a cmodJason Ekstrand2020-04-291-2/+15
Shader-db results on ICL:
total instructions in shared programs: 17133088 -> 17133287 (<.01%) instructions in affected programs: 61300 -> 61499 (0.32%) helped: 0 HURT: 199
This means it's likely fixing 199 bugs. :-) All the changed shaders are in Mad Max. It's surprisingly difficult to get the back-end compiler to generate a pattern that hits this, since we don't tend to emit a lot of coalescable MOVs. The pattern in Mad Max that's able to hit it is fsign(fsat(x)) under the right conditions. Closes: #2820 Cc: [email protected] Tested-by: Ian Romanick <[email protected]> Reviewed-by: Ian Romanick <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4773>
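Conceptually the fix is an extra legality check in register coalescing, roughly like the sketch below; the function name is made up for illustration, and the real pass checks more conditions than this:

    /* A MOV with a conditional modifier also writes the flag register, so
     * coalescing it away would silently drop the flag write. */
    static bool
    mov_is_coalesce_candidate(const fs_inst *inst)
    {
       if (inst->opcode != BRW_OPCODE_MOV)
          return false;
       if (inst->conditional_mod != BRW_CONDITIONAL_NONE)
          return false;
       return true;
    }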
* intel/fs: Only stall after sending all memory fence messagesCaio Marcelo de Oliveira Filho2020-04-291-19/+16
In Gen11+, when emitting a fence for both L3 and SLM, the generated code would look like
SEND, MOV (for stall), SEND, MOV (for stall)
This commit changes that so the two SENDs are emitted before the MOVs for the stall. This is similar to the approach used in Ivy Bridge for the render fence. Reviewed-by: Francisco Jerez <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
* intel/fs,vec4: Pull stall logic for memory fences up into the IRCaio Marcelo de Oliveira Filho2020-04-297-109/+118
Instead of emitting the stall MOV "inside" the SHADER_OPCODE_MEMORY_FENCE generation, use the scheduling fences when creating the IR.
For IvyBridge, every (data cache) fence is accompanied by a render cache fence that is now explicit in the IR, i.e. two SHADER_OPCODE_MEMORY_FENCEs are emitted (with different SFIDs).
Because Begin and End interlock intrinsics are effectively memory barriers, move their handling alongside the other memory barrier intrinsics. The SHADER_OPCODE_INTERLOCK is still used to distinguish whether we are going to use a SENDC (for Begin) or a regular SEND (for End).
This change is a preparation to allow emitting both SENDs in Gen11+ before we can stall on them.

Shader-db results for IVB (i965):
total instructions in shared programs: 11971190 -> 11971200 (<.01%) instructions in affected programs: 11482 -> 11492 (0.09%) helped: 0 HURT: 8 HURT stats (abs) min: 1 max: 3 x̄: 1.25 x̃: 1 HURT stats (rel) min: 0.03% max: 0.50% x̄: 0.14% x̃: 0.10% 95% mean confidence interval for instructions value: 0.66 1.84 95% mean confidence interval for instructions %-change: 0.01% 0.27% Instructions are HURT.
Unlike the previous code, which used the `mov g1 g2` trick to force both `g1` and `g2` to stall, the scheduling fence will generate `mov null g1` and `mov null g2`. During review it was decided it was not worth keeping the special codepath for the small effect it would have.

Shader-db results for HSW (i965), BDW and SKL don't show a change in instruction count, but do report changes in cycle count; SKL results shown below:
total cycles in shared programs: 341738444 -> 341710570 (<.01%) cycles in affected programs: 7240002 -> 7212128 (-0.38%) helped: 46 HURT: 5 helped stats (abs) min: 14 max: 1940 x̄: 676.22 x̃: 154 helped stats (rel) min: <.01% max: 2.62% x̄: 1.28% x̃: 0.95% HURT stats (abs) min: 2 max: 1768 x̄: 646.40 x̃: 362 HURT stats (rel) min: <.01% max: 0.83% x̄: 0.28% x̃: 0.08% 95% mean confidence interval for cycles value: -777.71 -315.38 95% mean confidence interval for cycles %-change: -1.42% -0.83% Cycles are helped.
This seems to be the effect of allocating two registers separately instead of a single one with size 2, which causes different register allocation, affecting the cycle estimates.

ICL also has no change in instruction count but reports negative changes in cycles:
total cycles in shared programs: 352665369 -> 352707484 (0.01%) cycles in affected programs: 9608288 -> 9650403 (0.44%) helped: 4 HURT: 104 helped stats (abs) min: 24 max: 128 x̄: 88.50 x̃: 101 helped stats (rel) min: <.01% max: 0.85% x̄: 0.46% x̃: 0.49% HURT stats (abs) min: 2 max: 2016 x̄: 408.36 x̃: 48 HURT stats (rel) min: <.01% max: 3.31% x̄: 0.88% x̃: 0.45% 95% mean confidence interval for cycles value: 256.67 523.24 95% mean confidence interval for cycles %-change: 0.63% 1.03% Cycles are HURT.
AFAICT this is the result of the case above.

Shader-db results for TGL have similar cycle results as ICL, but also affect instructions:
total instructions in shared programs: 17690586 -> 17690597 (<.01%) instructions in affected programs: 64617 -> 64628 (0.02%) helped: 55 HURT: 32 helped stats (abs) min: 1 max: 16 x̄: 4.13 x̃: 3 helped stats (rel) min: 0.05% max: 2.78% x̄: 0.86% x̃: 0.74% HURT stats (abs) min: 1 max: 65 x̄: 7.44 x̃: 2 HURT stats (rel) min: 0.05% max: 4.58% x̄: 1.13% x̃: 0.69% 95% mean confidence interval for instructions value: -2.03 2.28 95% mean confidence interval for instructions %-change: -0.41% 0.15% Inconclusive result (value mean confidence interval includes 0).
Now that more is done in the IR, more dependencies are visible and more SWSB annotations are emitted. Mixed with different register allocation decisions like the above, some shaders will see more `sync nops` while others are able to avoid them. Most of the new `sync nops` are also redundant and could be dropped, which will be fixed in a separate change.

Reviewed-by: Francisco Jerez <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
* intel/fs: Allow FS_OPCODE_SCHEDULING_FENCE stall on registersCaio Marcelo de Oliveira Filho2020-04-292-2/+30
| | | | | | | It will generate the MOVs (or SYNC_NOP in Gen12+) needed for stall. Reviewed-by: Francisco Jerez <[email protected]> Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/3278>
* intel/ir: Remove scheduling-based cycle count estimates.Francisco Jerez2020-04-283-26/+1
| | | | | | | | | | | | | The cycle count estimation logic part of the scheduler is now redundant with the shader performance modeling pass, and the estimates can be consolidated into the brw::performance analysis result object instead of being part of the CFG, which guarantees that the estimates cannot be accessed without previously calling the performance_analysis::require() method, which makes sure that the right analysis pass is executed at the right time if we don't already have up-to-date cached results. Reviewed-by: Kenneth Graunke <[email protected]>
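The require() idiom mentioned above is the standard cached-analysis pattern; a generic sketch of the idea (not the actual brw:: class layout):

    // Results are only reachable through require(), which re-runs the
    // analysis whenever the cached result is missing or has been invalidated
    // by an IR change.
    template<typename Result, typename Shader>
    class cached_analysis {
    public:
       explicit cached_analysis(const Shader *s) : shader(s) {}

       const Result &require()
       {
          if (!valid) {
             result = Result(shader);   // run the analysis pass now
             valid = true;
          }
          return result;
       }

       void invalidate() { valid = false; }   // call after modifying the IR

    private:
       const Shader *shader;
       Result result{};
       bool valid = false;
    };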
* intel/ir: Pass block cycle count information explicitly to disassembler.Francisco Jerez2020-04-285-6/+11
| | | | | | | | | | | | So we can eventually remove the cycle count estimates from the CFG data structure and consolidate performance information in the brw::performance object. It would be cleaner to pass the brw::performance object directly to the disassembler but that isn't straightforward since the disassembler is built as a plain C file unlike the rest of the compiler back-end. Reviewed-by: Kenneth Graunke <[email protected]>
* intel/ir: Use brw::performance object instead of CFG cycle counts for codegen stats.Francisco Jerez2020-04-289-22/+45
These should be more accurate than the current cycle counts, since among other things they consider the effect of post-scheduling passes like the software scoreboard on TGL. In addition it will enable us to clean up some of the now redundant cycle-count estimation functionality in the instruction scheduler. Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Add INTEL_DEBUG=no32 debugging flag.Francisco Jerez2020-04-281-1/+2
| | | | | | | | | | | This is useful in order to identify codegen issues caused by SIMD32. It doesn't currently have any effect on compute shaders since SIMD32 dispatch is only enabled for CS when it's strictly necessary to do so in order to support the workgroup size requested for the shader -- That might change in the future though when we hook up the SIMD32 heuristic to CS compilation. Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Implement performance analysis-based SIMD32 heuristic for fragment shaders.Francisco Jerez2020-04-281-7/+17
The heuristic enables the SIMD32 fragment shader based on whether the IR performance modeling pass predicts it to have greater throughput than the SIMD16 and SIMD8 variants of the same shader. It would be straightforward to do the same thing in order to control whether SIMD16 dispatch is enabled, but it's pending additional performance evaluation. The INTEL_DEBUG=do32 option is left around in order to force the SIMD32 shader to be used regardless of the result of the heuristic, since it's useful as a debugging aid e.g. in order to identify SIMD32-specific codegen issues which may be masked by the SIMD32 heuristic, or cases where the heuristic is incorrectly disabling SIMD32 shaders that offer a performance advantage. Currently this is only enabled on Gen6+, since SIMD32 codegen support is incomplete on earlier platforms. Reviewed-by: Kenneth Graunke <[email protected]>
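In essence the decision reduces to a throughput comparison between the compiled variants, with the debug flag as an override; a simplified sketch (the real code reads the brw::performance results, whose field names may differ):

    /* Enable SIMD32 only when the model predicts a throughput win over the
     * best narrower variant; INTEL_DEBUG=do32 forces it on regardless. */
    static bool
    use_simd32(float simd16_throughput, float simd32_throughput, bool debug_do32)
    {
       if (debug_do32)
          return true;
       return simd32_throughput > simd16_throughput;
    }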
* intel/fs: Heap-allocate fs_visitors in brw_compile_fs().Francisco Jerez2020-04-281-38/+39
| | | | | | | | | | | | | | This makes brw_compile_fs() look a bit more similar to brw_compile_cs(). It saves us three v*_shader_stats local variables, and will save us additional triplicated declarations as we start tracking IR performance analysis results. The triplicated cfg pointers are left around because they're set to NULL to mark specific dispatch modes as disabled (e.g. in order to enforce hardware restrictions). Doing the same thing with the visitor pointers would cause data leaks. Reviewed-by: Kenneth Graunke <[email protected]>
* intel/ir: Import shader performance analysis pass.Francisco Jerez2020-04-287-1/+1658
This introduces an analysis pass intended to estimate several performance statistics of the shader, including cycle count latency and throughput values, based on static modeling. It has instruction performance information more comprehensive than the current scheduling pass for all platforms between Gen4-11, and works on both the FS and VEC4 back-end.
The most immediate purpose of this pass is to implement a heuristic meant to determine whether using SIMD32 dispatch for a fragment shader can be expected to help more than it hurts. In addition this will allow the effect of passes run after scheduling (e.g. the TGL software scoreboard pass and the VEC4 dependency control pass) to be visible in shader-db statistics.
But that isn't the end of the story; other potential applications of this pass (not part of this MR) I've been playing around with are:
- Implement a similar SIMD16 heuristic allowing the identification of inefficient SIMD16 fragment shaders.
- Implement similar SIMD16 and SIMD32 heuristics for the compute shader stage -- Currently compute shader builds always use the SIMD16 shader if available and never use the SIMD32 shader unless strictly necessary, which is suboptimal under certain conditions.
- Hook up to the instruction scheduler in order to improve the accuracy of its timing information.
- Use as heuristic in order to drive the selection of scheduling modes (Matt was experimenting with that).
- Plug to the TGL software scoreboard pass in order to implement a more effective SBID token allocation algorithm, since in general the optimal token allocation depends on the timings of all instructions in the program.
- Use its bottleneck detection functionality in order to implement a heuristic computing a more optimal bound for the number of fragment shader threads executed in parallel (by adjusting the MaximumNumberofThreadsPerPSD control of 3DSTATE_PS).
As a follow-up I'm planning to submit updated timing information for Gen12 platforms -- Everything else required to support Gen12 like SWSB handling is already included in this patch, but there were some IP concerns regarding the TGL timing parameters since they cannot currently be obtained with the documentation and hardware which is publicly available. The timing parameters for any previous Gen7-11 platforms can be obtained by anyone by sampling the timestamp register using e.g. shader_time, though I have some more convenient instrumentation coming up.
Reviewed-by: Kenneth Graunke <[email protected]>
* intel/vec4: Fix constness of vec4_instruction::reads_flag() and ::writes_flag().Francisco Jerez2020-04-281-2/+2
| | | | Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Replace fs_visitor::bank_conflict_cycles() with stand-alone function.Francisco Jerez2020-04-284-17/+17
| | | | | | This will be re-usable by the IR performance analysis pass. Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Fix constness of argument of fs_instruction_scheduler::is_compressed().Francisco Jerez2020-04-281-2/+2
Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Rename half() helpers to quarter(), allow index up to 3.Francisco Jerez2020-04-284-14/+14
| | | | | | | | Makes more sense considering SIMD32. Relaxing the assertion in brw_ir_fs.h will be required in order to avoid assertion failures on SNB with SIMD32 fragment shaders. Reviewed-by: Kenneth Graunke <[email protected]>