path: root/src/intel/compiler
Commit message | Author | Date | Files | Lines
* intel/compiler: fix ddx and ddy for 16-bit float | Iago Toral Quiroga | 2019-04-18 | 1 | -19/+18
    We were assuming 32-bit elements. Also, in SIMD8 we pack 2 vector
    components in a single SIMD register, so for example, component Y of a
    16-bit vec2 starts at byte offset 16B. This means that when we compute
    the offset of the elements to be differentiated we should not stomp
    whatever base offset we have, but instead add to it.

    v2:
    - Use byte_offset() helper (Jason)
    - Merge the fix for SIMD8: using byte_offset() fixes that too.

    Reviewed-by: Jason Ekstrand <[email protected]> (v1)
    Reviewed-by: Matt Turner <[email protected]>
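    For reference, a minimal sketch of the offset arithmetic described
    above, assuming an 8-wide dispatch and 2-byte elements; the helper
    name is hypothetical, and only the byte_offset() helper mentioned in
    the message exists in the backend:

      /* Component c of a 16-bit source in SIMD8 spans
       * exec_size * type_size = 8 * 2 = 16 bytes, so it starts at
       * base + c * 16: add to the base offset rather than replacing it. */
      static unsigned
      component_byte_offset(unsigned base, unsigned component,
                            unsigned exec_size, unsigned type_size)
      {
         return base + component * exec_size * type_size;
      }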
* intel/compiler: set correct precision fields for 3-source float instructions | Iago Toral Quiroga | 2019-04-18 | 1 | -0/+16
    Source0 and Destination extract the floating-point precision
    automatically from the SrcType and DstType instruction fields
    respectively when they are set to types :F or :HF. For Source1 and
    Source2 operands, we use the new 1-bit fields Src1Type and Src2Type,
    where 0 means normal precision and 1 means half-precision. Since we
    always use the type of the destination for all operands when we emit
    3-source instructions, we only need to set Src1Type and Src2Type to 1
    when we are emitting a half-precision instruction.

    v2:
    - Set the bit separately for each source based on its type so we can
      do mixed floating-point mode in the future (Topi).

    v3:
    - Use regular citation style for the comment referencing the PRM (Matt).
    - Decided not to add asserts in the emission code to check that only
      mixed HF/F types are used since such checks would break negative
      tests for brw_eu_validate.c (Matt)

    Reviewed-by: Topi Pohjolainen <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/compiler: allow half-float on 3-source instructions since gen8 | Iago Toral Quiroga | 2019-04-18 | 1 | -1/+2
    Reviewed-by: Topi Pohjolainen <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/compiler: don't compact 3-src instructions with Src1Type or Src2Type bits | Iago Toral Quiroga | 2019-04-18 | 1 | -1/+4
    We are now using these bits, so don't assert that they are not set. In
    gen8, if these bits are set, compaction is not possible. On gen9 and
    CHV platforms set_3src_control_index() checks these bits (and others)
    against a table to validate if the particular bit combination is
    eligible for compaction or not.

    v2:
    - Add more detail in the commit message explaining the situation for
      SKL+ and CHV (Jason)

    Reviewed-by: Topi Pohjolainen <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
* intel/compiler: add new half-float register type for 3-src instructions | Iago Toral Quiroga | 2019-04-18 | 1 | -0/+4
    This is available since gen8.

    v2: restore previously existing assertion.
    v3: don't use separate tables for gen7 and gen8, just assert that we
        don't use half-float before gen8 (Matt)

    Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: add instruction setters for Src1Type and Src2Type. | Iago Toral Quiroga | 2019-04-18 | 1 | -0/+2
    The original SrcType is a 3-bit field that takes a subset of the types
    supported by the hardware for 3-source instructions. Since gen8, when
    the half-float type was added, 3-source floating point operations can
    use mixed precision mode, where not all the operands have the same
    floating-point precision. While the precision for the first operand is
    taken from the type in SrcType, the bits in Src1Type (bit 36) and
    Src2Type (bit 35) define the precision for the other operands
    (0: normal precision, 1: half precision).

    Reviewed-by: Topi Pohjolainen <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
    Acked-by: Jason Ekstrand <[email protected]>
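    For illustration, a minimal sketch of what flipping these single-bit
    fields amounts to; the helper is hypothetical and is not the brw_inst
    setter API the commit adds:

      #include <stdint.h>
      #include <stdbool.h>

      /* A 3-src instruction is four 32-bit dwords; Src1Type and Src2Type
       * live at bits 36 and 35 of that 128-bit word (bits 4 and 3 of the
       * second dword). 0 = normal precision, 1 = half precision. */
      static void
      set_3src_bit(uint32_t dw[4], unsigned bit, bool value)
      {
         if (value)
            dw[bit / 32] |= 1u << (bit % 32);
         else
            dw[bit / 32] &= ~(1u << (bit % 32));
      }

      /* set_3src_bit(dw, 36, true);  -- src1 is half precision */
      /* set_3src_bit(dw, 35, true);  -- src2 is half precision */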
* intel/compiler: drop unnecessary temporary from 32-bit fsign implementation | Iago Toral Quiroga | 2019-04-18 | 1 | -3/+2
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: implement 16-bit fsign | Iago Toral Quiroga | 2019-04-18 | 1 | -1/+16
    v2:
    - make 16-bit be its own separate case (Jason)

    v3:
    - Drop the result_int temporary (Jason)

    Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: handle extended math restrictions for half-float | Iago Toral Quiroga | 2019-04-18 | 3 | -12/+34
    Extended math with half-float operands is only supported since gen9,
    but it is limited to SIMD8. In gen8 we lower it to 32-bit.

    v2: squashed together the following patches (Jason):
    - intel/compiler: allow extended math functions with HF operands
    - intel/compiler: lower 16-bit extended math to 32-bit prior to gen9
    - intel/compiler: extended Math is limited to SIMD8 on half-float

    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Topi Pohjolainen <[email protected]>
      (allow extended math functions with HF operands,
       extended Math is limited to SIMD8 on half-float)
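    A conceptual sketch of the gen8 path, assuming stand-in names
    (f2f32/f2f16 for the conversion instructions, rsq32 for a 32-bit
    extended-math opcode); the real lowering operates on backend IR, not
    on C floats:

      #include <math.h>

      typedef float half_emu;   /* placeholder for a 16-bit float value */

      static float    f2f32(half_emu h) { return (float)h; }
      static half_emu f2f16(float f)    { return (half_emu)f; }
      static float    rsq32(float x)    { return 1.0f / sqrtf(x); }

      /* Half-float extended math on gen8: promote the operand, do the
       * math in 32-bit, demote the result back to half precision. */
      static half_emu
      rsq_hf_lowered(half_emu x)
      {
         return f2f16(rsq32(f2f32(x)));
      }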
* intel/compiler: lower some 16-bit float operations to 32-bit | Iago Toral Quiroga | 2019-04-18 | 1 | -0/+5
    The hardware doesn't support half-float for these.

    Reviewed-by: Topi Pohjolainen <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: assert restrictions on conversions to half-float | Iago Toral Quiroga | 2019-04-18 | 1 | -2/+3
    There are some hardware restrictions that brw_nir_lower_conversions
    should have taken care of before we get here.

    v2:
    - rebased on top of regioning lowering pass

    Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: handle b2i/b2f with other integer conversion opcodes | Iago Toral Quiroga | 2019-04-18 | 1 | -8/+8
    Since we handle booleans as integers this makes more sense.

    v2:
    - rebased to incorporate new boolean conversion opcodes

    v3:
    - rebased on top of regioning lowering pass

    Reviewed-by: Jason Ekstrand <[email protected]> (v1)
    Reviewed-by: Topi Pohjolainen <[email protected]> (v2)
* intel/compiler: split float to 64-bit opcodes from int to 64-bit | Iago Toral Quiroga | 2019-04-18 | 1 | -0/+7
    Going forward, having these split is a bit more convenient since these
    two groups have different restrictions.

    v2:
    - Rebased on top of new regioning lowering pass.

    Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: add a NIR pass to lower conversions | Iago Toral Quiroga | 2019-04-18 | 4 | -0/+174
    Some conversions are not directly supported in hardware and need to be
    split in two conversion instructions going through an intermediary
    type. Doing this at the NIR level reduces the complexity in the
    backend a bit.

    v2:
    - Consider fp16 rounding conversion opcodes
    - Properly handle swizzles on conversion sources.

    v3:
    - Run the pass earlier, right after nir_opt_algebraic_late (Jason)
    - NIR alu output types already have the bit-size (Jason)
    - Use 'is_conversion' to identify conversion operations (Jason)

    v4:
    - Be careful about the intermediate types we use so we don't lose
      range and avoid incorrect rounding semantics (Jason)

    Reviewed-by: Topi Pohjolainen <[email protected]> (v1)
    Reviewed-by: Jason Ekstrand <[email protected]>
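    A simplified sketch of the idea, under the assumption that the cases
    needing a split are conversions between 64-bit types and types
    narrower than 32 bits (the function name is hypothetical, not the
    pass's code); the intermediate is chosen wide enough that no range is
    lost before the final rounding step:

      /* Returns the intermediate bit size to convert through, or 0 if
       * the conversion can be done directly. */
      static unsigned
      intermediate_bit_size(unsigned src_bits, unsigned dst_bits)
      {
         if ((src_bits == 64 && dst_bits < 32) ||
             (dst_bits == 64 && src_bits < 32))
            return 32;
         return 0;
      }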
* intel/compiler/icl: Use tcs barrier id bits 24:30 instead of 24:27 | Topi Pohjolainen | 2019-04-17 | 1 | -7/+17
    Similarly to 1cc17fb731466c68586915acbb916586457b19bc

    Fixes gpu hangs with dEQP-VK.tessellation.shader_input_output.barrier

    Reviewed-by: Anuj Phogat <[email protected]>
    Signed-off-by: Topi Pohjolainen <[email protected]>
* i965: Move program key debugging to the compiler. | Kenneth Graunke | 2019-04-16 | 3 | -0/+238
    The i965 driver has a bunch of code to compare two sets of program
    keys and print out the differences. This can be useful for debugging
    why a shader needed to be recompiled on the fly due to non-orthogonal
    state dependencies. anv doesn't do recompiles, so we didn't need to
    share this in the past - but I'd like to use it in iris.

    This moves the bulk of the code to the compiler where it can be
    reused. To make that possible, we need to decouple it from i965 - we
    can't get at the brw program cache directly, nor use brw_context to
    print things. Instead, we use compiler->shader_perf_log(), and simply
    pass in keys.

    We put all of this debugging code in brw_debug_recompile.c, and only
    export a single function, for simplicity. I also tidied the code a bit
    while moving it, now that it all lives in one file.

    Reviewed-by: Jordan Justen <[email protected]>
* intel/compiler: Do not reswizzle dst if instruction writes to flag register | Danylo Piliaiev | 2019-04-16 | 1 | -0/+6
    If we write to the flag register, changing the swizzle would change
    which channels are written to the flag register.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110201
    Fixes: 4cd1a0be
    Signed-off-by: Danylo Piliaiev <[email protected]>
    Reviewed-by: <[email protected]>
* nir: make nir_const_value scalar | Karol Herbst | 2019-04-14 | 3 | -16/+16
    v2: remove & operator in a couple of memsets
        add some memsets
    v3: fixup lima

    Signed-off-by: Karol Herbst <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]> (v2)
* nir/builder: Add a nir_imm_zero helper | Jason Ekstrand | 2019-04-14 | 1 | -10/+1
    v2: replace nir_zero_vec with nir_imm_zero (Karol Herbst)

    Reviewed-by: Karol Herbst <[email protected]>
* intel/nir: use nir_src_is_const and nir_src_as_uint | Karol Herbst | 2019-04-14 | 1 | -6/+4
    Signed-off-by: Karol Herbst <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/nir: Take a nir_tex_instr and src index in brw_texture_offset | Jason Ekstrand | 2019-04-14 | 4 | -27/+21
    This makes things a bit simpler and it's also more robust because it
    no longer has a hard dependency on the offset being a 32-bit value.
* intel/fs: Remove unused condition from opt_algebraic case | Sagar Ghuge | 2019-04-12 | 1 | -5/+0
    We will never hit a condition where we have src1 and src2 as immediate
    operands.

    Signed-off-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
* nir/i965/freedreno/vc4: add a bindless bool to type size functions | Timothy Arceri | 2019-04-12 | 4 | -25/+30
    This is required to calculate sizes correctly when we have bindless
    samplers/images.

    Reviewed-by: Marek Olšák <[email protected]>
* nir: move brw_nir_rewrite_image_intrinsic into common code | Karol Herbst | 2019-04-12 | 1 | -41/+0
    Signed-off-by: Karol Herbst <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Marek Olšák <[email protected]>
* intel/common: move gen_debug to intel/dev | Mark Janes | 2019-04-10 | 16 | -16/+16
    libintel_common depends on libintel_compiler, but it contains debug
    functionality that is needed by libintel_compiler. Break the circular
    dependency by moving gen_debug files to libintel_dev.

    Suggested-by: Kenneth Graunke <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Use NIR_PASS_V when lowering CS intrinsics | Caio Marcelo de Oliveira Filho | 2019-04-08 | 1 | -3/+4
    This will make that step visible in NIR_PRINT=1.

    v2: Also use the macro for the cleanup passes.

    Reviewed-by: Ian Romanick <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: Don't loop when lowering CS intrinsics | Caio Marcelo de Oliveira Filho | 2019-04-08 | 1 | -15/+10
    This was needed when certain intrinsics were lowered to other ones
    that were defined by the same pass. After 060817b2 "intel,nir: Move
    gl_LocalInvocationID lowering to nir_lower_system_values" we don't
    need the loop anymore.

    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: Add support for CS to group invocations in quads | Caio Marcelo de Oliveira Filho | 2019-04-08 | 3 | -16/+103
    When using quads, instead of mapping the elements to the next 4 local
    invocation indices, we map the two next in the "current" row and two
    next in the "next" row. A side effect is that a thread will execute
    the indices in a different order.

    We now perform the lowering of both local invocation ID and index
    together -- and don't rely anymore on lowering done by
    nir_lower_system_values. That is convenient when doing the math for
    quads, because we need X and Y to get the right invocation index.

    When the pass progresses, fold the constants and clean up to reduce
    the noise from the indexing math.

    This implements the derivative_group_quadsNV semantics from
    NV_compute_shader_derivatives.

    v2: Take subgroup_id into account, otherwise only values in the first
        subgroup would be used. (Jason)

    v3: Calculate invocation index and ID together, to avoid duplicating
        some math in the quads case when both index and ID are used. (Jason)

    v4: Don't call cleanup passes as part of the lowering, leave that to
        the call site. (Jason)
        Change calculation to use fewer instructions. (Jason)

    Reviewed-by: Ian Romanick <[email protected]> (v3)
    Reviewed-by: Jason Ekstrand <[email protected]>
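    For illustration, one mapping from a linear local invocation index to
    (x, y) that realizes the quad grouping described above, assuming
    local_size_x is even; this is a sketch of the semantics, not the
    pass's emitted code:

      /* Consecutive groups of four indices 4n..4n+3 cover two neighbours
       * in the current row and two in the next row, forming 2x2 quads. */
      static void
      quad_xy_from_index(unsigned index, unsigned local_size_x,
                         unsigned *x, unsigned *y)
      {
         unsigned quad          = index / 4;
         unsigned quads_per_row = local_size_x / 2;

         *x = (quad % quads_per_row) * 2 + (index & 1);
         *y = (quad / quads_per_row) * 2 + ((index >> 1) & 1);
      }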
* intel/fs: Use TEX_LOGICAL whenever implicit lod is supported | Caio Marcelo de Oliveira Filho | 2019-04-08 | 1 | -2/+6
    Make sure we include compute shaders that have a derivative group
    defined.

    Reviewed-by: Jason Ekstrand <[email protected]>
* nir/radv: remove restrictions on opt_if_loop_last_continue() | Timothy Arceri | 2019-04-09 | 1 | -1/+1
    When I implemented opt_if_loop_last_continue() I had restricted this
    pass from moving other if-statements inside the branch opposite the
    continue. At the time it was causing a bunch of spilling in shader-db
    for i965.

    However Samuel Pitoiset noticed that making this pass more aggressive
    significantly improved the performance of Doom on RADV. Below are the
    statistics he gathered.

    28717 shaders in 14931 tests
    Totals:
    SGPRS: 1267317 -> 1267549 (0.02 %)
    VGPRS: 896876 -> 895920 (-0.11 %)
    Spilled SGPRs: 24701 -> 26367 (6.74 %)
    Code Size: 48379452 -> 48507880 (0.27 %) bytes
    Max Waves: 241159 -> 241190 (0.01 %)

    Totals from affected shaders:
    SGPRS: 23584 -> 23816 (0.98 %)
    VGPRS: 25908 -> 24952 (-3.69 %)
    Spilled SGPRs: 503 -> 2169 (331.21 %)
    Code Size: 2471392 -> 2599820 (5.20 %) bytes
    Max Waves: 586 -> 617 (5.29 %)

    The code size increase is related to Wolfenstein II; it seems largely
    due to an increase in phis rather than the existing jumps.

    This gives +10% FPS with Doom on my Vega56.

    Rhys Perry also benchmarked Doom on his VEGA64:
    Before: 72.53 FPS
    After:  80.77 FPS

    v2: disable pass on non-AMD drivers

    Reviewed-by: Ian Romanick <[email protected]> (v1)
    Acked-by: Samuel Pitoiset <[email protected]>
* intel/compiler: use defined size for vector components | Dave Airlie | 2019-04-03 | 1 | -1/+1
    If we increase vector sizing later it would be nice to avoid tripping
    over this again.

    Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/compiler: Use partial redundancy elimination for compares | Ian Romanick | 2019-03-28 | 1 | -0/+20
    Almost all of the hurt shaders are repeated instances of the same
    shader in synmark's compilation speed tests.

    shader-db results:

    All Gen6+ platforms had similar results. (Skylake shown)
    total instructions in shared programs: 15256840 -> 15256389 (<.01%)
    instructions in affected programs: 54137 -> 53686 (-0.83%)
    helped: 288
    HURT: 0
    helped stats (abs) min: 1 max: 15 x̄: 1.57 x̃: 1
    helped stats (rel) min: 0.06% max: 26.67% x̄: 1.99% x̃: 0.74%
    95% mean confidence interval for instructions value: -1.76 -1.38
    95% mean confidence interval for instructions %-change: -2.47% -1.50%
    Instructions are helped.

    total cycles in shared programs: 372286583 -> 372283851 (<.01%)
    cycles in affected programs: 833829 -> 831097 (-0.33%)
    helped: 265
    HURT: 16
    helped stats (abs) min: 2 max: 74 x̄: 11.81 x̃: 4
    helped stats (rel) min: 0.04% max: 9.07% x̄: 0.99% x̃: 0.35%
    HURT stats (abs)   min: 2 max: 130 x̄: 24.88 x̃: 8
    HURT stats (rel)   min: <.01% max: 12.31% x̄: 1.44% x̃: 0.27%
    95% mean confidence interval for cycles value: -12.30 -7.15
    95% mean confidence interval for cycles %-change: -1.06% -0.64%
    Cycles are helped.

    Iron Lake and GM45 had similar results. (GM45 shown)
    total instructions in shared programs: 5038653 -> 5038495 (<.01%)
    instructions in affected programs: 13939 -> 13781 (-1.13%)
    helped: 50
    HURT: 1
    helped stats (abs) min: 1 max: 15 x̄: 3.18 x̃: 4
    helped stats (rel) min: 0.33% max: 13.33% x̄: 2.24% x̃: 1.09%
    HURT stats (abs)   min: 1 max: 1 x̄: 1.00 x̃: 1
    HURT stats (rel)   min: 0.83% max: 0.83% x̄: 0.83% x̃: 0.83%
    95% mean confidence interval for instructions value: -3.73 -2.47
    95% mean confidence interval for instructions %-change: -3.16% -1.21%
    Instructions are helped.

    total cycles in shared programs: 128118922 -> 128118228 (<.01%)
    cycles in affected programs: 134906 -> 134212 (-0.51%)
    helped: 50
    HURT: 0
    helped stats (abs) min: 2 max: 60 x̄: 13.88 x̃: 18
    helped stats (rel) min: 0.06% max: 3.19% x̄: 0.74% x̃: 0.70%
    95% mean confidence interval for cycles value: -16.54 -11.22
    95% mean confidence interval for cycles %-change: -0.95% -0.53%
    Cycles are helped.

    Reviewed-by: Kenneth Graunke <[email protected]>
* intel/fs: Make alpha test work with MRT and sample mask | Danylo Piliaiev | 2019-03-25 | 1 | -18/+17
    Fix the order of src0_alpha and sample mask in fb payload.

    From SKL PRM Volume 7, "Data Payload Register Order for Render Target
    Write Messages":

      Type    S0A  oM  sZ  oS  M2      M3      M4
      SIMD8    1    1   0   0  s0A     oM      R
      SIMD16   1    1   0   0  1/0s0A  3/2s0A  oM

    It also fixes working of alpha to coverage with sample mask on GEN6
    since now they are in correct order.

    Signed-off-by: Danylo Piliaiev <[email protected]>
    Signed-off-by: Francisco Jerez <[email protected]>
    Reviewed-by: Francisco Jerez <[email protected]>
* i965,iris,anv: Make alpha to coverage work with sample mask | Danylo Piliaiev | 2019-03-25 | 4 | -4/+99
    From "Alpha Coverage" section of SKL PRM Volume 7:

      "If Pixel Shader outputs oMask, AlphaToCoverage is disabled in
       hardware, regardless of the state setting for this feature."

    From OpenGL spec 4.6, "15.2 Shader Execution":

      "The built-in integer array gl_SampleMask can be used to change the
       sample coverage for a fragment from within the shader."

    From OpenGL spec 4.6, "17.3.1 Alpha To Coverage":

      "If SAMPLE_ALPHA_TO_COVERAGE is enabled, a temporary coverage value
       is generated where each bit is determined by the alpha value at the
       corresponding sample location. The temporary coverage value is then
       ANDed with the fragment coverage value to generate a new fragment
       coverage value."

    Similar wording could be found in Vulkan spec 1.1.100,
    "25.6. Multisample Coverage".

    Thus we need to compute alpha to coverage dithering manually in the
    shader and replace the sample mask store with the bitwise AND of the
    sample mask and the alpha to coverage dithering.

    The following formula is used to compute the final sample mask:

      m = int(16.0 * clamp(src0_alpha, 0.0, 1.0))
      dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) |
                    0x0808 * (m & 2) | 0x0100 * (m & 1)
      sample_mask = sample_mask & dither_mask

    Credits to Francisco Jerez <[email protected]> for creating it. It
    gives a number of ones proportional to the alpha for 2, 4, 8 or 16
    least significant bits of the result.

    GEN6 hardware does not have an issue with simultaneous usage of sample
    mask and alpha to coverage, however due to the wrong sending order of
    oMask and src0_alpha it is still affected by it.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109743
    Signed-off-by: Danylo Piliaiev <[email protected]>
    Reviewed-by: Francisco Jerez <[email protected]>
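    For reference, the dither-mask formula quoted above written out as
    plain C; this is a sketch of the arithmetic only, not the code the
    pass emits into the shader:

      #include <stdint.h>

      static uint32_t
      alpha_to_coverage_mask(float src0_alpha, uint32_t sample_mask)
      {
         /* m = int(16.0 * clamp(src0_alpha, 0.0, 1.0)) */
         float a = src0_alpha < 0.0f ? 0.0f :
                   src0_alpha > 1.0f ? 1.0f : src0_alpha;
         uint32_t m = (uint32_t)(16.0f * a);

         uint32_t dither_mask = 0x1111 * ((0xfea80 >> (m & ~3)) & 0xf) |
                                0x0808 * (m & 2) |
                                0x0100 * (m & 1);

         return sample_mask & dither_mask;
      }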
* compiler/nir: add lowering for 16-bit flrp | Iago Toral Quiroga | 2019-03-25 | 1 | -0/+1
    And enable it on Intel.

    v2:
    - Squash the change to enable it on Intel (Jason)

    Reviewed-by: Jason Ekstrand <[email protected]>
* compiler/nir: add lowering option for 16-bit fmod | Iago Toral Quiroga | 2019-03-25 | 1 | -0/+1
    And enable it on Intel.

    v2:
    - Squash the change to enable this lowering on Intel (Jason)

    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/compiler: handle GLSL_TYPE_INTERFACE as GLSL_TYPE_STRUCT | Caio Marcelo de Oliveira Filho | 2019-03-23 | 3 | -3/+3
    Reviewed-by: Jason Ekstrand <[email protected]>
* anv,radv,turnip: Lower TG4 offsets with nir_lower_tex | Jason Ekstrand | 2019-03-21 | 1 | -0/+1
    v2: turn on for turnip as well (Karol Herbst)

    Reviewed-by: Karol Herbst <[email protected]>
* intel/nir: Lower array-deref-of-vector UBO and SSBO loads | Jason Ekstrand | 2019-03-15 | 1 | -0/+11
    This fixes a serious performance issue with DXVK:
    https://github.com/doitsujin/dxvk/issues/937

    This was caused by a recent change to improve performance on RADV
    which back-fired on ANV and killed performance for some apps:
    https://github.com/doitsujin/dxvk/commit/e5a06d3f4a103a54cd4eb51970fedee405d1d698

    Throwing in this bit of lowering lets us come along and CSE those UBO
    loads (or copy-prop for SSBO load) and get one load where we
    previously would have gotten several.

    VkPipeline-db results on Kaby Lake:

    total instructions in shared programs: 5115361 -> 5073185 (-0.82%)
    instructions in affected programs: 1754333 -> 1712157 (-2.40%)
    helped: 5331
    HURT: 63

    total cycles in shared programs: 2544501169 -> 2481144545 (-2.49%)
    cycles in affected programs: 2531058653 -> 2467702029 (-2.50%)
    helped: 9202
    HURT: 4323

    total loops in shared programs: 3340 -> 3331 (-0.27%)
    loops in affected programs: 9 -> 0
    helped: 9
    HURT: 0

    total spills in shared programs: 3246 -> 3053 (-5.95%)
    spills in affected programs: 384 -> 191 (-50.26%)
    helped: 10
    HURT: 5

    total fills in shared programs: 4626 -> 4452 (-3.76%)
    fills in affected programs: 439 -> 265 (-39.64%)
    helped: 10
    HURT: 5

    All of the shaders with hurt spilling were in Rise of the Tomb Raider
    which also had shaders solidly helped in the spilling department. Not
    shown in those results (because I've not had success dumping the
    shaders) is Witcher 3 where this reduces spilling and improves
    over-all perf by around 20-25%.

    There were no shader-db changes. Apparently, this just isn't a pattern
    that happens in OpenGL.

    Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
    Cc: "19.0" [email protected]
* i965: Stop setting LowerBufferInterfaceBlocks | Jason Ekstrand | 2019-03-15 | 1 | -1/+0
    Instead, we do UBO and SSBO deref lowering in NIR after we've given it
    a chance to optimize SSBO access:

    Shader-db results on Kaby Lake:

    total instructions in shared programs: 15235775 -> 15235484 (<.01%)
    instructions in affected programs: 14992 -> 14701 (-1.94%)
    helped: 19
    HURT: 20

    total cycles in shared programs: 339220331 -> 339027307 (-0.06%)
    cycles in affected programs: 79831981 -> 79638957 (-0.24%)
    helped: 540
    HURT: 602

    total loops in shared programs: 4402 -> 4348 (-1.23%)
    loops in affected programs: 186 -> 132 (-29.03%)
    helped: 27
    HURT: 0

    total spills in shared programs: 23261 -> 23234 (-0.12%)
    spills in affected programs: 38 -> 11 (-71.05%)
    helped: 1
    HURT: 0

    total fills in shared programs: 31442 -> 31371 (-0.23%)
    fills in affected programs: 98 -> 27 (-72.45%)
    helped: 1
    HURT: 0

    LOST:   12
    GAINED: 12

    Most of the help and hurt in instruction counts was just churn caused
    by re-ordering of optimizations and the fact that the NIR deref
    lowering code is emitting slightly different instructions. Nothing was
    hurt by more than three instructions and most things weren't helped by
    more than four.

    The primary exception to this is one Car Chase shader:

      shaders/non-free/gfxbench4/carchase/341.shader_test CS SIMD32: 1144 -> 821 (-28.23%)

    There is also one compute shader in Manhattan 3.1 and a fragment
    shader in the UE4 Shooter Game demo that now get a loop partially
    unrolled. Those showed up in the results as hurt instructions but were
    manually removed to get the results above.

    The lost/gained was a dozen Car Chase shaders that went from SIMD8 to
    SIMD16 thanks to improved register pressure:

      shaders/non-free/gfxbench4/carchase/366.shader_test CS
      shaders/non-free/gfxbench4/carchase/368.shader_test CS
      shaders/non-free/gfxbench4/carchase/370.shader_test CS
      shaders/non-free/gfxbench4/carchase/372.shader_test CS
      shaders/non-free/gfxbench4/carchase/376.shader_test CS
      shaders/non-free/gfxbench4/carchase/378.shader_test CS
      shaders/non-free/gfxbench4/carchase/380.shader_test CS
      shaders/non-free/gfxbench4/carchase/382.shader_test CS
      shaders/non-free/gfxbench4/carchase/384.shader_test CS
      shaders/non-free/gfxbench4/carchase/388.shader_test CS
      shaders/non-free/gfxbench4/carchase/4.shader_test CS
      shaders/non-free/gfxbench4/carchase/6.shader_test CS

    Given how much it appeared to be improved, I ran Car Chase on my
    laptop. Unfortunately, I wasn't able to see any measurable
    improvement. It might be helped by 1-2% but it's in the noise. It does
    render correctly as far as I can tell so the improvement is
    legitimate.

    All of the loops that got deleted were in dolphin uber shaders. I've
    had no opportunity to test them for correctness or performance.

    Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* i965: Disable ARB_fragment_shader_interlock for platforms prior to GEN9 | Plamena Manolova | 2019-03-14 | 1 | -0/+1
    ARB_fragment_shader_interlock depends on memory fences to ensure
    fragment ordering and this ordering guarantee is only supported from
    GEN9 onwards.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109980
    Fixes: 939312702e35 "i965: Add ARB_fragment_shader_interlock support."
    Signed-off-by: Plamena Manolova <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/nir: Combine store_derefs to improve code from SPIR-V | Caio Marcelo de Oliveira Filho | 2019-03-13 | 1 | -0/+1
    Due to the lack of a write mask in SPIR-V store, generators may
    produce multiple stores to the same vector but using different array
    derefs. Use the combining store pass to clean this up.

    For example,

      layout(binding = 3) buffer block {
          vec4 v;
      };

      void main() {
          v.x = 11;
          v.y = 22;
      }

    after going to SPIR-V and NIR, ends up with two store_derefs to v[0]
    and v[1]

      vec2 32 ssa_4 = deref_struct &ssa_3->field0 (ssbo vec4) /* &((block *)ssa_2)->field0 */
      vec2 32 ssa_6 = deref_array &(*ssa_4)[0] (ssbo float) /* &((block *)ssa_2)->field0[0] */
      intrinsic store_deref (ssa_6, ssa_7) (1, 0) /* wrmask=x */ /* access=0 */
      vec1 32 ssa_13 = load_const (0x00000001 /* 0.000000 */)
      vec2 32 ssa_14 = deref_array &(*ssa_4)[1] (ssbo float) /* &((block *)ssa_2)->field0[1] */
      intrinsic store_deref (ssa_14, ssa_15) (1, 0) /* wrmask=x */ /* access=0 */

    producing two different sends instructions in skl.

    The combining pass transforms the snippet above into

      vec2 32 ssa_4 = deref_struct &ssa_3->field0 (ssbo vec4) /* &((block *)ssa_2)->field0 */
      vec4 32 ssa_18 = vec4 ssa_7, ssa_15, ssa_16, ssa_17
      intrinsic store_deref (ssa_4, ssa_18) (3, 0) /* wrmask=xy */ /* access=0 */

    producing a single sends instruction.

    v2: Move this from spirv_to_nir into the general optimization pass for
        intel compiler. (Jason)

    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/nir: Combine store_derefs after vectorizing IO | Caio Marcelo de Oliveira Filho | 2019-03-13 | 1 | -0/+1
    Shader-db results for skl:

    total instructions in shared programs: 15232903 -> 15224781 (-0.05%)
    instructions in affected programs: 61246 -> 53124 (-13.26%)
    helped: 221
    HURT: 0

    total cycles in shared programs: 371440470 -> 371398018 (-0.01%)
    cycles in affected programs: 281363 -> 238911 (-15.09%)
    helped: 221
    HURT: 0

    Results for bdw are very similar.

    Reviewed-by: Jason Ekstrand <[email protected]>
* nir: Add a pass to combine store_derefs to same vector | Caio Marcelo de Oliveira Filho | 2019-03-13 | 1 | -0/+1
    v2: (all from Jason)
        Reuse existing function for the end of the block combinations.
        Check the SSA values are coming from the right place in tests.
        Document the case when the store to array_deref is reused.

    Reviewed-by: Jason Ekstrand <[email protected]>
* intel/fs: Fix opt_peephole_csel to not throw away saturates. | Kenneth Graunke | 2019-03-12 | 1 | -0/+1
    We were not copying the saturate bit from the original instruction to
    the new replacement instruction.

    This caused major misrendering in DiRT Rally on iris, where
    comparisons leading to discards failed due to the missing saturate,
    causing lots of extra garbage pixels to be drawn in text rendering,
    trees, and so on.

    This did not show up on i965 because st/nir performs a more aggressive
    version of nir_opt_peephole_select, yielding more b32csel operations.

    Fixes: 52c7df1643e i965/fs: Merge CMP and SEL into CSEL on Gen8+
    Reviewed-by: Lionel Landwerlin <[email protected]>
    Reviewed-by: Ian Romanick <[email protected]>
* intel/nir: Vectorize all IO | Jason Ekstrand | 2019-03-12 | 1 | -0/+17
    The IO scalarization pass that we run to help with linking ends up
    turning some shader I/O such as that for tessellation and geometry
    shaders into many scalar URB operations rather than one vector one. To
    alleviate this, we now vectorize the I/O once again. This fixes a 10%
    performance regression in the GfxBench tessellation test that was
    caused by scalarizing.

    Shader-db results on Kaby Lake:

    total instructions in shared programs: 15224023 -> 15220871 (-0.02%)
    instructions in affected programs: 342009 -> 338857 (-0.92%)
    helped: 1236
    HURT: 443

    total spills in shared programs: 23471 -> 23465 (-0.03%)
    spills in affected programs: 6 -> 0
    helped: 1
    HURT: 0

    total fills in shared programs: 31770 -> 31766 (-0.01%)
    fills in affected programs: 4 -> 0
    helped: 1
    HURT: 0

    Cycles was just a lot of churn due to moves being in different places.
    Most of the pure churn in instructions was +/- one or two instructions
    in fragment shaders.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510
    Fixes: 4434591bf56a "intel/nir: Call nir_lower_io_to_scalar_early"
    Fixes: 8d8222461f9d "intel/nir: Enable nir_opt_find_array_copies"
    Reviewed-by: Connor Abbott <[email protected]>
* intel/nir: Move lower_mem_access_bit_sizes to postprocess_nir | Jason Ekstrand | 2019-03-08 | 1 | -2/+1
    It doesn't really matter where this pass goes as long as it's after we
    call nir_lower_explicit_io and before we go into the back-end. Putting
    it in brw_postprocess_nir lets us move nir_lower_explicit_io
    significantly later in the pipeline.

    Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
* intel/compiler: silence uninitialized variable warning in opt_vector_float() | Brian Paul | 2019-03-08 | 1 | -1/+1
    Reviewed-by: Lionel Landwerlin <[email protected]>
* intel/nir: Move 64-bit lowering later | Jason Ekstrand | 2019-03-06 | 1 | -21/+13
    Now that we have a loop unrolling cost function and loop unrolling
    isn't going to kill us the moment we have a 64-bit op in a loop, we
    can go ahead and move 64-bit lowering later. This gives us the
    opportunity to do more optimizations and actually let the full
    optimizer run even on 64-bit ops rather than hoping one round of
    opt_algebraic will fix everything.

    This substantially reduces both fp64 shader compile times and the
    resulting code size. On the vs-isnan-dvec test from piglit:

    Before this commit:
      1684.63s user 17.29s system 99% cpu 28:28.24 total
      101479 instructions. 0 loops. 802452 cycles. 79:369 spills:fills.
      Peak memory usage (according to massif): 1.435 GB

    After this commit:
      179.64s user 7.75s system 99% cpu 3:07.92 total
      57316 instructions. 0 loops. 459287 cycles. 0:0 spills:fills.
      Peak memory usage (according to massif): 531.0 MB

    Reviewed-by: Matt Turner <[email protected]>
    Reviewed-by: Jordan Justen <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>
* nir/lower_doubles: Inline functions directly in lower_doubles | Jason Ekstrand | 2019-03-06 | 2 | -19/+6
    Instead of trusting the caller to already have created a softfp64
    function shader and added all its functions to our shader, we simply
    take the softfp64 shader as an argument and do the function inlining
    ourselves. This means that there are no more nasty functions lying
    around that the caller needs to worry about cleaning up.

    Reviewed-by: Matt Turner <[email protected]>
    Reviewed-by: Jordan Justen <[email protected]>
    Reviewed-by: Kenneth Graunke <[email protected]>