path: root/src/intel/compiler
* intel/compiler: Implement TCS 8_PATCH mode and INTEL_DEBUG=tcs8 (Kenneth Graunke, 2019-05-14, 6 files, -31/+162)

    Our tessellation control shaders can be dispatched in several modes.

    - SINGLE_PATCH (Gen7+) processes a single patch per thread, with each
      channel corresponding to a different patch vertex.  PATCHLIST_N will
      launch (N / 8) threads.  If N is less than 8, some channels will be
      disabled, leaving some untapped hardware capabilities.  Conditionals
      based on gl_InvocationID are non-uniform, which means that they'll
      often have to execute both paths.  However, if there are fewer than 8
      vertices, all invocations will happen within a single thread, so
      barriers can become no-ops, which is nice.  We also burn a maximum of
      4 registers for ICP handles, so we can compile without regard for the
      value of N.  It also works in all cases.

    - DUAL_PATCH mode processes up to two patches at a time, where the first
      four channels come from patch 1, and the second group of four come
      from patch 2.  This tries to provide better EU utilization for small
      patches (N <= 4).  It cannot be used in all cases.

    - 8_PATCH mode processes 8 patches at a time, with a thread launched per
      vertex in the patch.  Each channel corresponds to the same vertex, but
      in each of the 8 patches.  This utilizes all channels even for small
      patches.  It also makes conditions on gl_InvocationID uniform, leading
      to proper jumps.  Barriers, unfortunately, become real.  Worse, for
      PATCHLIST_N, the thread payload burns N registers for ICP handles.
      This can burn up to 32 registers, or 1/4 of our register file, for URB
      handles.  For Vulkan (and DX), we know the number of vertices at
      compile time, so we can limit the amount of waste.  In GL, the patch
      dimension is dynamic state, so we either would have to waste all 32
      (not reasonable) or guess (badly) and recompile.  This is unfortunate.
      Because we can only spawn 16 thread instances, we can only use this
      mode for PATCHLIST_16 and smaller.  The rest must use SINGLE_PATCH.

    This patch implements the new 8_PATCH TCS mode, but leaves us using
    SINGLE_PATCH by default.  A new INTEL_DEBUG=tcs8 flag will switch to
    using 8_PATCH mode for testing and benchmarking purposes.

    We may want to consider using 8_PATCH mode in Vulkan in some cases.
    The data I've seen shows that 8_PATCH mode can be more efficient in
    some cases, but SINGLE_PATCH mode (the one we use today) is faster in
    other cases.  Ultimately, the TES matters much more than the TCS for
    performance, so the decision may not matter much.

    Reviewed-by: Jason Ekstrand <[email protected]>

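    As a rough illustration of the register-cost trade-off described above, here is a
    minimal C sketch (hypothetical names, not the Mesa heuristic; the patch itself keeps
    SINGLE_PATCH as the default and only enables 8_PATCH via INTEL_DEBUG=tcs8):

        #include <stdbool.h>

        enum tcs_dispatch_mode { TCS_SINGLE_PATCH, TCS_8_PATCH };

        /* 8_PATCH burns one ICP-handle register per patch vertex (up to 32),
         * while SINGLE_PATCH needs at most 4 regardless of N.  Only 16 thread
         * instances can be spawned, so 8_PATCH is limited to PATCHLIST_16. */
        static enum tcs_dispatch_mode
        pick_tcs_mode(unsigned patch_vertices, bool known_at_compile_time)
        {
           if (!known_at_compile_time)   /* GL: patch size is dynamic state */
              return TCS_SINGLE_PATCH;
           if (patch_vertices > 16)      /* not enough thread instances */
              return TCS_SINGLE_PATCH;
           return TCS_8_PATCH;
        }

        static unsigned
        icp_handle_grfs(enum tcs_dispatch_mode mode, unsigned patch_vertices)
        {
           return mode == TCS_8_PATCH ? patch_vertices : 4;
        }
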
* intel/compiler: Move ICP handle fetching into a helper function. (Kenneth Graunke, 2019-05-14, 2 files, -36/+47)

    This will be significantly different in 8_PATCH mode.

    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/compiler: Don't repeat dispatch max fixing condition (Kenneth Graunke, 2019-05-14, 1 file, -2/+5)

    Having a single flag will keep both places in sync if the condition
    gets more complicated.

    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/compiler: Rename invocation_id_mask to instance_id_mask (Kenneth Graunke, 2019-05-14, 1 file, -4/+4)

    The payload field is actually "instance" (thread number), which is
    used to calculate the invocation ID.

    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/compiler: Refactor TCS invocation ID setup into a helper (Kenneth Graunke, 2019-05-14, 2 files, -14/+23)

    When we add 8_PATCH mode, this will get a bit more complex, so we may
    as well start by putting it in a helper function.

    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/compiler: Repeat nir_opt_algebraic_late (Ian Romanick, 2019-05-14, 1 file, -1/+16)

    A tiny bit of help seems to come from nir_copy_prop.  Future patches
    will benefit from this change.  Doing more copy propagation on the
    vec4 backend led to a disaster in hurt cycles.

    v2: Fix typo in comment.  Noticed by Matt.

    All Gen8+ platforms had similar results. (Ice Lake shown)
    total instructions in shared programs: 17224634 -> 17224623 (<.01%)
    instructions in affected programs: 4586 -> 4575 (-0.24%)
    helped: 11    HURT: 0
    helped stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
    helped stats (rel) min: 0.19% max: 0.53% x̄: 0.27% x̃: 0.23%
    95% mean confidence interval for instructions value: -1.00 -1.00
    95% mean confidence interval for instructions %-change: -0.36% -0.19%
    Instructions are helped.

    total cycles in shared programs: 360828542 -> 360828714 (<.01%)
    cycles in affected programs: 151159 -> 151331 (0.11%)
    helped: 49    HURT: 28
    helped stats (abs) min: 1 max: 254 x̄: 26.41 x̃: 6
    helped stats (rel) min: 0.06% max: 12.02% x̄: 1.34% x̃: 0.42%
    HURT stats (abs)   min: 1 max: 196 x̄: 52.36 x̃: 15
    HURT stats (rel)   min: 0.05% max: 10.74% x̄: 2.55% x̃: 0.88%
    95% mean confidence interval for cycles value: -13.48 17.95
    95% mean confidence interval for cycles %-change: -0.69% 0.84%
    Inconclusive result (value mean confidence interval includes 0).

    Haswell, Ivy Bridge, and Sandy Bridge had similar results. (Haswell shown)
    total instructions in shared programs: 13529544 -> 13529542 (<.01%)
    instructions in affected programs: 358 -> 356 (-0.56%)
    helped: 2    HURT: 0

    total cycles in shared programs: 357290311 -> 357289678 (<.01%)
    cycles in affected programs: 178324 -> 177691 (-0.35%)
    helped: 48    HURT: 40
    helped stats (abs) min: 1 max: 201 x̄: 31.52 x̃: 13
    helped stats (rel) min: 0.06% max: 10.92% x̄: 1.71% x̃: 0.66%
    HURT stats (abs)   min: 1 max: 224 x̄: 22.00 x̃: 6
    HURT stats (rel)   min: 0.05% max: 15.84% x̄: 1.29% x̃: 0.31%
    95% mean confidence interval for cycles value: -18.28 3.89
    95% mean confidence interval for cycles %-change: -1.01% 0.32%
    Inconclusive result (value mean confidence interval includes 0).

    Iron Lake and GM45 had similar results. (Iron Lake shown)
    total instructions in shared programs: 8159110 -> 8158980 (<.01%)
    instructions in affected programs: 22719 -> 22589 (-0.57%)
    helped: 65    HURT: 0
    helped stats (abs) min: 1 max: 3 x̄: 2.00 x̃: 2
    helped stats (rel) min: 0.07% max: 1.05% x̄: 0.73% x̃: 0.74%
    95% mean confidence interval for instructions value: -2.06 -1.94
    95% mean confidence interval for instructions %-change: -0.78% -0.68%
    Instructions are helped.

    total cycles in shared programs: 188609448 -> 188609214 (<.01%)
    cycles in affected programs: 1875852 -> 1875618 (-0.01%)
    helped: 109    HURT: 104
    helped stats (abs) min: 2 max: 46 x̄: 5.30 x̃: 4
    helped stats (rel) min: 0.02% max: 0.90% x̄: 0.09% x̃: 0.07%
    HURT stats (abs)   min: 2 max: 20 x̄: 3.31 x̃: 2
    HURT stats (rel)   min: 0.01% max: 0.26% x̄: 0.04% x̃: 0.02%
    95% mean confidence interval for cycles value: -1.95 -0.25
    95% mean confidence interval for cycles %-change: -0.04% -0.01%
    Cycles are helped.

    Reviewed-by: Matt Turner <[email protected]>

* intel/fs: Allow cmod propagation to instructions with saturate modifier (Ian Romanick, 2019-05-14, 2 files, -9/+528)

    v2: Add unit tests.  Suggested by Matt.

    All Intel GPUs had similar results. (Ice Lake shown)
    total instructions in shared programs: 17229441 -> 17228658 (<.01%)
    instructions in affected programs: 159574 -> 158791 (-0.49%)
    helped: 489    HURT: 0
    helped stats (abs) min: 1 max: 5 x̄: 1.60 x̃: 1
    helped stats (rel) min: 0.07% max: 2.70% x̄: 0.61% x̃: 0.59%
    95% mean confidence interval for instructions value: -1.72 -1.48
    95% mean confidence interval for instructions %-change: -0.64% -0.58%
    Instructions are helped.

    total cycles in shared programs: 360944149 -> 360937144 (<.01%)
    cycles in affected programs: 1072195 -> 1065190 (-0.65%)
    helped: 254    HURT: 27
    helped stats (abs) min: 2 max: 234 x̄: 30.51 x̃: 9
    helped stats (rel) min: 0.04% max: 8.99% x̄: 0.75% x̃: 0.24%
    HURT stats (abs)   min: 2 max: 83 x̄: 27.56 x̃: 24
    HURT stats (rel)   min: 0.09% max: 3.79% x̄: 1.28% x̃: 1.16%
    95% mean confidence interval for cycles value: -30.11 -19.75
    95% mean confidence interval for cycles %-change: -0.70% -0.41%
    Cycles are helped.

    Reviewed-by: Matt Turner <[email protected]> [v1]

* intel/fs/ra: Spill without destroying the interference graph (Jason Ekstrand, 2019-05-14, 1 file, -13/+96)

    Instead of re-building the interference graph every time we spill, we
    modify it in place so we can avoid recalculating liveness and the
    whole O(n^2) interference graph building process.  We make a
    simplifying assumption in order to do so, which is that all spill/fill
    temporary registers live for the entire duration of the instruction
    around which we're spilling.  This isn't quite true because a spill
    into the source of an instruction doesn't need to interfere with its
    destination, for instance.  Not re-calculating liveness also means
    that we aren't adjusting spill costs based on the new liveness.  The
    combination of these things results in a bit of churn in spilling.
    It takes a large cut out of the run-time of shader-db on my laptop.

    Shader-db results on Kaby Lake:
    total instructions in shared programs: 15311224 -> 15311360 (<.01%)
    instructions in affected programs: 77027 -> 77163 (0.18%)
    helped: 11    HURT: 18

    total cycles in shared programs: 355544739 -> 355830749 (0.08%)
    cycles in affected programs: 203273745 -> 203559755 (0.14%)
    helped: 234    HURT: 190

    total spills in shared programs: 12049 -> 12042 (-0.06%)
    spills in affected programs: 2465 -> 2458 (-0.28%)
    helped: 9    HURT: 16

    total fills in shared programs: 25112 -> 25165 (0.21%)
    fills in affected programs: 6819 -> 6872 (0.78%)
    helped: 11    HURT: 16

    Total CPU time (seconds): 2469.68 -> 2360.22 (-4.43%)

    Reviewed-by: Kenneth Graunke <[email protected]>

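    A minimal sketch of the simplifying assumption above, built on the interference-graph
    helper from src/util/register_allocate.h (the surrounding names and liveness arrays
    are hypothetical):

        #include "util/register_allocate.h"

        /* A spill/fill temporary introduced at instruction 'ip' is treated as
         * live across that whole instruction, so it interferes with every node
         * whose live range covers ip.  This is how the graph can be patched in
         * place instead of recomputing liveness and rebuilding it from scratch. */
        static void
        add_spill_temp_interference(struct ra_graph *g, unsigned temp_node, int ip,
                                    const int *live_start, const int *live_end,
                                    unsigned node_count)
        {
           for (unsigned n = 0; n < node_count; n++) {
              if (live_start[n] <= ip && ip <= live_end[n])
                 ra_add_node_interference(g, temp_node, n);
           }
        }
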
* intel/fs/ra: Put the VGRFs at the end of the nodes (Jason Ekstrand, 2019-05-14, 1 file, -18/+32)

    This is slightly less convenient in some places but it will make it
    much easier when we want to start adding nodes dynamically.

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Re-arrange interference setup (Jason Ekstrand, 2019-05-14, 1 file, -217/+205)

    The old code was arranged by the type of interference being added.
    It would set up payload registers and then add payload interference
    for all VGRFs.  It would set up MRFs and add MRF interference for all
    VGRFs.  This commit re-arranges things to be organized differently.
    It first creates and sets up all RA nodes and then groups interference
    into two new categories: live range and instruction interference.
    Once all the RA nodes have been set up, it walks the list of VGRFs and
    sets up their live range interference, and then walks the list of
    instructions and sets up instruction interference.  This new
    arrangement will be advantageous for a future patch but, at the
    moment, it cuts 2% off the run-time of shader-db on my laptop.

    Shader-db results on Kaby Lake:
    total instructions in shared programs: 15311224 -> 15311224 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0    HURT: 0

    total cycles in shared programs: 355544739 -> 355544739 (0.00%)
    cycles in affected programs: 0 -> 0
    helped: 0    HURT: 0

    Total CPU time (seconds): 2523.45 -> 2469.68 (-2.13%)

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Do the spill loop inside RA (Jason Ekstrand, 2019-05-14, 2 files, -21/+28)

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Only add MRF hack interference if we're spilling (Jason Ekstrand, 2019-05-14, 1 file, -62/+50)

    The only use of the MRF hack these days is for spilling, and there we
    don't need the precise MRF usage information.  If we're spilling then
    we know pretty well how many MRFs are going to be used.  It is
    possible, if the only things that are spilled have fewer SIMD channels
    than the dispatch width of the shader, that this may be more MRFs than
    needed.  That's a risk we're willing to take.

    Shader-db results on Kaby Lake:
    total instructions in shared programs: 15311100 -> 15311224 (<.01%)
    instructions in affected programs: 16664 -> 16788 (0.74%)
    helped: 1    HURT: 5

    total cycles in shared programs: 355543197 -> 355544739 (<.01%)
    cycles in affected programs: 731864 -> 733406 (0.21%)
    helped: 3    HURT: 6

    The hurt shaders are all SIMD32 compute shaders where we reserve
    enough space for a 32-wide spill/fill but don't need it.

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Pull the guts of RA into its own class (Jason Ekstrand, 2019-05-14, 2 files, -76/+103)

    This accomplishes two things.  First, it makes interfaces which are
    really private to RA private to RA.  Second, it gives us a place to
    store some common stuff as we go through the algorithm.

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Move assign_regs further down in the file (Jason Ekstrand, 2019-05-14, 1 file, -70/+70)

    It's the main function from which all the other functions are called.
    It belongs at the bottom.

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Split building the interference graph into a helper (Jason Ekstrand, 2019-05-14, 1 file, -23/+42)

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Initialize grf_used with first_non_payload_grf (Jason Ekstrand, 2019-05-14, 1 file, -1/+1)

    There's no reason why we need to use the calculated payload_node_count
    value which is just first_non_payload_grf aligned up.  The grf_used
    value will be aligned up to 16 anyway (which is a much bigger
    alignment) before being handed off to hardware.

    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Stop adding RA interference to too many SENDS nodes (Jason Ekstrand, 2019-05-14, 1 file, -8/+3)

    We only have one node per VGRF so this was adding way too much
    interference.  No idea how we didn't catch this before.

    Shader-db results on Kaby Lake:
    total instructions in shared programs: 15311100 -> 15311100 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0    HURT: 0

    total cycles in shared programs: 355468050 -> 355543197 (0.02%)
    cycles in affected programs: 2472492 -> 2547639 (3.04%)
    helped: 17    HURT: 20

    Fixes: 014edff0d20d "intel/fs: Add interference between SENDS sources"
    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs/ra: Only add dest interference to sources that exist (Jason Ekstrand, 2019-05-14, 1 file, -1/+1)

    Fixes: 83dedb6354d "i965: Add src/dst interference for certain"
    Reviewed-by: Kenneth Graunke <[email protected]>

* intel/fs: Stop doing extra RA calls (Jason Ekstrand, 2019-05-14, 1 file, -19/+27)

    In the last phase of the schedule and RA loop, the RA call is
    redundant if we spill.  Immediately afterwards, we're going to see
    that we couldn't allocate without spilling and call back into RA and
    tell it to go ahead and spill.  We've known about it for a while but
    we've always brushed over it on the theory that, if you're going to
    spill, you'll be calling RA a bunch anyway and what does one extra RA
    hurt?  As it turns out, it hurts more than you'd expect.  Because the
    RA interference graph gets sparser with each spill and the RA
    algorithm is more efficient on sparser graphs, the RA call that we're
    duplicating is actually the most expensive call in the RA-and-spill
    loop.

    There's another extra RA call we do that's a bit harder to see, which
    this also removes.  If we try to compile a shader that isn't the
    minimum dispatch width and it fails to allocate without spilling, we
    call fail() to set an error but then go ahead and do the first
    spilling RA pass, and only after that's complete do we detect the
    fail and bail out.  By making minimum dispatch widths part of the
    spill condition, we side-step this problem.

    Getting rid of these extra RA calls takes the compile time of a nasty
    Aztec Ruins shader from about 28 seconds to about 26 seconds on my
    laptop.  It also makes shader-db 1.5% faster.

    Shader-db results on Kaby Lake:
    total instructions in shared programs: 15311100 -> 15311100 (0.00%)
    instructions in affected programs: 0 -> 0
    helped: 0    HURT: 0

    total cycles in shared programs: 355468050 -> 355468050 (0.00%)
    cycles in affected programs: 0 -> 0
    helped: 0    HURT: 0

    Total CPU time (seconds): 2524.31 -> 2486.63 (-1.49%)

    Reviewed-by: Kenneth Graunke <[email protected]>

* nir: allow specifying a set of opcodes in lower_alu_to_scalar (Jonathan Marek, 2019-05-10, 1 file, -2/+2)

    This can be used by both etnaviv and freedreno/a2xx, as they are both
    vec4 architectures with some instructions being scalar-only.

    Signed-off-by: Jonathan Marek <[email protected]>
    Reviewed-by: Christian Gmeiner <[email protected]>
    Reviewed-by: Eric Anholt <[email protected]>

* intel/fs/copy-prop: Don't walk all the ACPs for each instruction (Jason Ekstrand, 2019-05-10, 1 file, -11/+68)

    In order to set up KILL sets, the dataflow code was walking the entire
    array of ACPs for every instruction.  If you assume the number of ACPs
    increases roughly with the number of instructions, this is O(n^2).  As
    it turns out, regions_overlap() is not nearly as cheap as one would
    like and shows up as a significant chunk on perf traces.

    This commit changes things around and instead first builds an array of
    exec_lists which it uses like a hash table (keyed off ACP source or
    destination) similar to what's done in the rest of the copy-prop code.
    By first walking the list of ACPs and populating the table, and then
    walking instructions and only looking at ACPs which probably have the
    same VGRF number, we can reduce the complexity to O(n).  This takes
    the execution time of the piglit vs-isnan-dvec test from about 56.4
    seconds on an unoptimized debug build (what we run in CI) with
    NIR_VALIDATE=0 to about 38.7 seconds.

    Reviewed-by: Kenneth Graunke <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

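    A minimal sketch of the bucketing idea described above (illustrative C only; the
    actual pass stores fs-specific ACP entries in exec_lists and keys a second table
    by source VGRF):

        #define ACP_TABLE_SIZE 64

        struct acp_entry {
           unsigned dst_vgrf;        /* register written by the copy */
           unsigned src_vgrf;        /* register read by the copy */
           struct acp_entry *next;   /* next entry in the same bucket */
        };

        /* Insert an entry into the bucket for its destination VGRF. */
        static void
        acp_add(struct acp_entry *table[ACP_TABLE_SIZE], struct acp_entry *e)
        {
           unsigned b = e->dst_vgrf % ACP_TABLE_SIZE;
           e->next = table[b];
           table[b] = e;
        }

        /* When an instruction overwrites 'vgrf', only the matching bucket is
         * scanned for copies whose destination is that register, instead of
         * walking every ACP in the shader. */
        static void
        kill_copies_writing(struct acp_entry *table[ACP_TABLE_SIZE], unsigned vgrf,
                            void (*kill_cb)(struct acp_entry *))
        {
           for (struct acp_entry *e = table[vgrf % ACP_TABLE_SIZE]; e; e = e->next) {
              if (e->dst_vgrf == vgrf)
                 kill_cb(e);
           }
        }
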
* intel/fs/copy-prop: Purge unused ACPs (Jason Ekstrand, 2019-05-10, 1 file, -0/+19)

    If the destination of an ACP entry exists only within this block, then
    there's no need to keep it for dataflow analysis.  We can delete it
    from the out_acp table and avoid growing the bitsets any bigger than
    we absolutely have to.

    This reduces the maximum number of global ACP entries in the
    vs-isnan-dvec with software fp64 on Kaby Lake from 8630 to 3942 and
    takes the execution time of the piglit vs-isnan-dvec test from about
    1:16.2 on an unoptimized debug build (what we run in CI) with
    NIR_VALIDATE=0 to about 56.4 seconds.

    Reviewed-by: Kenneth Graunke <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/fs/copy-prop: Bump the hash table size to 64 (Jason Ekstrand, 2019-05-10, 1 file, -1/+1)

    While the number of ACPs is generally not huge compared to the number
    of blocks, 16 does seem a bit small.  Bumping it to 64 takes the
    execution time of the piglit vs-isnan-dvec test from about 1:18.1 on
    an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to
    about 1:16.2.

    Reviewed-by: Kenneth Graunke <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Unset flag reg when FB write is not predicated (Matt Turner, 2019-05-07, 1 file, -0/+1)

    In the FS IR we pretend that the instruction is predicated with
    (+f0.1) just for flag dependency tracking purposes.  Since the
    instruction doesn't support predication before Haswell, we unset the
    predicate; we should also unset the flag register so that we can
    round-trip the disassembly.

    Reviewed-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/disasm: Disassemble immediate value properly for dim (Sagar Ghuge, 2019-05-07, 1 file, -3/+12)

    On Haswell, for the dim instruction, we encode the immediate float
    value operand as a double float.

    v2: Fix comment (Matt Turner)

    Signed-off-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/disasm: Disassemble JIP offset for while (Sagar Ghuge, 2019-05-07, 1 file, -1/+2)

    Signed-off-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Anuj Phogat <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Replicate 16 bit immediate value correctly (Sagar Ghuge, 2019-05-07, 2 files, -0/+6)

    For the W or UW (signed or unsigned word) source types, the 16-bit
    value must be replicated in both the low and high words of the 32-bit
    immediate value.

    v2: Fix replication in other places as well
    v3: Fix a few nits (Matt Turner)

    Signed-off-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

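    For reference, a standalone sketch of the replication rule described above
    (hypothetical helper name):

        #include <stdint.h>

        /* A W/UW immediate must appear in both halves of the 32-bit immediate
         * field, e.g. pack_word_imm(0xFFFE) == 0xFFFEFFFE for the signed word -2. */
        static uint32_t
        pack_word_imm(uint16_t word)
        {
           return ((uint32_t)word << 16) | word;
        }
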
* intel/compiler: Print quad value in hex format (Sagar Ghuge, 2019-05-07, 1 file, -1/+1)

    Print the quad value the same way as an unsigned quad, so that we can
    distinguish between quarter-control values in the disassembly (e.g.
    1/2/3[Q]) and immediate quad values (e.g. 1Q).  This allows
    round-tripping through the assembler/disassembler.

    Signed-off-by: Sagar Ghuge <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Don't always require precise lowering of flrp (Ian Romanick, 2019-05-06, 1 file, -1/+1)

    No changes on any other Intel platforms.

    Iron Lake and GM45 had similar results. (Iron Lake shown)
    total instructions in shared programs: 8164367 -> 8135551 (-0.35%)
    instructions in affected programs: 3271235 -> 3242419 (-0.88%)
    helped: 13636    HURT: 90
    helped stats (abs) min: 1 max: 30 x̄: 2.13 x̃: 1
    helped stats (rel) min: 0.04% max: 10.77% x̄: 1.16% x̃: 0.97%
    HURT stats (abs)   min: 1 max: 4 x̄: 1.80 x̃: 2
    HURT stats (rel)   min: 0.26% max: 11.11% x̄: 1.76% x̃: 0.78%
    95% mean confidence interval for instructions value: -2.13 -2.07
    95% mean confidence interval for instructions %-change: -1.16% -1.13%
    Instructions are helped.

    total cycles in shared programs: 188719974 -> 188586222 (-0.07%)
    cycles in affected programs: 70415766 -> 70282014 (-0.19%)
    helped: 12563    HURT: 515
    helped stats (abs) min: 2 max: 600 x̄: 10.90 x̃: 6
    helped stats (rel) min: <.01% max: 5.48% x̄: 0.48% x̃: 0.27%
    HURT stats (abs)   min: 2 max: 54 x̄: 6.07 x̃: 4
    HURT stats (rel)   min: 0.01% max: 4.48% x̄: 0.24% x̃: 0.08%
    95% mean confidence interval for cycles value: -10.56 -9.90
    95% mean confidence interval for cycles %-change: -0.47% -0.45%
    Cycles are helped.

    LOST:   0
    GAINED: 13

    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Use the flrp lowering pass for all stages on Gen4 and Gen5 (Ian Romanick, 2019-05-06, 1 file, -10/+4)

    Previously lower_flrp32 was only set for vertex shaders.  Fragment
    shaders performed a(1-c)+bc lowering during code generation.

    The shaders with loops hurt are SIMD8 and SIMD16 shaders for a
    text-identical fragment shader.

    v2: Rebase on 26391cceaa1 ("intel/compiler: Lower ffma on Gen4 and Gen5").
    v3: Rebase on a004e95dd73 ("radeonsi/nir: create si_nir_opts() helper")

    Iron Lake
    total instructions in shared programs: 8211385 -> 8185974 (-0.31%)
    instructions in affected programs: 2503898 -> 2478487 (-1.01%)
    helped: 9936    HURT: 921
    helped stats (abs) min: 1 max: 155 x̄: 2.86 x̃: 2
    helped stats (rel) min: 0.10% max: 35.48% x̄: 1.67% x̃: 1.11%
    HURT stats (abs)   min: 1 max: 12 x̄: 3.24 x̃: 2
    HURT stats (rel)   min: 0.21% max: 13.64% x̄: 1.86% x̃: 0.89%
    95% mean confidence interval for instructions value: -2.43 -2.25
    95% mean confidence interval for instructions %-change: -1.41% -1.33%
    Instructions are helped.

    total cycles in shared programs: 188523186 -> 188401198 (-0.06%)
    cycles in affected programs: 71541604 -> 71419616 (-0.17%)
    helped: 11649    HURT: 1871
    helped stats (abs) min: 2 max: 930 x̄: 12.62 x̃: 6
    helped stats (rel) min: <.01% max: 44.61% x̄: 0.68% x̃: 0.25%
    HURT stats (abs)   min: 2 max: 138 x̄: 13.38 x̃: 8
    HURT stats (rel)   min: <.01% max: 10.99% x̄: 0.49% x̃: 0.17%
    95% mean confidence interval for cycles value: -9.42 -8.63
    95% mean confidence interval for cycles %-change: -0.54% -0.50%
    Cycles are helped.

    total loops in shared programs: 852 -> 856 (0.47%)
    loops in affected programs: 0 -> 4
    helped: 0    HURT: 4
    HURT stats (abs) min: 1 max: 1 x̄: 1.00 x̃: 1
    HURT stats (rel) min: 0.00% max: 0.00% x̄: 0.00% x̃: 0.00%
    95% mean confidence interval for loops value: 1.00 1.00
    95% mean confidence interval for loops %-change: 0.00% 0.00%
    Loops are HURT.

    LOST:   3
    GAINED: 12

    GM45
    total instructions in shared programs: 5046407 -> 5033694 (-0.25%)
    instructions in affected programs: 1303584 -> 1290871 (-0.98%)
    helped: 5010    HURT: 464
    helped stats (abs) min: 1 max: 155 x̄: 2.85 x̃: 2
    helped stats (rel) min: 0.10% max: 34.38% x̄: 1.63% x̃: 1.08%
    HURT stats (abs)   min: 1 max: 75 x̄: 3.39 x̃: 2
    HURT stats (rel)   min: 0.20% max: 13.04% x̄: 1.84% x̃: 0.87%
    95% mean confidence interval for instructions value: -2.45 -2.20
    95% mean confidence interval for instructions %-change: -1.40% -1.28%
    Instructions are helped.

    total cycles in shared programs: 128889476 -> 128812366 (-0.06%)
    cycles in affected programs: 44845402 -> 44768292 (-0.17%)
    helped: 6079    HURT: 940
    helped stats (abs) min: 2 max: 930 x̄: 15.16 x̃: 8
    helped stats (rel) min: <.01% max: 41.03% x̄: 0.71% x̃: 0.25%
    HURT stats (abs)   min: 2 max: 138 x̄: 16.01 x̃: 8
    HURT stats (rel)   min: <.01% max: 10.99% x̄: 0.50% x̃: 0.17%
    95% mean confidence interval for cycles value: -11.63 -10.34
    95% mean confidence interval for cycles %-change: -0.58% -0.52%
    Cycles are helped.

    total loops in shared programs: 633 -> 635 (0.32%)
    loops in affected programs: 0 -> 2
    helped: 0    HURT: 2

    total spills in shared programs: 60 -> 69 (15.00%)
    spills in affected programs: 54 -> 63 (16.67%)
    helped: 0    HURT: 1

    total fills in shared programs: 92 -> 105 (14.13%)
    fills in affected programs: 80 -> 93 (16.25%)
    helped: 0    HURT: 1

    LOST:   15
    GAINED: 15

    Reviewed-by: Jason Ekstrand <[email protected]> [v2]
    Reviewed-by: Matt Turner <[email protected]> [v2]

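    For reference, the lowering being discussed is the algebraic identity
    flrp(a, b, c) = a(1 - c) + bc; a scalar sketch of two equivalent forms
    (illustrative only, the real lowering operates on NIR/backend IR):

        #include <math.h>

        /* Straightforward a(1-c) + bc form. */
        static float
        flrp_ref(float a, float b, float c)
        {
           return a * (1.0f - c) + b * c;
        }

        /* Same value expressed as two fused multiply-adds:
         * fma(-c, a, a) == a(1-c), then fma(c, b, .) adds b*c. */
        static float
        flrp_fma(float a, float b, float c)
        {
           return fmaf(c, b, fmaf(-c, a, a));
        }
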
* nir: Use the flrp lowering pass instead of nir_opt_algebraic (Ian Romanick, 2019-05-06, 1 file, -0/+23)

    I tried to be very careful while updating all the various drivers,
    but I don't have any of that hardware for testing. :(

    i965 is the only platform that sets always_precise = true, and it is
    only set true for fragment shaders.

    Gen4 and Gen5 both set lower_flrp32 only for vertex shaders.  For
    fragment shaders, nir_op_flrp is lowered during code generation as
    a(1-c)+bc.  On all other platforms 64-bit nir_op_flrp, and on Gen11
    32-bit nir_op_flrp, are lowered using the old nir_opt_algebraic
    method.

    No changes on any other Intel platforms.

    v2: Add panfrost changes.

    Iron Lake and GM45 had similar results. (Iron Lake shown)
    total cycles in shared programs: 188647754 -> 188647748 (<.01%)
    cycles in affected programs: 5096 -> 5090 (-0.12%)
    helped: 3    HURT: 0
    helped stats (abs) min: 2 max: 2 x̄: 2.00 x̃: 2
    helped stats (rel) min: 0.12% max: 0.12% x̄: 0.12% x̃: 0.12%

    Reviewed-by: Matt Turner <[email protected]>

* nir: nir_shader_compiler_options: drop native_integers (Christian Gmeiner, 2019-05-07, 1 file, -1/+0)

    Drivers which do not support native integers should use a lowering
    pass to go from integers to floats.

    Signed-off-by: Christian Gmeiner <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>

* intel/fs: Assert when brw_fs_nir sees a nir_deref_instr (Caio Marcelo de Oliveira Filho, 2019-05-02, 1 file, -1/+1)

    Since 09f1de97a76 "anv,i965: Lower away image derefs in the driver",
    the backend compiler is not expected to handle any derefs, so let's
    assert on it.  This helps identify problems when a deref is not
    lowered and "leaks" into the backend compiler.

    Reviewed-by: Jason Ekstrand <[email protected]>

* delete autotools .gitignore files (Eric Engestrom, 2019-04-29, 1 file, -10/+0)

    One special case: `src/util/xmlpool/.gitignore` is not entirely
    deleted, as `xmlpool.pot` still gets generated (eg. by `ninja
    xmlpool-pot`).

    Signed-off-by: Eric Engestrom <[email protected]>
    Reviewed-by: Dylan Baker <[email protected]>

* intel/fs: Don't emit empty ELSE blocks. (Kenneth Graunke, 2019-04-28, 1 file, -4/+4)

    While we can clean this up later, it's trivial to not generate the
    stupid code in the first place, which saves some optimization work.

    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/fs: Don't handle texop_tex for shaders without implicit LOD (Caio Marcelo de Oliveira Filho, 2019-04-25, 2 files, -6/+2)

    These will be lowered by nir_lower_tex() with the
    lower_tex_when_implicit_lod_not_supported option, so they don't need
    the extra handling here.

    Reviewed-by: Rob Clark <[email protected]>
    Reviewed-by: Lionel Landwerlin <[email protected]>

* intel/compiler/fs/icl: Use dummy masked urb write for tess eval (Topi Pohjolainen, 2019-04-25, 1 file, -1/+50)

    One cannot write the URB arbitrarily, and therefore the message has to
    be carefully constructed.  The clever tricks originate from Kenneth
    and Jason, I'm just writing the patch.

    Fixes GPU hangs on ICL with Vulkan CTS.

    Reviewed-by: Kenneth Graunke <[email protected]>
    Signed-off-by: Topi Pohjolainen <[email protected]>

* Revert "intel/compiler: split is_partial_write() into two variants"Juan A. Suarez Romero2019-04-2511-54/+30
| | | | | | | | | | This reverts commit 40b3abb4d16af4cef0307e1b4904c2ec0924299e. It is not clear that this commit was entirely correct, and unfortunately it was pushed by error. CC: Jason Ekstrand <[email protected]> Acked-by: Jason Ekstrand <[email protected]>
* intel/compiler: fix uninit non-static variable. (v2) (Dave Airlie, 2019-04-25, 1 file, -0/+3)

    Pointed out by coverity.

    v2: init nir_locals also.

    Reviewed-by: Lionel Landwerlin <[email protected]>

* intel/fs: Fix D to W conversion in opt_combine_constants (Ian Romanick, 2019-04-23, 1 file, -1/+1)

    Found by GCC warning:

    src/intel/compiler/brw_fs_combine_constants.cpp: In function ‘bool needs_negate(const fs_reg*, const imm*)’:
    src/intel/compiler/brw_fs_combine_constants.cpp:306:34: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
       return ((reg->d & 0xffffu) < 0) != (imm->w < 0);
              ~~~~~~~~~~~~~~~~~~~^~~

    The result of the bit-and is a 32-bit value with the top bits all
    zero.  This will never be < 0.  Instead of masking off the bits, just
    cast to int16_t and let the compiler handle the actual conversion.

    Fixes: e64be391dd0 ("intel/compiler: generalize the combine constants pass")
    Cc: Iago Toral Quiroga <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>

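    A standalone sketch of the bug and the fix described above (hypothetical wrapper
    function, not the actual pass):

        #include <stdbool.h>
        #include <stdint.h>

        /* (d & 0xffffu) has unsigned type, so '< 0' is always false and the sign
         * of the 16-bit value is lost.  Casting to int16_t instead lets the
         * compiler do the narrowing conversion and preserves the sign bit. */
        static bool
        low_word_is_negative(int32_t d)
        {
           return (int16_t)d < 0;   /* was: (d & 0xffffu) < 0, always false */
        }
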
* intel/compiler: Lower ffma on Gen4 and Gen5 (Ian Romanick, 2019-04-23, 1 file, -0/+4)

    flrp32 is also a 3-source instruction, but there is another pending
    series that handles that for Gen4 and Gen5.

    v2: Rebase on "intel/compiler: Don't have sepearate, per-Gen nir_options"

    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Don't have sepearate, per-Gen nir_options (Ian Romanick, 2019-04-23, 1 file, -31/+11)

    Instead, just have separate scalar vs. vector nir_options and do
    per-Gen "fix ups".

    Suggested-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Jason Ekstrand <[email protected]>
    Reviewed-by: Matt Turner <[email protected]>

* intel/compiler: Improve fix_3src_operand() (Matt Turner, 2019-04-22, 1 file, -5/+18)

    Allow ATTR and IMM sources unconditionally (ATTR are just GRFs, and
    IMM will be handled by opt_combine_constants()).  Both are already
    allowed by opt_copy_propagation().

    Also allow FIXED_GRF if the regioning is 8,8,1.  Could also allow
    other stride=1 regions (e.g., 4,4,1) and scalar regions, but I don't
    think those occur.

    This is sufficient to allow a pass added in a future commit
    (fs_visitor::lower_linterp) to avoid emitting extra MOV instructions.

    I removed the 'src.stride > 1' case because it seems wrong: 3-src
    instructions on Gen6-9 are align16-only and can only do stride=1 or
    stride=0.  A run through Jenkins with an assert(src.stride <= 1)
    never triggers, so it seems that it was dead code.

    Reviewed-by: Rafael Antognolli <[email protected]>

* intel/compiler: Add unit tests for sat prop for different exec sizes (Matt Turner, 2019-04-22, 1 file, -0/+68)

    The two new unit tests verify that propagating a saturate between
    instructions of different exec sizes does not happen.

    Reviewed-by: Rafael Antognolli <[email protected]>

* intel/compiler: Use SIMD16 instructions in fs saturate prop unit test (Matt Turner, 2019-04-22, 1 file, -59/+59)

    Will allow us to test that propagation between instructions of
    different exec sizes does not happen (in the next commit).

    The stray-looking change in intervening_dest_write is to adjust the
    size of the texture result to keep the test functioning identically
    when the instructions' exec sizes are doubled.  Without the change,
    the texture does not overwrite the destination fully, as the unit
    test intends.

    Reviewed-by: Rafael Antognolli <[email protected]>

* intel/fs: Remove fs_generator::generate_linterp from gen11+. (Rafael Antognolli, 2019-04-22, 1 file, -44/+6)

    We now have a lowering pass that will do this at the fs_visitor
    level, so we can remove this code from gen11+.

    v2: Reduce size of the "i" array from 4 to 2 (Matt).

    Reviewed-by: Matt Turner <[email protected]>

* intel/fs: Add a lowering pass for linear interpolation. (Rafael Antognolli, 2019-04-22, 2 files, -0/+47)

    On gen11, instead of using a PLN instruction, we convert
    FS_OPCODE_LINTERP to 2 or 4 multiply adds.  That is done in the
    fs_generator code.

    This patch adds a lowering pass that does the same thing at the
    fs_visitor level.  It also drops the usage of NF types, since we
    don't need the extra precision and it lets us skip the accumulator.
    With all that, some optimizations will still be run on the generated
    code, and we should get better scheduling.

    v2: Update comment about saturation and conditional mod (Matt)

    Reviewed-by: Matt Turner <[email protected]>

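    For background, PLN evaluates a per-channel plane equation for an attribute; a
    scalar sketch of the equivalent multiply-add form (illustrative only, the real
    lowering emits SIMD MAD instructions on barycentric/delta registers):

        /* c0, cx, cy are the plane coefficients for one attribute and (x, y)
         * the per-channel offsets: value = c0 + cx*x + cy*y, i.e. two MADs
         * instead of a single PLN. */
        static float
        linterp_two_mads(float c0, float cx, float cy, float x, float y)
        {
           float t = c0 + cx * x;   /* MAD #1 */
           return t + cy * y;       /* MAD #2 */
        }
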
* intel/fs: Move the scalar-region conversion to the generator. (Rafael Antognolli, 2019-04-22, 4 files, -5/+5)

    Move the scalar-region conversion from the IR to the generator, so it
    doesn't affect the Gen11 path.  We need the non-scalar regioning for
    a later lowering pass that we are adding.

    v2: Better commit message (Matt)

    Reviewed-by: Matt Turner <[email protected]>

* intel/fs: Only propagate saturation if exec_size is the same. (Rafael Antognolli, 2019-04-22, 1 file, -1/+2)

    Otherwise it could propagate the saturation from a SIMD16 instruction
    into a SIMD8 instruction.  With that, only part of the destination
    register, which is the source of the move with saturation, would have
    been updated.

    Reviewed-by: Matt Turner <[email protected]>

* intel/fs: Add support for float16 to the fsign optimizations (Ian Romanick, 2019-04-20, 1 file, -6/+24)

    Commit ad98fbc2174 ("intel/fs: Refactor code generation for
    nir_op_fsign to its own function") criss-crossed with c2b8fb9a810
    ("anv/device: expose VK_KHR_shader_float16_int8 in gen8+"), and I was
    not paying enough attention when I rebased.  This adds back the
    float16 changes and enables the optimization.

    v2: Incorporate more changes from 19cd2f5debd and a8d8b1a1391 that I
    missed in the previous version.

    Fixes: ad98fbc2174 ("intel/fs: Refactor code generation for nir_op_fsign to its own function")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110474
    Reviewed-by: Matt Turner <[email protected]> [v1]