aboutsummaryrefslogtreecommitdiffstats
path: root/src/broadcom/compiler/nir_to_vir.c
Commit message (Collapse)AuthorAgeFilesLines
* v3d: Upload all of UBO[0] if any indirect load occurs.Eric Anholt2019-03-211-64/+1
| | | | | | | | | | | | | | | The idea was that we could skip uploading the constant-indexed uniform data and just upload the uniforms that are variably-indexed. However, since the VS bin and render shaders may have a different set of uniforms used, this meant that we had to upload the UBO for each of them. The first case is generally a fairly small impact (usually the uniform array is the most space, other than a couple of FSes in shader-db), while the second is a larger impact: 3DMMES2 was uploading 38k/frame of uniforms instead of 18k. Given that the optimization is of dubious value, has a big downside, and is quite a bit of code, just drop it. No change in shader-db. No change on 3DMMES2 (n=15).
* v3d: Move constant offsets to UBO addresses into the main uniform stream.Eric Anholt2019-03-211-8/+12
| | | | | | | | | | We'd end up with the constant offset in the uniform stream anyway, since they're bigger than small immediates. Avoids the extra uniforms and adds in the shader in favor of just adding once on the CPU. shader-db: total instructions in shared programs: 6496865 -> 6494851 (-0.03%) total uniforms in shared programs: 2119511 -> 2117243 (-0.11%)
* v3d: Eliminate the TLB and TLBU files.Eric Anholt2019-03-051-14/+13
| | | | We can just use the magic register file like we do for other magic waddrs.
* v3d: Use ldunif instructions for uniforms.Eric Anholt2019-03-051-4/+3
| | | | | | | | | | | | | | The idea is that for repeated use of the same uniform, we could avoid loading it on each consumer. The results look pretty good. total instructions in shared programs: 6413571 -> 6521464 (1.68%) total threads in shared programs: 154214 -> 154000 (-0.14%) total uniforms in shared programs: 2393604 -> 2119629 (-11.45%) total spills in shared programs: 4960 -> 4984 (0.48%) total fills in shared programs: 6350 -> 6418 (1.07%) Once we do scheduling at the NIR level, the register pressure (and thus also instructions) issues we see here will drop back down.
* v3d: Switch implicit uniforms over to being any qinst->uniform != ~0.Eric Anholt2019-03-051-16/+22
| | | | | I'm not sure why I didn't do this before -- it's clearly much simpler to add dumping of the extra thing than to have it as another implicit source.
* v3d: Move the stores for fixed function VS output reads into NIR.Eric Anholt2019-03-051-174/+61
| | | | | | | | | | | | | | | This lets us emit the VPM_WRITEs directly from nir_intrinsic_store_output() (useful once NIR scheduling is in place so that we can reduce register pressure), and lets future NIR scheduling schedule the math to generate them. Even in the meantime, it looks like this lets NIR DCE some more code and make better decisions. total instructions in shared programs: 6429246 -> 6412976 (-0.25%) total threads in shared programs: 153924 -> 153934 (<.01%) total loops in shared programs: 486 -> 483 (-0.62%) total uniforms in shared programs: 2385436 -> 2388195 (0.12%) Acked-by: Ian Romanick <[email protected]> (nir)
* v3d: Translate f2i(fround_even) as FTOIN.Eric Anholt2019-03-051-2/+9
| | | | | This appears to be just what the opcode does. Needed for equivalence when moving FF VPM stores into NIR.
* v3d: Stop treating exec masking specially.Eric Anholt2019-03-051-1/+0
| | | | | | | | | | | | | | | In our backend, the successor edges from the blocks only point to where QPU control flow goes, not where the notional control flow goes from a "break" or "continue" modifying the execution mask to resume writing to some channels later. As a result, this attempt at restricting live ranges ended up missing the live range of a value where a conditional break/continue was present in a loop before the later def of a variable. The previous commit ended up fixing the problem that the flag tried to solve. Fixes glsl-vs-loop-continue.shader_test and/or glsl-vs-loop-redundant-condition.shader_test based on register allocation results.
* v3d: Dump the VIR after register spilling if we were forced to.Eric Anholt2019-02-251-0/+10
| | | | | Spilling is unusual, but one often has to debug it when it happens, so dump it.
* v3d: Move i2b and f2b support into emit_comparison.Eric Anholt2019-02-181-13/+12
| | | | | This lets us save a resolve to NIR true/false for ifs and discard_if. No change in shader-db.
* v3d: Emit a simpler negate for the iabs implementation.Eric Anholt2019-02-181-2/+1
| | | | | | One program affected in my shader-db. instructions in affected programs: 110 -> 108 (-1.82%)
* v3d: Delay emitting ldvpm on V3D 4.x until it's actually used.Eric Anholt2019-02-181-6/+43
| | | | | | | | | | | For V3D 3.x, we emitted the ldvpms all at the top so that we didn't need to do VPM setup when the load_inputs are out of order. For V3D 4.x, we can reduce register pressure by delaying our loads until they're actually needed. This also avoids a bunch of silly MOVs in the pre-opt VIR dump. total instructions in shared programs: 6421415 -> 6419933 (-0.02%) total uniforms in shared programs: 2393139 -> 2393140 (<.01%) total threads in shared programs: 153864 -> 153906 (0.03%)
* v3d: Stop tracking num_inputs for VPM loads.Eric Anholt2019-02-181-2/+0
| | | | | It's unused in the VS (since we need vattr_sizes[] anyway), so move it to FS prog data.
* v3d: Add a function to describe what the c->execute.file check means.Eric Anholt2019-02-181-8/+8
| | | | | This is what pointed out that we were misusing the check for last_thrsw in the previous commit.
* v3d: Fix the check for "is the last thrsw inside control flow"Eric Anholt2019-02-181-8/+16
| | | | | | | | | The execute.file check used to be good enough, until I stopped setting up the execute mask for uniform ifs. No known tests fixed, noticed while doing a refactor. Fixes: 080506057310 ("v3d: Handle dynamically uniform IF statements with uniform control flow.")
* v3d: Fix f2b32 behavior.Eric Anholt2019-02-181-1/+6
| | | | | | Now that we don't have the vir_PF() magic, it's obvious that we were doing the wrong thing for f2b32 by allowing -0.0 to produce true instead of false.
* v3d: Kill off vir_PF(), which is hard to use right.Eric Anholt2019-02-181-23/+36
| | | | | | | | | | | | | | | | | | You were allowed to pass in any old temp so that you could hopefully fold the PF up into the def of the temp. If we couldn't find one, it implicitly generated a MOV(nop, reg). However, that PF could have different behavior depending on whether the def being folded into was a float or int opcode, which the caller doesn't necessarily control. Due to the fragility of the function, just switch all callers over to vir_set_pf(). This also encourages the callers to use a _dest call for the inst they're putting the PF on, eliminating a bunch of temps in the pre-optimization VIR. shader-db says the change is in the noise: total instructions in shared programs: 6226247 -> 6227184 (0.02%) instructions in affected programs: 851068 -> 852005 (0.11%)
* v3d: Do bool-to-cond for discard_if as well.Eric Anholt2019-02-181-16/+12
| | | | | | | | | | | | | | | | Turns this minimal conditional discard (glsl-fs-discard-01.shader_test): 0x3de0b086c5fe9000 fcmp.pushn -, r1, r5; mov r2, 0 0x3dec3086bbfc001f nop ; mov.ifa r2, -1 0x3c047186bbe80000 nop ; mov.pushz -, r2 0x3dea3186ba837000 setmsf.ifna -, 0 ; nop into: 0x3c00b186c582a000 fcmp.pushn -, r2, r5; nop 0x3de83186ba837000 setmsf.ifa -, 0 ; nop total instructions in shared programs: 6229820 -> 6226247 (-0.06%)
* v3d: Refactor bcsel and if condition handling.Eric Anholt2019-02-181-29/+14
| | | | | | | Both were doing the same thing to try to get a condition to predicate on. Noticed when I wanted to do this for discard_if as well. No change in shader-db.
* v3d: Add a helper function for getting a nop register.Eric Anholt2019-02-181-10/+9
| | | | Just a little refactor to explain what's going on with QFILE_NULL.
* v3d: Drop our hand-lowered nir_op_ffract.Eric Anholt2019-02-181-3/+1
| | | | | | | | | The NIR lowering works fine, though it causes some slight noise due to what looks like choices about propagating constants up multiply chains changing. total instructions in shared programs: 6229671 -> 6229820 (<.01%) total uniforms in shared programs: 2312171 -> 2312324 (<.01%)
* v3d: Drop a perf note about merging unpack_half_*, which has been implemented.Eric Anholt2019-02-181-3/+0
| | | | This is handled with copy-propagation now.
* v3d: Use the early_fragment_tests flag for the shader's disable-EZ field.Eric Anholt2019-02-181-0/+3
| | | | | | | | | | | Apparently we need disable-EZ flagged, not just "does Z writes". Fixes dEQP-GLES31.functional.image_load_store.early_fragment_tests.no_early_fragment_tests_depth_fbo on 7278, even though it passed in simulation. Signed-off-by: Eric Anholt <[email protected]> Fixes: 051a41d3d56e ("v3d: Add support for the early_fragment_tests flag.")
* v3d: Use the NIR lowering for isign instead of rolling our own.Eric Anholt2019-02-141-16/+1
| | | | | min/max instead of comparisons saves 2 instructions on fs-sign-int.shader_test.
* v3d: Store the actual mask of color buffers present in the key.Eric Anholt2019-02-051-4/+4
| | | | | | | If you only bound rt 1+, we'd still emit a write to the rt0 that isn't present (noticed while debugging an ext_framebuffer_multisample-alpha-to-coverage-no-draw-buffer-zero regression in another change).
* v3d: Add support for CS barrier() intrinsics.Eric Anholt2019-01-141-0/+50
|
* v3d: Add support for CS shared variable load/store/atomics.Eric Anholt2019-01-141-13/+73
| | | | | CS shared variables are handled effectively as SSBO access to a temporary buffer that will be allocated at CS dispatch time.
* v3d: Add support for CS workgroup/invocation id intrinsics.Eric Anholt2019-01-141-1/+53
| | | | | | We get a payload for the ivec3 workgroup and an int local invocation index, and we use the core lowering to turn into the global invocation id and the local invocation id ivec3s.
* v3d: Add support for shader_image_load_store.Eric Anholt2019-01-141-0/+48
| | | | | | This is only exposed on V3D 4.1+, because we didn't have the TMU write operations for images on 3.3 (To do GLES 3.1 there, you have to lower it to SSBO load/stores, which is a problem to solve later).
* v3d: Add SSBO/atomic counters support.Eric Anholt2019-01-141-6/+129
| | | | | So far I assume that all the buffers get written. If they weren't, you'd probably be using UBOs instead.
* v3d: Add support for matrix inputs to the FS.Eric Anholt2019-01-141-13/+14
| | | | | | We've been relying on linking splitting up our varying matrices into separate vectors, but with SSO that doesn't happen. Supporting matrix inputs isn't too hard, though.
* v3d: Stop scalarizing our uniform loads.Eric Anholt2019-01-041-56/+46
| | | | | | | | | | | | | | | We can pull a whole vector in a single indirect load. This saves a bunch of round-trips to the TMU, instructions for setting up multiple loads, references to the UBO base in the uniforms, and apparently manages to reduce register pressure as well. instructions in affected programs: 3086665 -> 2454967 (-20.47%) uniforms in affected programs: 919581 -> 721039 (-21.59%) threads in affected programs: 1710 -> 3420 (100.00%) spills in affected programs: 596 -> 522 (-12.42%) fills in affected programs: 680 -> 562 (-17.35%) Improves 3dmmes performance by 2.29312% +/- 0.139825% (n=5)
* v3d: Do UBO loads a vector at a time.Eric Anholt2019-01-041-35/+89
| | | | | | | In the process of adding support for SSBOs and CS shared vars, I ended up needing a helper function for doing TMU general ops. This helper can be that starting point, and saves us a bunch of round-trips to the TMU by loading a vector at a time.
* v3d: Handle dynamically uniform IF statements with uniform control flow.Eric Anholt2019-01-021-1/+65
| | | | | | | | | Loops will be trickier, since we need some analysis to figure out if the breaks/continues inside are uniform. Until we get that in NIR, this gets us some quick wins. total instructions in shared programs: 6192844 -> 6174162 (-0.30%) instructions in affected programs: 487781 -> 469099 (-3.83%)
* v3d: Fold comparisons for IF conditions into the flags for the IF.Eric Anholt2019-01-021-12/+26
| | | | | total instructions in shared programs: 6193810 -> 6192844 (-0.02%) instructions in affected programs: 800373 -> 799407 (-0.12%)
* v3d: Don't try to fold non-SSA-src comparisons into bcsels.Eric Anholt2019-01-021-1/+17
| | | | | There could have been a write of a src in between the comparison and the bcsel that would invalidate the comparison.
* v3d: Move the "Find the ALU instruction generating our bool" out of bcsel.Eric Anholt2019-01-021-6/+9
| | | | This will be reused for if statements.
* v3d: Simplify the emission of comparisons for the bcsel optimization.Eric Anholt2019-01-021-37/+24
| | | | | | I wanted to reuse the comparison stuff for nir_ifs, but for that I just want the flags and no destination value. Splitting the conditions from the destinations ended up cleaning the existing code up, anyway.
* v3d: Add support for gl_HelperInvocation.Eric Anholt2018-12-301-0/+8
| | | | | | We can just look at the MSF flags -- if they're unset, then we're definitely in a helper invocation. Fixes dEQP-GLES31.functional.shaders.helper_invocation.* with GLES3.1 enabled.
* v3d: Add support for textureSize() on MSAA textures.Eric Anholt2018-12-301-0/+1
| | | | | | Fixes failures in dEQP-GLES31.functional.shaders.builtin_functions.texture_size.samples_1_texture_2d in the GLES3.1 suite.
* v3d: Don't generate temps for comparisons.Eric Anholt2018-12-301-12/+14
| | | | | This was just generated work for vir_opt_dead_code and cluttered up the dumps.
* v3d: Drop unused count_nir_instrs() helper.Eric Anholt2018-12-301-18/+0
| | | | | This was for shader-db, but I haven't cared about NIR instruction counts in a long time.
* v3d: Hook up some shader-db output to GL_ARB_debug_output.Eric Anholt2018-12-301-0/+2
| | | | | | | This allows the original shader-db project's run.c runner to parse things easily, and is probably a good thing to have for GL_ARB_debug_output in general. I formatted it more like Intel's so I can mostly reuse their report script.
* nir/opt_peephole_select: Don't peephole_select expensive math instructionsIan Romanick2018-12-171-1/+1
| | | | | | | | | | | | | | | | On some GPUs, especially older Intel GPUs, some math instructions are very expensive. On those architectures, don't reduce flow control to a csel if one of the branches contains one of these expensive math instructions. This prevents a bunch of cycle count regressions on pre-Gen6 platforms with a later patch (intel/compiler: More peephole select for pre-Gen6). v2: Remove stray #if block. Noticed by Thomas. Signed-off-by: Ian Romanick <[email protected]> Reviewed-by: Thomas Helland <[email protected]> Reviewed-by: Lionel Landwerlin <[email protected]>
* nir/opt_peephole_select: Don't try to remove flow control around indirect loadsIan Romanick2018-12-171-1/+1
| | | | | | | | | | | | | | | | | | | That flow control may be trying to avoid invalid loads. On at least some platforms, those loads can also be expensive. No shader-db changes on any Intel platform (even with the later patch "intel/compiler: More peephole select"). v2: Add a 'indirect_load_ok' flag to nir_opt_peephole_select. Suggested by Rob. See also the big comment in src/intel/compiler/brw_nir.c. v3: Use nir_deref_instr_has_indirect instead of deref_has_indirect (from nir_lower_io_arrays_to_elements.c). v4: Fix inverted condition in brw_nir.c. Noticed by Lionel. Signed-off-by: Ian Romanick <[email protected]> Reviewed-by: Lionel Landwerlin <[email protected]>
* nir: Rename Boolean-related opcodes to include 32 in the nameJason Ekstrand2018-12-161-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a squash of a bunch of individual changes: nir/builder: Generate 32-bit bool opcodes transparently nir/algebraic: Remap Boolean opcodes to the 32-bit variant Use 32-bit opcodes in the NIR producers and optimizations Generated with a little hand-editing and the following sed commands: sed -i 's/nir_op_ball_fequal/nir_op_b32all_fequal/g' **/*.c sed -i 's/nir_op_bany_fnequal/nir_op_b32any_fnequal/g' **/*.c sed -i 's/nir_op_ball_iequal/nir_op_b32all_iequal/g' **/*.c sed -i 's/nir_op_bany_inequal/nir_op_b32any_inequal/g' **/*.c sed -i 's/nir_op_\([fiu]lt\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]ge\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]ne\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]eq\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fi]\)ne32g/nir_op_\1neg/g' **/*.c sed -i 's/nir_op_bcsel/nir_op_b32csel/g' **/*.c Use 32-bit opcodes in the NIR back-ends Generated with a little hand-editing and the following sed commands: sed -i 's/nir_op_ball_fequal/nir_op_b32all_fequal/g' **/*.c sed -i 's/nir_op_bany_fnequal/nir_op_b32any_fnequal/g' **/*.c sed -i 's/nir_op_ball_iequal/nir_op_b32all_iequal/g' **/*.c sed -i 's/nir_op_bany_inequal/nir_op_b32any_inequal/g' **/*.c sed -i 's/nir_op_\([fiu]lt\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]ge\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]ne\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fiu]eq\)/nir_op_\132/g' **/*.c sed -i 's/nir_op_\([fi]\)ne32g/nir_op_\1neg/g' **/*.c sed -i 's/nir_op_bcsel/nir_op_b32csel/g' **/*.c Reviewed-by: Eric Anholt <[email protected]> Reviewed-by: Bas Nieuwenhuizen <[email protected]> Tested-by: Bas Nieuwenhuizen <[email protected]>
* v3d: Drop in a bunch of notes about performance improvement opportunities.Eric Anholt2018-12-141-1/+34
| | | | | | These have all been floating in my head, and while I've thought about encoding them in issues on gitlab once they're enabled, they also make sense to just have in the area of the code you'll need to work in.
* v3d: Return the right gl_SampleMaskIn[] value.Eric Anholt2018-12-071-2/+1
| | | | | It's supposed to be the dispatched sample mask for this pixel, not the GL state's sample mask.
* v3d: Convert to using nir_src_as_uint() from const_value derefs.Eric Anholt2018-12-071-14/+10
| | | | | Follows 16870de8a0aa ("nir: Use nir_src_is_const and nir_src_as_* in core code") to clean up v3d.
* nir: Make boolean conversions sized just like the othersJason Ekstrand2018-12-051-4/+4
| | | | | | | | | Instead of a single i2b and b2i, we now have i2b32 and b2iN where N is one if 8, 16, 32, or 64. This leads to having a few more opcodes but now everything is consistent and booleans aren't a weird special case anymore. Reviewed-by: Connor Abbott <[email protected]>