summaryrefslogtreecommitdiffstats
path: root/src/gallium/drivers/vc4/vc4_qir.h
Commit message (Collapse)AuthorAgeFilesLines
* vc4: Upload CS/VS UBO uniforms together.Eric Anholt2019-04-101-33/+0
| | | | | | | | | | | | | | | | | Same as I did for V3D, drop all this code trying to GC the non-indirectly-loaded uniforms from the UBO that's used for indirect access of gallium cb[0]. While it does successfully drop some of those, it came at the cost of uploading the VS's indirect unifroms twice, for the bin and render versions of the shader. With the UBO loads simplified, I was also able to easily backport V3D's change to pack a UBO offset into the uniform_data[] field so that we don't need to do the add of the uniform base in the shader. As a bonus, now vc4 doesn't depend on mesa/st type_size functions. total uniforms in shared programs: 25514 -> 25490 (-0.09%) total instructions in shared programs: 77019 -> 76836 (-0.24%)
* vc4: Split UBO0 and UBO1 address uniform handling.Eric Anholt2019-04-101-1/+2
| | | | I'm going to extend how UBO0 works in a moment.
* vc4: Extend dumping of uniforms in QIR and in the command stream.Eric Anholt2018-08-071-0/+2
| | | | Similar to what I did for V3D, provide some description of the uniforms.
* broadcom/vc4: Allow binding non-zero constant buffers.Eric Anholt2018-03-091-0/+1
| | | | | We're going to use UBO loads for implementing YUV linear-to-T-format blits.
* nir: Move vc4's alpha test lowering to core NIR.Eric Anholt2017-10-101-1/+0
| | | | | | | | | | | | | I've been doing this inside of vc4, but vc5 wants it as well and it may be useful for other drivers (Intel has a related path for pre-gen6 with MRT, and freedreno had a TGSI path for it at one point). This required defining a common enum for the standard comparison functions, but other lowering passes are likely to also want that enum. v2: Add to meson.build as well. Acked-by: Rob Clark <[email protected]>
* Revert "vc4: Lazily emit our FS/VS input loads."Eric Anholt2017-03-081-7/+0
| | | | | | This reverts commit 292c24ddac5acc35676424f05291c101fcd47b3e. It broke a lot of GLES2 deqp, and I see at least one problem that will require some serious rework to fix.
* vc4: Lazily emit our FS/VS input loads.Eric Anholt2017-02-241-0/+7
| | | | | | | | | | | | | | | | | | | This reduces register pressure in both types of shaders, by reordering the input loads from the var->data.driver_location order to whatever order they appear first in the NIR shader. These instructions aren't reorderable at our QIR scheduling level because the FS takes two in lockstep to do an interpolation, and the VS takes multiple read instructions in a row to get a whole vec4-level attribute read. shader-db impact: total instructions in shared programs: 76666 -> 76590 (-0.10%) instructions in affected programs: 42945 -> 42869 (-0.18%) total max temps in shared programs: 9395 -> 9208 (-1.99%) max temps in affected programs: 2951 -> 2764 (-6.34%) Some programs get their max temps hurt, depending on the order that the load_input intrinsics appear, because we end up being unable to copy propagate an older VPM read into its only use.
* vc4: Track the last block we emitted at the top level.Eric Anholt2017-02-241-0/+1
| | | | | This will be used for delaying our VPM reads (which must be unconditional) until just before they're used.
* vc4: Avoid emitting small immediates for UBO indirect load address guards.Eric Anholt2017-02-101-0/+2
| | | | | | | | | | | | The kernel will reject our shader if we emit one here, and having 4, 8, or 12 as the top end of our UBO clamp rare is enough that it's not worth making the kernel let us. Fixes piglit fs-const-array-of-struct and fs-const-array-of-struct-of-array since recent GLSL linking changes made us get this as an indirect load of a uniform, instead of a tempoary. Cc: "13.0 17.0" <[email protected]>
* vc4: Add support for coalescing ALU ops into tex_[srtb] MOVs.Eric Anholt2016-11-291-0/+1
| | | | | | | | | | | This isn't as complete as I would like (can't merge interpolation because of the implicit r5 dependency, doesn't work with control flow), but this was cheap and easy. Improves 3DMMES Taiji performance by 1.15353% +/- 0.299896% (n=29, 16) total instructions in shared programs: 99810 -> 99059 (-0.75%) instructions in affected programs: 10705 -> 9954 (-7.02%)
* vc4: Make qir_for_each_inst_inorder() safe against removal.Eric Anholt2016-11-291-1/+1
| | | | | The dead code elimination wants it to be safe, and I actually got segfaults due to it being unsafe with the new coalescing pass.
* vc4: Split optimizing VPM writes from VPM reads.Eric Anholt2016-11-291-0/+1
| | | | | | The VPM write logic will be basically the same as the texture coordinate write logic we need, and it's not really related to the VPM read logic other than the reuse of the use_count array.
* vc4: Restructure texture insts as ALU ops with tex_[strb] as the dst.Eric Anholt2016-11-291-24/+16
| | | | | For now we're still just generating MOVs, but this will let us fold into other ops in the future. No difference on shader-db.
* vc4: Refactor qir_get_op_nsrc(enum qop) to qir_get_nsrc(struct qinst *).Eric Anholt2016-11-291-1/+1
| | | | | | Every caller was dereffing the qinst, and this will let us make the number of sources vary depending on the destination of the qinst so that we can have general ALU ops that store to tex_[strb] and get an implicit uniform.
* vc4: Replace the qinst src[] with a fixed-size array.Eric Anholt2016-11-291-1/+1
| | | | | | This may have made a tiny bit of sense when we had one 4-arg inst per shader, but if we only ever put 2 things in, having a pointer to 2 things almost every instruction is pointless indirection.
* vc4: Remove qir_inst4().Eric Anholt2016-11-291-5/+0
| | | | | This was used originally for unorm4x8 packs, but we now represent those as a series of packed movs.
* vc4: Don't conditionalize the src1 mov of qir_SEL().Eric Anholt2016-11-221-4/+2
| | | | | | | | | | | | | | | | | My thought in having both arguments conditionally moved was that it should theoretically save some power by not doing work in those channels. However, it ends up costing us instructions because we can't register-coalesce the first of the MOVs, and it also introduces extra scheduling dependencies. The instruction cost would swamp whatever power benefit I was hoping for. shader-db results: total instructions in shared programs: 100548 -> 99741 (-0.80%) instructions in affected programs: 42450 -> 41643 (-1.90%) With obvious outliers removed (I had an X11 emacs running over the network in the "after" case), 3DMMES Taiji showed 1.07231% +/- 0.488241% fps improvement (n=18, 30).
* vc4: Flag the last thread switch in the program as the last.Eric Anholt2016-11-121-0/+5
| | | | | | We don't allow the last thread switch to be inside control flow, to be sure that we hit the last state exactly once. If the last texturing was in control flow, fall back to single threaded.
* vc4: Add support for register allocation for threaded shaders.Eric Anholt2016-11-121-0/+7
| | | | | | We have two major requirements: Make sure that only the bottom half of the physical reg space is used, and make sure that none of our values are live in an accumulator across a switch.
* vc4: Add a thread switch QIR instruction.Eric Anholt2016-11-121-0/+10
| | | | | | | This will eventually be generated at the QIR level, so that vc4_qir_schedule.c can arrange the separation of tex_strb from tex_result correctly. It will also be important so that register allocation set the register classes appropriately for values that are live across the switch.
* vc4: Don't abort when a shader compile fails.Eric Anholt2016-11-091-0/+1
| | | | | | | | | It's much better to just skip the draw call entirely. Getting this information out of register allocation will also be useful for implementing threaded fragment shaders, which will need to retry non-threaded if RA fails. Cc: <[email protected]>
* vc4: Fix live intervals analysis for screening defs in if statements.Eric Anholt2016-10-061-2/+5
| | | | | | | | | If a conditional assignment is only conditioned on the exec mask, that's still screening off the value in the executed channels (and, since we're not storing to the unexcuted channels, we don't care what's in there). Fixes a bunch of extra register pressure on Processing's Ribbons demo, which is failing to allocate.
* vc4: Handle discards while in control flow.Eric Anholt2016-08-291-0/+1
| | | | | I missed this while adding loop support because the discard test inside a loop was crashing before, anyway. Fixes piglit glsl-fs-discard-04.
* vc4: Add support for MUL output rotation.Eric Anholt2016-08-251-0/+12
| | | | Extracted from a patch by jonasarrow on github.
* vc4: Add support for the 2-bit LOAD_IMM variants.Eric Anholt2016-08-251-0/+26
| | | | | Extracted and fixed up from a patch by jonasarrow on github. This ended up not getting used for ddx/ddy, but seems like it might still be useful.
* vc4: Add a QIR value for the QPU element register.Eric Anholt2016-08-251-0/+1
| | | | | This will be used in the ddx/ddy support for "Am I the top half?" or "Am I the left half?" checks.
* vc4: Fix GPU hangs with >16 varying values.Eric Anholt2016-08-241-0/+12
| | | | Fixes glsl-routing in piglit and hangs in glbenchmark 2.0.2.
* nir: Define system values for vc4's blending-lowering arguments.Eric Anholt2016-08-221-7/+0
| | | | | | | | | | | | | In the GLSL-to-NIR conversion of VC4, I had a bit of trouble with what I was calling the "state uniforms" that I was putting into the NIR fighting with its other lowering passes. Instead of using magic uniform base numbers in the backend, follow the lead of load_user_clip_plane and just define system values for them. v2: Fix unintended change to channel_num, drop unspecified const_index value on blend_const_color_r_float. Reviewed-by: Kenneth Graunke <[email protected]>
* vc4: Avoid VS shader recompiles by keeping a set of FS inputs seen so far.Eric Anholt2016-08-041-6/+1
| | | | | | | | | | | | We don't want to bake the whole array into the FS key, because of the hashing overhead. But we can keep a set of the arrays seen, and use a pointer to the copy in as the array's proxy. Between this and the previous patch, gl-1.0-blend-func now passes on hardware, where previously it was filling the 256MB CMA area with shaders and OOMing. Drops 712 shaders from shader-db.
* vc4: Avoid generating a custom shader per level in glGenerateMipmaps().Eric Anholt2016-08-031-1/+3
| | | | | | | | | | We were baking in the LOD of the source level to each shader. Instead, pass it in as a uniform -- this requires storing it to a temp register, but that's better than compiling a ton of separate shaders: total instructions in shared programs: 115032 -> 115036 (0.00%) instructions in affected programs: 96 -> 100 (4.17%) LOST: 572
* vc4: Speed up glGenerateMipmaps by avoiding shadow baselevel.Eric Anholt2016-07-151-0/+1
| | | | | | | | | | | | | To support general GL_TEXTURE_BASE_LEVEL we have to copy to a temporary miptree. However, if a single level is being selected, we can use the existing miptree and force all the sampling to be from that particular level. This avoids a ton of software fallbacks in glGenerateMipmaps(), which uses base levels in the blit implementation in gallium. Improves "glmark2 -b terrain" from 2 fps to 3 (perhaps some more precision would be useful?), and cuts its CPU usage during the benchmarking from ~30% to ~10% (total CPU time from 8.8s to 7.6s).
* vc4: Emit resets of the uniform stream at the starts of blocks.Eric Anholt2016-07-131-0/+12
| | | | | | | | If a block might be entered from multiple locations, then the uniform stream will (probably) be at different points, and we need to make sure that it's pointing where we expect it to be. The kernel also enforces that any block reading a uniform resets uniforms, to prevent reading outside of the uniform stream by using looping.
* vc4: Add support for scheduling of branch instructions.Eric Anholt2016-07-131-0/+11
| | | | For now we don't fill the delay slots, and instead just drop in NOPs.
* vc4: Move the QPU instructions to schedule into each block.Eric Anholt2016-07-131-0/+2
| | | | We'll want to schedule them individually, to handle delay slots.
* vc4: Add support for NIR loops and break/continue.Eric Anholt2016-07-121-0/+2
|
* vc4: Add support for storing to NIR registers in a non-SSA fashion.Eric Anholt2016-07-121-0/+12
| | | | | | | Previously, there were occasionally NIR registers in our programs, but they were always actually used SSA-only. Now that we're trying to support control flow, we need to actually conditionally move to registers based on whether channels are active or not.
* vc4: Define a QIR branch instructionEric Anholt2016-07-121-0/+15
| | | | | | This uses the branch condition code in inst->cond to jump to either successor[0] (condition matches) or successor[0] (condition doesn't match).
* vc4: Implement live intervals using a CFG.Eric Anholt2016-07-121-1/+15
| | | | | Right now our CFG is always a trivial single basic block, but that will change when enable loops.
* vc4: Create a basic block structure and move the instructions into it.Eric Anholt2016-07-121-3/+47
| | | | | | | The optimization passes and scheduling aren't actually ready for multiple blocks with control flow yet (as seen by the "cur_block" references in them instead of iterating over blocks), but this creates the structures necessary for converting them.
* vc4: Add a "qir_for_each_inst_inorder" macro and use it in many places.Eric Anholt2016-07-121-0/+3
| | | | | | | | We have the prior list_foreach() all over the code, but I need to move where instructions live as part of adding support for control flow. Start by just converting to a helper iterator macro. (The simpler "qir_for_each_inst()" will be used for the for-each-inst-in-a-block iterator macro later)
* vc4: Regularize instruction emit macrosEric Anholt2016-07-041-37/+24
| | | | | | ALU0 didn't have the _dest variant, and ALU2 didn't unset the def the way ALU1 did. This should make the ALU[012] macros much clearer, by moving most of their contents to vc4_qir.c
* vc4: Move SF removal to a separate peephole pass.Eric Anholt2016-07-041-0/+1
| | | | | | | | | The DCE pass is going to change significantly to handle control flow, while we don't really need to change it for the SF handling. We also need to add some more SF peephole optimization for SF updates generated by control flow support. No change on shader-db.
* vc4: Drop the dead QIR_PACK() macro.Eric Anholt2016-07-041-8/+0
| | | | | This isn't used since we switched to using the dst.pack field instead of custom instructions.
* vc4: Add support for vertex color clamping in the rasterizer.Eric Anholt2016-05-171-0/+1
| | | | | This gets us precompile of vertex shaders at the state tracker level as well.
* vc4: Add support for loading immediate values in QIR.Eric Anholt2016-05-061-0/+17
| | | | | | | This will be used for resetting the uniform stream in the presence of branching, but may also be useful as an optimization to reduce how many uniforms we have to copy out per draw call (in exchange for increasing icache pressure).
* vc4: Add a small QIR validate pass.Eric Anholt2016-05-061-0/+2
| | | | | This has caught a couple of bugs during loop development so far, and I should probably have written it long ago.
* vc4: When emitting an instruction to an existing temp, mark it non-SSA.Eric Anholt2016-05-061-0/+2
| | | | Prevents a bug in the later control-flow support series.
* vc4: Use NIR lowering for sRGB decode.Eric Anholt2016-05-021-5/+0
| | | | | This should get us the same decode code generated, but with a lot less custom code in the driver.
* vc4: Remove the CSE pass.Eric Anholt2016-05-021-1/+0
| | | | | | It's not doing anything according to shader-db now that we're using NIR. It would have had to be reworked significantly anyway, to handle control flow.
* vc4: Emit only one FRAG_Z or FRAG_W QIR opcode.Eric Anholt2016-05-021-2/+19
| | | | | | We were generating piles of FRAG_W for interpolation, only to CSE them away immediately. Since this is the only thing that CSE is doing for us any more, just avoid making the CSE work necessary.