aboutsummaryrefslogtreecommitdiffstats
path: root/src/gallium/drivers/vc4/vc4_qir_schedule.c
Commit message (Collapse)AuthorAgeFilesLines
* vc4: Fix register pressure cost estimates when a src appears twice.Eric Anholt2017-03-081-3/+13
| | | | | | | | | | This ended up confusing the scheduler for things like fabs (implemented as fmaxabs x, x) or squaring a number, and it would try to avoid scheduling them because it appeared more expensive than other instructions. Fixes failure to register allocate in dEQP-GLES2.functional.uniform_api.random.3 with almost no shader-db effects (+.35% max temps)
* vc4: Try to schedule QIR instructions between writing to and reading math.Eric Anholt2016-11-301-0/+22
| | | | | | | | | This helps us get the delay slots between SFU writes and reads filled. total instructions in shared programs: 94494 -> 93970 (-0.55%) instructions in affected programs: 59206 -> 58682 (-0.89%) 3DMMES performance +1.89967% +/- 0.157611% (n=10,9)
* vc4: Improve interleaving of texture coordinates vs results.Eric Anholt2016-11-301-3/+3
| | | | | | | | | | | | | | | | | | | | The latency_between was trying to handle the delay between the coordinate write ("before") and the corresponding sample read ("after"), but we were handing in the two instructions swapped. This meant that we tried to fit things between a tex_s and its *preceding* tex_result. This made us only interleave normal texture coordinates by accident, and pessimized UBO reads by pushing the tex_result collection earlier until there was nothing but it (and then its preceding coordinate setup) left. In addition to latency reduction, things end up packing better (probably due to reduced live ranges of the texture results): total instructions in shared programs: 98121 -> 94775 (-3.41%) instructions in affected programs: 91196 -> 87850 (-3.67%) 3DMMES performance +1.15569% +/- 0.124714% (n=8,10)
* vc4: Restructure texture insts as ALU ops with tex_[strb] as the dst.Eric Anholt2016-11-291-23/+27
| | | | | For now we're still just generating MOVs, but this will let us fold into other ops in the future. No difference on shader-db.
* vc4: Refactor qir_get_op_nsrc(enum qop) to qir_get_nsrc(struct qinst *).Eric Anholt2016-11-291-4/+4
| | | | | | Every caller was dereffing the qinst, and this will let us make the number of sources vary depending on the destination of the qinst so that we can have general ALU ops that store to tex_[strb] and get an implicit uniform.
* vc4: Make sure we don't overflow texture input/output FIFOs when threaded.Eric Anholt2016-11-221-2/+3
| | | | | | | | I dropped the first hunk of this change last minute when I decided it wasn't actually needed, and apparently failed to piglit it in simulation. The simulator threw an an assertion in gl-1.0-drawpixels-color-index, which queued up 5 coordinates (3 before a switch, two after) before loading the result.
* vc4: Add THRSW nodes after each tex sample setup in multithreaded mode.Eric Anholt2016-11-121-0/+24
| | | | | This is a suboptimal implementation, but Jonas Pfeil found that it was still a massive performance gain.
* vc4: Add some spec citations about texture fifo management.Eric Anholt2016-11-121-5/+37
|
* vc4: Emit resets of the uniform stream at the starts of blocks.Eric Anholt2016-07-131-0/+16
| | | | | | | | If a block might be entered from multiple locations, then the uniform stream will (probably) be at different points, and we need to make sure that it's pointing where we expect it to be. The kernel also enforces that any block reading a uniform resets uniforms, to prevent reading outside of the uniform stream by using looping.
* vc4: Define a QIR branch instructionEric Anholt2016-07-121-0/+8
| | | | | | This uses the branch condition code in inst->cond to jump to either successor[0] (condition matches) or successor[0] (condition doesn't match).
* vc4: Make vc4_qir_schedule handle each block in the program.Eric Anholt2016-07-121-14/+23
| | | | | | | | Basically we just treat each block independently. The only inter-block scheduling I can think of that would be be interesting would be to move texture result collection to after a short loop/if block that doesn't do texturing. However, the kernel disallows that as part of its security validation.
* vc4: Create a basic block structure and move the instructions into it.Eric Anholt2016-07-121-2/+3
| | | | | | | The optimization passes and scheduling aren't actually ready for multiple blocks with control flow yet (as seen by the "cur_block" references in them instead of iterating over blocks), but this creates the structures necessary for converting them.
* Remove wrongly repeated words in commentsGiuseppe Bilotta2016-06-231-1/+1
| | | | | | | | | | | | | | | | | Clean up misrepetitions ('if if', 'the the' etc) found throughout the comments. This has been done manually, after grepping case-insensitively for duplicate if, is, the, then, do, for, an, plus a few other typos corrected in fly-by v2: * proper commit message and non-joke title; * replace two 'as is' followed by 'is' to 'as-is'. v3: * 'a integer' => 'an integer' and similar (originally spotted by Jason Ekstrand, I fixed a few other similar ones while at it) Signed-off-by: Giuseppe Bilotta <[email protected]> Reviewed-by: Chad Versace <[email protected]>
* vc4: Fix doxygen warnings12.0-branchpointRhys Kidd2016-05-301-2/+2
| | | | | | | | Now that vc4 automated code documentation can be generated with doxygen, fix the warnings issued by Doxygen 1.8.11. Signed-off-by: Rhys Kidd <[email protected]> Reviewed-by: Emil Velikov <[email protected]>
* vc4: Allow TLB Z/color/stencil writes from any ALU operation in QIR.Eric Anholt2016-04-081-11/+24
| | | | | | | | This lets us write the Z directly from the FTOI for computed Z, and may let us coalesce color writes in the future. No change in my shader-db, but clearly drops an instruction in piglit's early-z test.
* vc4: Add missing scheduling dependency for MS color writes.Eric Anholt2016-04-081-0/+1
|
* vc4: Move discard handling to the condition flag.Eric Anholt2016-03-161-5/+0
| | | | | | | | | | | | | | | Now that the field exists in the instruction, we can make discards less special. As a bonus, that means that we should be able to merge some more .sf instructions together when we get around to that. This causes some scheduling changes, as it allows tlb_color_reads to be delayed past the discard condition setup. Since the tlb_color_read ends up later, this may mean performance improvements, but I haven't tested. total instructions in shared programs: 78114 -> 78035 (-0.10%) instructions in affected programs: 1922 -> 1843 (-4.11%) total estimated cycles in shared programs: 234318 -> 234329 (0.00%) estimated cycles in affected programs: 8200 -> 8211 (0.13%)
* vc4: Add missing braces in initializerRhys Kidd2016-02-151-1/+1
| | | | | | | | | | | | Silences the following GCC warning: mesa/src/gallium/drivers/vc4/vc4_qir_schedule.c: In function 'qir_schedule_instructions': mesa/src/gallium/drivers/vc4/vc4_qir_schedule.c:578:16: warning: missing braces around initializer [-Wmissing-braces] struct schedule_state state = { 0 }; ^ Signed-off-by: Rhys Kidd <[email protected]> Signed-off-by: Eric Anholt <[email protected]>
* vc4: Replace the SSA-style SEL operators with conditional MOVs.Eric Anholt2016-01-061-4/+3
| | | | | | | | | | | | | I'm moving away from QIR being SSA (since NIR is doing lots of SSA optimization for us now) and instead having QIR just be QPU operations with virtual registers. By making our SELs be composed of two MOVs, we could potentially coalesce the registers for the MOV's src and dst and eliminate the MOV. total instructions in shared programs: 88448 -> 88028 (-0.47%) instructions in affected programs: 39845 -> 39425 (-1.05%) total estimated cycles in shared programs: 246306 -> 245762 (-0.22%) estimated cycles in affected programs: 162887 -> 162343 (-0.33%)
* vc4: Do instruction scheduling on the QIR to hide texture fetch latency.Eric Anholt2015-12-181-0/+619
This is a rewrite of vc4_opt_qpu_schedule.c to operate on QIR. Texture fetch can probably take as much as the rest of the cycles of the program, so it's important to hide our other cycles during it (which is hard to do after register allocation). Also, we can queue up multiple texture requests before collecting the resulting samples, so that we keep the texture unit busy more of the time. High-settings openarena performance +2.35849% +/- 0.221154% (n=7). Also about 2-3% on the multiarb demo. 8 piglit tests (ext_framebuffer_multisample accuracy depthstencil) go from failing in rendering to failing in register allocation, but hopefully I can fix that up with some better register pressure handling here. total instructions in shared programs: 87723 -> 88448 (0.83%) instructions in affected programs: 78411 -> 79136 (0.92%) total estimated cycles in shared programs: 276583 -> 246306 (-10.95%) estimated cycles in affected programs: 265691 -> 235414 (-11.40%)