path: root/src/gallium/drivers/vc4/vc4_qpu_schedule.c
Commit message | Author | Age | Files | Lines
* vc4: Switch the post-RA scheduler over to the DAG datastructure. | Eric Anholt | 2019-03-11 | 1 | -110/+73
    Just a small code reduction from shared infrastructure.
* vc4: Reuse list_for_each_entry_rev(). | Eric Anholt | 2019-03-11 | 1 | -2/+2
* vc4: Rework scheduling of thread switch to cut one more NOP. | Eric Anholt | 2016-12-29 | 1 | -46/+75
    Jonas's patch got us most of the benefit of scheduling instructions into the delay slots of thread switch, but if there had been nothing to pair the thrsw with, it would move the thrsw up and leave a NOP where the thrsw was.

    Instead, don't pair anything with thrsw through the normal scheduling path, and have a separate helper function that inserts the thrsw earlier if possible and inserts any necessary NOPs.

    total instructions in shared programs: 93027 -> 92643 (-0.41%)
    instructions in affected programs: 14952 -> 14568 (-2.57%)
* vc4: Fill thread switching delay slots | Jonas Pfeil | 2016-12-29 | 1 | -7/+38
    Scan for instructions without a signal set in front of the switching instruction and move the signal up there.

    shader-db results:
    total instructions in shared programs: 94494 -> 93027 (-1.55%)
    instructions in affected programs: 23545 -> 22078 (-6.23%)

    v2: Fix re-emitting of the instruction in the loop trying to emit NOPs, drop a scheduling change from branch delay slots. (by anholt)

    Signed-off-by: Jonas Pfeil <[email protected]>
* vc4: Avoid false scheduling dependencies for LOAD_IMMs. | Eric Anholt | 2016-11-30 | 1 | -0/+5
    Noticed in shaders with branching, where we ended up scheduling delay slots near the start of a block for the uniforms reset setup.

    total instructions in shared programs: 93970 -> 93951 (-0.02%)
    instructions in affected programs: 3117 -> 3098 (-0.61%)
    3DMMES performance +0.423087% +/- 0.133521% (n=9,10)
* vc4: Add a note for the future about texture latency calculation. | Eric Anholt | 2016-11-29 | 1 | -0/+20
    Debugging a shader-db reported cycle count regression from the tex coalescing, I eventually figured out that the texture latencies were totally bogus. Really fixing it will probably involve mirroring vc4_qir_schedule.c's texture fifo management here.
* vc4: Add support for QPU scheduling of thread switch instructions. | Eric Anholt | 2016-11-12 | 1 | -2/+27
    This is vaguely based off of Jonas Pfeil's thread switch support branch.
* vc4: Don't pair up TLB scoreboard locking instructions early in QPU sched. | Eric Anholt | 2016-11-09 | 1 | -0/+12
    Jonas Pfeil noticed that we were putting passthrough tlb_z writes early in the shader, despite QIR and QPU scheduling both trying to delay scoreboard locking for as long as possible. The problem was that when trying to pair up QPU instructions, at some point the passthrough tlb_z would be the last one available and it would get paired, even if the other half would open up other instructions to be scheduled and we could have paired tlb_z with something later in the program. Also, since passthrough z is just a mov, it pairs up really easily.

    The proper fix would probably be to flip the order of scheduling instructions so we went from bottom to top (also relevant for branch delay slot scheduling). However, we can do a quick fix here to just not schedule a TLB lock until there's nothing but TLB left in the program, at a slight instruction cost (est .61% cycle count in shader-db) but a major fragment shader parallelism win.

    glmark2 results:
    texture:texture-filter=linear: +1.24481% +/- 0.626117% (n=15)
    bump:bump-render=height: 1.24991% +/- 0.154793% (n=136,133 -- screensaver outliers removed)
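The quick fix described in this message can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `struct candidate`, `choose_candidate()` and the `non_tlb_remaining` counter are hypothetical names, and the real scheduler applies the rule while pairing instructions rather than in a standalone chooser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ready-list entry; not the driver's actual struct. */
struct candidate {
    bool locks_scoreboard;  /* true for TLB accesses that take the scoreboard lock */
    int priority;           /* larger value means schedule sooner */
};

/*
 * Return the highest-priority candidate, skipping anything that would
 * lock the scoreboard while non-TLB instructions are still unscheduled.
 * Returns NULL when nothing is eligible.
 */
static const struct candidate *
choose_candidate(const struct candidate *ready, size_t count,
                 size_t non_tlb_remaining)
{
    const struct candidate *best = NULL;

    assert(ready != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        const struct candidate *c = &ready[i];

        /* Defer the TLB lock until only TLB work is left. */
        if (c->locks_scoreboard && non_tlb_remaining > 0)
            continue;

        if (best == NULL || c->priority > best->priority)
            best = c;
    }
    return best;
}
```

In this sketch a NULL result would correspond to emitting the current instruction unpaired (or a NOP), which is where the slight instruction-count cost mentioned above would come from.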
* vc4: Add QPU scheduling to handle MUL rotate sources. | Eric Anholt | 2016-08-25 | 1 | -0/+13
    We need MUL rotates to do ddx/ddy support.
* vc4: Emit resets of the uniform stream at the starts of blocks. | Eric Anholt | 2016-07-13 | 1 | -0/+21
    If a block might be entered from multiple locations, then the uniform stream will (probably) be at different points, and we need to make sure that it's pointing where we expect it to be. The kernel also enforces that any block reading a uniform resets uniforms, to prevent reading outside of the uniform stream by using looping.
* vc4: Add support for scheduling of branch instructions. | Eric Anholt | 2016-07-13 | 1 | -17/+103
    For now we don't fill the delay slots, and instead just drop in NOPs.
* vc4: Move the QPU instructions to schedule into each block. | Eric Anholt | 2016-07-13 | 1 | -44/+69
    We'll want to schedule them individually, to handle delay slots.
* vc4: Fix a pasteo in scheduling condition flag usage. | Eric Anholt | 2016-07-04 | 1 | -1/+1
    Noticed by code inspection. This hasn't been too big of a deal, because our cond usages all start out as adder ops, either MOVs or the FTOI for Z writes. MOVs *can* get converted to mul ops during scheduling, but apparently we hadn't hit this.
* vc4: Fix latency handling for QPU texture scheduling. | Eric Anholt | 2015-12-18 | 1 | -32/+50
    There's only high latency between a complete texture fetch setup and collecting its result, not between each step of setting up the texture fetch request.
* vc4: Keep sample mask writes from being reordered after TLB writes | Eric Anholt | 2015-12-18 | 1 | -1/+2
    Fixes a regression I noticed after introducing scheduling on the QIR.

    Cc: "11.1" <[email protected]>
* vc4: Add debugging of the estimated time to run the shader to shader-db. | Eric Anholt | 2015-12-11 | 1 | -15/+38
* vc4: Add support for storing sample mask. | Eric Anholt | 2015-12-04 | 1 | -0/+4
    From the API perspective, writing 1 bits can't turn on pixels that were off, so we AND it with the sample mask from the payload.
* vc4: Convert from simple_list.h to list.h | Eric Anholt | 2015-05-29 | 1 | -38/+26
    list.h is a nicer and more familiar set of list functions/macros.
* vc4: Add support for turning constant uniforms into small immediates. | Eric Anholt | 2014-12-17 | 1 | -2/+6
    Small immediates have the downside of taking over the raddr B field, so you might have less chance to pack instructions together thanks to raddr B conflicts. However, it also reduces some register pressure since it lets you load 2 "uniform" values in one instruction (avoiding a previous load of the constant value to a register), and increases some pairing for the same reason.

    total uniforms in shared programs: 16231 -> 13374 (-17.60%)
    uniforms in affected programs: 10280 -> 7423 (-27.79%)
    total instructions in shared programs: 40795 -> 41168 (0.91%)
    instructions in affected programs: 25551 -> 25924 (1.46%)

    In a previous version of this patch I had a reduction in instruction count by forcing the other args alongside a SMALL_IMM to be in the A file or accumulators, but that increases register pressure and had a bug in handling FRAG_Z. In this patch I just use raddr conflict resolution, which is more expensive. I think I'd rather tweak allocation to have some way to slightly prefer good choices for files in general, rather than risk failing to register allocate by forcing things into register classes.
* vc4: Do QPU scheduling across uniform loads. | Eric Anholt | 2014-12-09 | 1 | -28/+60
    This means another pass of reordering the uniform data store, but it lets us pair up a lot more instructions.

    total instructions in shared programs: 44639 -> 43176 (-3.28%)
    instructions in affected programs: 36938 -> 35475 (-3.96%)
* vc4: Populate the delay field better, and schedule high delay first. | Eric Anholt | 2014-12-09 | 1 | -1/+49
    This is a standard scheduling heuristic, and clearly helps.

    total instructions in shared programs: 46418 -> 44467 (-4.20%)
    instructions in affected programs: 42531 -> 40580 (-4.59%)
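The heuristic named here is commonly implemented as critical-path list scheduling: each node's delay is the longest latency-weighted path from it to the end of the block, and the ready node with the largest delay is picked first. The sketch below shows that general technique only; `struct sched_node`, `compute_delay()` and `pick_highest_delay()` are hypothetical names, and the driver's actual delay computation may differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical DAG node; not the driver's actual struct. */
struct sched_node {
    int latency;       /* cycles before dependents can issue; must be > 0 */
    int delay;         /* longest path to the end of the block; 0 = not yet computed */
    int num_children;  /* number of nodes that depend on this one */
    struct sched_node *children[MAX_CHILDREN];
};

/* Longest latency-weighted path from n to any leaf, memoized in n->delay. */
static int
compute_delay(struct sched_node *n)
{
    int max_child = 0;

    assert(n != NULL && n->latency > 0);

    if (n->delay != 0)
        return n->delay;

    for (int i = 0; i < n->num_children; i++) {
        int d = compute_delay(n->children[i]);
        if (d > max_child)
            max_child = d;
    }

    n->delay = n->latency + max_child;
    return n->delay;
}

/* Pick the ready node with the largest precomputed delay (NULL if none). */
static struct sched_node *
pick_highest_delay(struct sched_node **ready, size_t count)
{
    struct sched_node *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (best == NULL || ready[i]->delay > best->delay)
            best = ready[i];
    }
    return best;
}
```

Scheduling the node with the longest remaining chain first gives the scheduler the most room to overlap that chain's latency with independent work.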
* vc4: Skip raddr dependencies for 32-bit immediate loads. | Eric Anholt | 2014-12-09 | 1 | -2/+5
    These don't have raddr fields.
* vc4: Mark VPM read setup as impacting VPM reads, not writes. | Eric Anholt | 2014-12-09 | 1 | -1/+7
    Fixes assertion failures if we adjust scheduling priorities to emphasize VPM reads more.
* vc4: Add separate write-after-read dependency tracking for pairing. | Eric Anholt | 2014-12-05 | 1 | -20/+58
    If an operation is the last one to read a register, the instruction containing it can also include the op that has the next write to that register.

    total instructions in shared programs: 57486 -> 56995 (-0.85%)
    instructions in affected programs: 43004 -> 42513 (-1.14%)
* vc4: Fix inverted priority of instructions for QPU scheduling. | Eric Anholt | 2014-12-05 | 1 | -10/+10
    We were scheduling TLB operations as early as possible, and texture setup as late as possible. When I introduced prioritization, I visually inspected that an independent operation got moved above texture results collection, which tricked me into thinking it was working (but it was just because texture setup was being pushed late).

    total instructions in shared programs: 57651 -> 57486 (-0.29%)
    instructions in affected programs: 18532 -> 18367 (-0.89%)
* vc4: Pair up QPU instructions when scheduling. | Eric Anholt | 2014-12-01 | 1 | -17/+62
    We've got two mostly-independent operations in each QPU instruction, so try to pack two operations together. This is fairly naive (doesn't track read and write separately in instructions, doesn't convert ADD-based MOVs into MUL-based movs, doesn't reorder across uniform loads), but does show a decent improvement on shader-db-2.

    total instructions in shared programs: 59583 -> 57651 (-3.24%)
    instructions in affected programs: 47361 -> 45429 (-4.08%)
* vc4: Introduce scheduling of QPU instructions. | Eric Anholt | 2014-12-01 | 1 | -0/+693
    This doesn't reschedule much currently, just tries to fit things into the regfile A/B write-versus-read slots (the cause of the improvements in shader-db), and hide texture fetch latency by scheduling setup early and results collection late (haven't performance tested it). This infrastructure will be important for doing instruction pairing, though.

    shader-db2 results:
    total instructions in shared programs: 61874 -> 59583 (-3.70%)
    instructions in affected programs: 50677 -> 48386 (-4.52%)