| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Required by Nine. Tested with util_run_tests.
It's added to softpipe, llvmpipe, and r300g/swtcl.
Tested-by: David Heidelberg <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This increases the cost of a raddr b conflict spill (save r3 to rb31, move
src1 to r3, move rb31 back to r3 when done, instead of just move src1 to
r3), but on average thanks to instruction pairing it's more worthwhile to
have another accumulator.
total instructions in shared programs: 46428 -> 46171 (-0.55%)
instructions in affected programs: 38030 -> 37773 (-0.68%)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The register allocator walks from the end of the nodes array looking for
trivially-allocatable things to put on the stack, meaning (assuming
everything is trivially colorable and gets put on the stack in a single
pass) the low node numbers get allocated first. The things allocated
first happen to get the lower-numbered registers, which is to say the fast
accumulators that can be paired more easily.
When we previously made the nodes match the temporary register numbers,
we'd end up putting the shader inputs (VS or FS) in the accumulators,
which are often long-lived values. By prioritizing the shortest-lived
values for allocation, we can get a lot more instructions that involve
accumulators, and thus fewer conflicts for raddr and WS.
total instructions in shared programs: 52870 -> 46428 (-12.18%)
instructions in affected programs: 52260 -> 45818 (-12.33%)
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since d8da6deceadf5e48201d848b7061dad17a5b7cac where the
state tracker started using UCMP on cayman a number of tests
regressed.
this seems to be r600g is doing CNDGE_INT for UCMP which is >= 0,
we should be doing CNDE_INT with reverse arguments.
Reviewed-by: Glenn Kennard <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
|
|
|
| |
See commits 5067506e and b6109de3 for the Coccinelle script.
Reviewed-by: Brian Paul <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The register allocator prefers low-index registers from vc4_regs[] in the
configuration we're using, which is good because it means we prioritize
allocating the accumulators (which are faster). On the other hand, it was
causing raddr conflicts because everything beyond r0-r2 ended up in
regfile A until you got massive register pressure. By interleaving, we
end up getting more instruction pairing from getting non-conflicting
raddrs and QPU_WSes.
total instructions in shared programs: 55957 -> 52719 (-5.79%)
instructions in affected programs: 46855 -> 43617 (-6.91%)
|
| |
|
|
|
|
|
|
|
|
| |
We can avoid it by carefully ordering the packing. This is important as a
step in giving r3 to the register allocator.
total instructions in shared programs: 56087 -> 55957 (-0.23%)
instructions in affected programs: 18368 -> 18238 (-0.71%)
|
| |
|
|
|
|
| |
This is being emitted now from st_glsl_to_tgsi.cpp.
|
|
|
|
| |
This is the maximum value allowed for this field.
|
|
|
|
|
|
|
|
| |
All uses of this require that the value be at least one, so it's
easier to report at least one than having to wrap all uses
in MAX2(max_compute_units, 1).
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Harvested GPUs have some of their render backends disabled, so
in order to prevent the hardware from trying to render things
with these disabled backends we need to correctly program
the PA_SC_RASTER_CONFIG register.
v2:
- Write RASTER_CONFIG for all SEs.
v3:
- Set GRBM_GFX_INDEX.INSTANCE_BROADCAST_WRITES bit.
- Set GRBM_GFX_INFEX.SH_BROADCAST_WRITES bit when done setting
PA_SC_RASTER_CONFIG.
- Get num_se and num_sh_per_se from kernel.
v4:
- Get correct value for num_se
- Remove loop for setting PA_SC_RASTER_CONFIG
- Only compute raster config when a backend has been disabled.
v5: Michel Dänzer
- Fix computation for chips with multiple SEs
https://bugs.freedesktop.org/show_bug.cgi?id=60879
CC: "10.4 10.3" <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Fixes R11G11B10F rendering, and is required for SRGB format support.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
| |
There were previously regressions regarding border colors, which the
updated swizzle logic resolves.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This is a hack since it uses the texture information together with the
sampler, but I don't see a better way to do it. In OpenGL, there is a
1:1 correspondence.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
| |
Expert debugging assistance provided by Chris Forbes.
Signed-off-by: Ilia Mirkin <[email protected]>
Reviewed-by: Rob Clark <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Multiple scenes per context are meant to be used so a new scene can be built
while another one is processed in rasterization. However, quite surprisingly,
this does not actually work (and according to git log, possibly never did,
though maybe it did at some point further back (5 years+) but was buggy)
because we always wait immediately on the rasterizer to finish the scene when
contexts (and hence setup/scene) is flushed. This means when we try to get
an empty scene later, any old one is already empty again.
Thus using multiple scenes is just a waste of memory (not too bad, since the
additional scenes are guaranteed to be empty, which means their size ought to
be one data block (64kB) plus the size of some structs), without actually
really doing anything. (There is also quite some code for the whole concept of
multiple scenes which doesn't really do much in practice, but keep it hoping
the wait-on-scene-flush can be fixed some day.)
Reviewed-by: Jose Fonseca <[email protected]>
|
|
|
|
|
| |
total instructions in shared programs: 56995 -> 56087 (-1.59%)
instructions in affected programs: 40503 -> 39595 (-2.24%)
|
|
|
|
|
| |
No difference on shader-db because we tend to have a lot of other
conflicts going on as well (like RADDR_A disagreements)
|
|
|
|
|
|
|
|
|
| |
If an operation is the last one to read a register, the instruction
containing it can also include the op that has the next write to that
register.
total instructions in shared programs: 57486 -> 56995 (-0.85%)
instructions in affected programs: 43004 -> 42513 (-1.14%)
|
|
|
|
|
|
|
|
|
|
|
| |
We were scheduling TLB operations as early as possible, and texture setup
as late as possible. When I introduced prioritization, I visually
inspected that an independent operation got moved above texture results
collection, which tricked me into thinking it was working (but it was just
because texture setup was being pushed late).
total instructions in shared programs: 57651 -> 57486 (-0.29%)
instructions in affected programs: 18532 -> 18367 (-0.89%)
|
|
|
|
|
| |
Avoids assertion failures in vc4_qpu_validate.c if we happen to find the
right set of operations available.
|
|
|
|
|
| |
This might happen if the blending functions are set up to not actually use
the destination color/alpha, for example.
|
|
|
|
|
| |
This is nice when you're tracking down which command list is hanging the
GPU.
|
|
|
|
|
|
| |
Similar to the scheme that Ilia put in place for a3xx.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
|
| |
Also seems to fix kill/discard.
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Ilia Mirkin <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Jan Vesely <[email protected]>
Reviewed-by: Tom Stellard <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Jan Vesely <[email protected]>
Reviewed-by: Tom Stellard <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
We've got two mostly-independent operations in each QPU instruction, so
try to pack two operations together. This is fairly naive (doesn't track
read and write separately in instructions, doesn't convert ADD-based MOVs
into MUL-based movs, doesn't reorder across uniform loads), but does show
a decent improvement on shader-db-2.
total instructions in shared programs: 59583 -> 57651 (-3.24%)
instructions in affected programs: 47361 -> 45429 (-4.08%)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since 73dd50acf6d244979c2a657906aa56d3ac60d550
glsl: implement switch flow control using a loop
The SB backend was falling over in an assert or crashing.
Tracked this down to the loops having no repeats, but requiring
a working break, initial code just called the loop handler for
all non-if statements, but this caused a regression in
tests/shaders/dead-code-break-interaction.shader_test.
So I had to add further code to detect if all the departure
nodes are empty and avoid generating an empty loop for that case.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=86089
Cc: "10.4" <[email protected]>
Reviewed-By: Glenn Kennard <[email protected]>
Signed-off-by: Dave Airlie <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
|
|
|
| |
Signed-off-by: Rob Clark <[email protected]>
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
This doesn't reschedule much currently, just tries to fit things into the
regfile A/B write-versus-read slots (the cause of the improvements in
shader-db), and hide texture fetch latency by scheduling setup early and
results collection late (haven't performance tested it). This
infrastructure will be important for doing instruction pairing, though.
shader-db2 results:
total instructions in shared programs: 61874 -> 59583 (-3.70%)
instructions in affected programs: 50677 -> 48386 (-4.52%)
|
|
|
|
| |
This is actually implicitly handled by the TLB operations.
|
|
|
|
|
| |
Prevents a regression with QPU scheduling, which happens to put the no-op
reads for unused VPM contents end up at the end of the program.
|
|
|
|
|
|
|
| |
We're supposed to be checking that nothing else writes r4, which is done
by the TMU result collection signal, not the coordinate setup.
Avoids a regression when QPU instruction scheduling is introduced.
|