| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
Acked-by: Kenneth Graunke <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Eric Engestrom <[email protected]>
Acked-by: Marek Olšák <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
Acked-by: Bas Nieuwenhuizen <[email protected]>
Acked-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
| |
v2: remove & operator in a couple of memsets
add some memsets
v3: fixup lima
Signed-off-by: Karol Herbst <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]> (v2)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We can use the same register spilling infrastructure for our loads/stores
of indirect access of temp variables, instead of doing an if ladder.
Cuts 50% of instructions and max-temps from 2 KSP shaders in shader-db.
Also causes several other KSP shaders with large bodies and large loop
counts to not be force-unrolled.
The change was originally motivated by NOLTIS slightly modifying register
pressure in piglit temp mat4 array read/write tests, triggering register
allocation failures.
|
|
|
|
|
| |
We were missing a * 4 even if the particular hardware matched our
assumption.
|
| |
|
|
|
|
|
| |
This code is so touchy, trying to emit the minimum amount of address math.
Some day we'll move it all to NIR, I hope.
|
|
|
|
|
|
|
|
| |
While waiting for the CSD UABI to get reviewed, I keep having to rebase
the CS patch. Just land the compiler side for now to keep it from
diverging.
For now this covers just GLES 3.1 compute shaders, not CL kernels.
|
|
|
|
|
|
|
|
|
| |
We're using ARB_debug_output for the main shader-db, but I had this env
var left around from the shader-db-2 support (vc4 apitrace-based). Keep
the env var around since it's nice sometimes to get the stats on a shader
you're optimizing without having to do a shader-db run, but drop the old
formatting that's not useful and keeps tricking me when I go to add
another measurement to the shader-db output.
|
|
|
|
|
| |
This gives us finer-grained feedback on how we're doing on register
pressure than "did we trigger a new shader to spill or drop thread count?"
|
|
|
|
|
| |
A shader invocation always executes 16 channels together, so we often end
up multiplying things by this magic 16 number. Give it a name.
|
|
|
|
|
|
|
| |
This required to calculate sizes correctly when we have bindless
samplers/images.
Reviewed-by: Marek Olšák <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Our exec masking introduces lots of redundant flags updates, and even
without that there will be cases where NIR comparisons on the same sources
for different reasons may generate the same comparison instruction before
the selection.
total instructions in shared programs: 6492930 -> 6460934 (-0.49%)
total uniforms in shared programs: 2117460 -> 2115106 (-0.11%)
total spills in shared programs: 4983 -> 4987 (0.08%)
total fills in shared programs: 6408 -> 6416 (0.12%)
|
|
|
|
|
|
|
|
|
| |
We have a pass to lower global registers to locals and many drivers
dutifully call it. However, no one ever creates a global register ever
so it's all dead code. It's time we bury it.
Acked-by: Karol Herbst <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
| |
Signed-off-by: Karol Herbst <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
|
|
|
| |
The T/G shader references and common state will be needed for GLES 3.2.
|
|
|
|
|
|
|
| |
4.1 and 4.2 both have the same 16k limit, but it I'm seeing GPU hangs in
the CTS at 8k and 16k. 4k at least lets us get one 4k display working.
Cc: [email protected]
|
|
|
|
| |
These are more vc4 leftovers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The idea was that we could skip uploading the constant-indexed uniform
data and just upload the uniforms that are variably-indexed. However,
since the VS bin and render shaders may have a different set of uniforms
used, this meant that we had to upload the UBO for each of them. The
first case is generally a fairly small impact (usually the uniform array
is the most space, other than a couple of FSes in shader-db), while the
second is a larger impact: 3DMMES2 was uploading 38k/frame of uniforms
instead of 18k.
Given that the optimization is of dubious value, has a big downside, and
is quite a bit of code, just drop it. No change in shader-db. No change
on 3DMMES2 (n=15).
|
|
|
|
|
|
|
|
|
|
| |
We'd end up with the constant offset in the uniform stream anyway, since
they're bigger than small immediates. Avoids the extra uniforms and adds
in the shader in favor of just adding once on the CPU.
shader-db:
total instructions in shared programs: 6496865 -> 6494851 (-0.03%)
total uniforms in shared programs: 2119511 -> 2117243 (-0.11%)
|
|
|
|
|
|
| |
I want to reuse this for encoding small constant UBO/SSBO offsets into the
uniform stream to reduce the extra uniform loads and adds for the small
constant offsets.
|
|
|
|
|
|
| |
Noticed while trying to get a CTS run again.
Fixes: 33886474d646 ("v3d: Use the DAG datastructure for QPU instruction scheduling.")
|
|
|
|
| |
Just a small code reduction from shared infrastructure.
|
| |
|
|
|
|
|
|
| |
You usually want to go find the highest pressure and figure out why you
couldn't spill or what pattern led to a bunch of pressure leading to that
point.
|
|
|
|
|
| |
We now have NIR dead code eliminating our VPM reads, so this shouldn't be
necessary.
|
|
|
|
| |
We can just use the magic register file like we do for other magic waddrs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The idea is that for repeated use of the same uniform, we could avoid
loading it on each consumer. The results look pretty good.
total instructions in shared programs: 6413571 -> 6521464 (1.68%)
total threads in shared programs: 154214 -> 154000 (-0.14%)
total uniforms in shared programs: 2393604 -> 2119629 (-11.45%)
total spills in shared programs: 4960 -> 4984 (0.48%)
total fills in shared programs: 6350 -> 6418 (1.07%)
Once we do scheduling at the NIR level, the register pressure (and thus
also instructions) issues we see here will drop back down.
|
|
|
|
|
| |
On V3D 4.x, we can use ldunifrf to load uniforms to any register, and this
will let us schedule the ldunif wherever we want in the program.
|
|
|
|
| |
This seems to be left over from vc4, and I don't use them any more.
|
|
|
|
|
| |
We can load a uniform to any register, so add support for non-ALU
instructions with sig.ldunif to a temp.
|
|
|
|
|
| |
I'm not sure why I didn't do this before -- it's clearly much simpler to
add dumping of the extra thing than to have it as another implicit source.
|
|
|
|
|
|
|
|
|
| |
This feels like the right tradeoff for threads vs uniforms, particularly
given that we often have very short thread segments right now:
total instructions in shared programs: 6411504 -> 6413571 (0.03%)
total threads in shared programs: 153946 -> 154214 (0.17%)
total uniforms in shared programs: 2387665 -> 2393604 (0.25%)
|
|
|
|
|
|
|
| |
On each iteration of successfully spilling a reg, we'd allocate another
copy of temp_registers, and when decrementing thread conut we'd allocate
another copy of the graph. These all got cleaned up on freeing the
compile.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This lets us emit the VPM_WRITEs directly from
nir_intrinsic_store_output() (useful once NIR scheduling is in place so
that we can reduce register pressure), and lets future NIR scheduling
schedule the math to generate them. Even in the meantime, it looks like
this lets NIR DCE some more code and make better decisions.
total instructions in shared programs: 6429246 -> 6412976 (-0.25%)
total threads in shared programs: 153924 -> 153934 (<.01%)
total loops in shared programs: 486 -> 483 (-0.62%)
total uniforms in shared programs: 2385436 -> 2388195 (0.12%)
Acked-by: Ian Romanick <[email protected]> (nir)
|
|
|
|
|
| |
This appears to be just what the opcode does. Needed for equivalence when
moving FF VPM stores into NIR.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In our backend, the successor edges from the blocks only point to where
QPU control flow goes, not where the notional control flow goes from a
"break" or "continue" modifying the execution mask to resume writing to
some channels later. As a result, this attempt at restricting live ranges
ended up missing the live range of a value where a conditional
break/continue was present in a loop before the later def of a variable.
The previous commit ended up fixing the problem that the flag tried to
solve.
Fixes glsl-vs-loop-continue.shader_test and/or
glsl-vs-loop-redundant-condition.shader_test based on register allocation
results.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the backend, we often have condition codes on writes to variables, such
that there's no screening def anywhere and the previous live ranges
algorithm would conclude that the start of the range extends to the start
of the program. However, we do know that the live range can only extend
as early as you can reach from all blocks writing to the variable.
The motivation was that, while we have a couple of hacks to try to promote
conditional writes up to being a def within the block, the exec_mask one
was broken and needed a replacement.
Based on c3c1aa5aeb92 ("intel/fs: Restrict live intervals to the subset
possibly reachable from any definition.").
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we have a MOV of a uniform value available to spill, that's one of our
best choices. We can just not spill the value, and emit a new load of the
uniform as the fill. This saves bothering the TMU and the thrsw, and is
the same cost in uniforms (since the spill offset is a uniform anyway).
This doesn't have a huge impact on shader-db, since there aren't a whole
lot of spills and we usually copy-prop the uniforms at the VIR level such
that the only uniform MOVs are from vir_lower_uniforms:
total instructions in shared programs: 6430292 -> 6430279 (<.01%)
total uniforms in shared programs: 2386023 -> 2385787 (<.01%)
total spills in shared programs: 4961 -> 4960 (-0.02%)
total fills in shared programs: 6352 -> 6350 (-0.03%)
However, I'm interested in dropping the uniforms copy-prop in the backend,
since it would be cheaper to not load repeated uniforms if we have the
registers to spare. This also saves many spills on
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.20, which is what
motivated a bunch of my recent backend work in the first place:
before: 46 spills, 106 fills, 3062 instructions
after: 0 spills, 0 fills, 2611 instructions
|
|
|
|
|
| |
Spilling is unusual, but one often has to debug it when it happens, so
dump it.
|
|
|
|
|
| |
There are no users at the moment, but I wanted to start using this in
register spilling.
|
|
|
|
|
| |
This lets us save a resolve to NIR true/false for ifs and discard_if. No
change in shader-db.
|
|
|
|
|
|
| |
One program affected in my shader-db.
instructions in affected programs: 110 -> 108 (-1.82%)
|
|
|
|
|
|
|
|
|
|
|
| |
For V3D 3.x, we emitted the ldvpms all at the top so that we didn't need
to do VPM setup when the load_inputs are out of order. For V3D 4.x, we
can reduce register pressure by delaying our loads until they're actually
needed. This also avoids a bunch of silly MOVs in the pre-opt VIR dump.
total instructions in shared programs: 6421415 -> 6419933 (-0.02%)
total uniforms in shared programs: 2393139 -> 2393140 (<.01%)
total threads in shared programs: 153864 -> 153906 (0.03%)
|
|
|
|
|
| |
It's unused in the VS (since we need vattr_sizes[] anyway), so move it to
FS prog data.
|
|
|
|
|
| |
This is what pointed out that we were misusing the check for last_thrsw in
the previous commit.
|
|
|
|
|
|
|
|
|
| |
The execute.file check used to be good enough, until I stopped setting up
the execute mask for uniform ifs.
No known tests fixed, noticed while doing a refactor.
Fixes: 080506057310 ("v3d: Handle dynamically uniform IF statements with uniform control flow.")
|
|
|
|
|
|
| |
Now that we don't have the vir_PF() magic, it's obvious that we were doing
the wrong thing for f2b32 by allowing -0.0 to produce true instead of
false.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
You were allowed to pass in any old temp so that you could hopefully fold
the PF up into the def of the temp. If we couldn't find one, it
implicitly generated a MOV(nop, reg). However, that PF could have
different behavior depending on whether the def being folded into was a
float or int opcode, which the caller doesn't necessarily control.
Due to the fragility of the function, just switch all callers over to
vir_set_pf(). This also encourages the callers to use a _dest call for
the inst they're putting the PF on, eliminating a bunch of temps in the
pre-optimization VIR.
shader-db says the change is in the noise:
total instructions in shared programs: 6226247 -> 6227184 (0.02%)
instructions in affected programs: 851068 -> 852005 (0.11%)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Turns this minimal conditional discard (glsl-fs-discard-01.shader_test):
0x3de0b086c5fe9000 fcmp.pushn -, r1, r5; mov r2, 0
0x3dec3086bbfc001f nop ; mov.ifa r2, -1
0x3c047186bbe80000 nop ; mov.pushz -, r2
0x3dea3186ba837000 setmsf.ifna -, 0 ; nop
into:
0x3c00b186c582a000 fcmp.pushn -, r2, r5; nop
0x3de83186ba837000 setmsf.ifa -, 0 ; nop
total instructions in shared programs: 6229820 -> 6226247 (-0.06%)
|
|
|
|
|
|
|
| |
Both were doing the same thing to try to get a condition to predicate on.
Noticed when I wanted to do this for discard_if as well.
No change in shader-db.
|