| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Same as I did for V3D, drop all this code trying to GC the
non-indirectly-loaded uniforms from the UBO that's used for indirect
access of gallium cb[0]. While it does successfully drop some of those,
it came at the cost of uploading the VS's indirect unifroms twice, for the
bin and render versions of the shader.
With the UBO loads simplified, I was also able to easily backport V3D's
change to pack a UBO offset into the uniform_data[] field so that we don't
need to do the add of the uniform base in the shader.
As a bonus, now vc4 doesn't depend on mesa/st type_size functions.
total uniforms in shared programs: 25514 -> 25490 (-0.09%)
total instructions in shared programs: 77019 -> 76836 (-0.24%)
|
|
|
|
| |
I'm going to extend how UBO0 works in a moment.
|
|
|
|
| |
Similar to what I did for V3D, provide some description of the uniforms.
|
|
|
|
|
| |
We're going to use UBO loads for implementing YUV linear-to-T-format
blits.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I've been doing this inside of vc4, but vc5 wants it as well and it may be
useful for other drivers (Intel has a related path for pre-gen6 with MRT,
and freedreno had a TGSI path for it at one point).
This required defining a common enum for the standard comparison
functions, but other lowering passes are likely to also want that enum.
v2: Add to meson.build as well.
Acked-by: Rob Clark <[email protected]>
|
|
|
|
|
|
| |
This reverts commit 292c24ddac5acc35676424f05291c101fcd47b3e. It broke a
lot of GLES2 deqp, and I see at least one problem that will require some
serious rework to fix.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reduces register pressure in both types of shaders, by reordering the
input loads from the var->data.driver_location order to whatever order
they appear first in the NIR shader. These instructions aren't
reorderable at our QIR scheduling level because the FS takes two in
lockstep to do an interpolation, and the VS takes multiple read
instructions in a row to get a whole vec4-level attribute read.
shader-db impact:
total instructions in shared programs: 76666 -> 76590 (-0.10%)
instructions in affected programs: 42945 -> 42869 (-0.18%)
total max temps in shared programs: 9395 -> 9208 (-1.99%)
max temps in affected programs: 2951 -> 2764 (-6.34%)
Some programs get their max temps hurt, depending on the order that the
load_input intrinsics appear, because we end up being unable to copy
propagate an older VPM read into its only use.
|
|
|
|
|
| |
This will be used for delaying our VPM reads (which must be unconditional)
until just before they're used.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The kernel will reject our shader if we emit one here, and having 4, 8, or
12 as the top end of our UBO clamp rare is enough that it's not worth
making the kernel let us.
Fixes piglit fs-const-array-of-struct and
fs-const-array-of-struct-of-array since recent GLSL linking changes made
us get this as an indirect load of a uniform, instead of a tempoary.
Cc: "13.0 17.0" <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
This isn't as complete as I would like (can't merge interpolation because
of the implicit r5 dependency, doesn't work with control flow), but this
was cheap and easy.
Improves 3DMMES Taiji performance by 1.15353% +/- 0.299896% (n=29, 16)
total instructions in shared programs: 99810 -> 99059 (-0.75%)
instructions in affected programs: 10705 -> 9954 (-7.02%)
|
|
|
|
|
| |
The dead code elimination wants it to be safe, and I actually got
segfaults due to it being unsafe with the new coalescing pass.
|
|
|
|
|
|
| |
The VPM write logic will be basically the same as the texture coordinate
write logic we need, and it's not really related to the VPM read logic
other than the reuse of the use_count array.
|
|
|
|
|
| |
For now we're still just generating MOVs, but this will let us fold into
other ops in the future. No difference on shader-db.
|
|
|
|
|
|
| |
Every caller was dereffing the qinst, and this will let us make the number
of sources vary depending on the destination of the qinst so that we can
have general ALU ops that store to tex_[strb] and get an implicit uniform.
|
|
|
|
|
|
| |
This may have made a tiny bit of sense when we had one 4-arg inst per
shader, but if we only ever put 2 things in, having a pointer to 2 things
almost every instruction is pointless indirection.
|
|
|
|
|
| |
This was used originally for unorm4x8 packs, but we now represent those as
a series of packed movs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
My thought in having both arguments conditionally moved was that it should
theoretically save some power by not doing work in those channels.
However, it ends up costing us instructions because we can't
register-coalesce the first of the MOVs, and it also introduces extra
scheduling dependencies. The instruction cost would swamp whatever power
benefit I was hoping for.
shader-db results:
total instructions in shared programs: 100548 -> 99741 (-0.80%)
instructions in affected programs: 42450 -> 41643 (-1.90%)
With obvious outliers removed (I had an X11 emacs running over the network
in the "after" case), 3DMMES Taiji showed 1.07231% +/- 0.488241% fps
improvement (n=18, 30).
|
|
|
|
|
|
| |
We don't allow the last thread switch to be inside control flow, to be
sure that we hit the last state exactly once. If the last texturing was
in control flow, fall back to single threaded.
|
|
|
|
|
|
| |
We have two major requirements: Make sure that only the bottom half of the
physical reg space is used, and make sure that none of our values are live
in an accumulator across a switch.
|
|
|
|
|
|
|
| |
This will eventually be generated at the QIR level, so that
vc4_qir_schedule.c can arrange the separation of tex_strb from tex_result
correctly. It will also be important so that register allocation set the
register classes appropriately for values that are live across the switch.
|
|
|
|
|
|
|
|
|
| |
It's much better to just skip the draw call entirely. Getting this
information out of register allocation will also be useful for
implementing threaded fragment shaders, which will need to retry
non-threaded if RA fails.
Cc: <[email protected]>
|
|
|
|
|
|
|
|
|
| |
If a conditional assignment is only conditioned on the exec mask, that's
still screening off the value in the executed channels (and, since we're
not storing to the unexcuted channels, we don't care what's in there).
Fixes a bunch of extra register pressure on Processing's Ribbons demo,
which is failing to allocate.
|
|
|
|
|
| |
I missed this while adding loop support because the discard test inside a
loop was crashing before, anyway. Fixes piglit glsl-fs-discard-04.
|
|
|
|
| |
Extracted from a patch by jonasarrow on github.
|
|
|
|
|
| |
Extracted and fixed up from a patch by jonasarrow on github. This ended
up not getting used for ddx/ddy, but seems like it might still be useful.
|
|
|
|
|
| |
This will be used in the ddx/ddy support for "Am I the top half?" or "Am I
the left half?" checks.
|
|
|
|
| |
Fixes glsl-routing in piglit and hangs in glbenchmark 2.0.2.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the GLSL-to-NIR conversion of VC4, I had a bit of trouble with what I
was calling the "state uniforms" that I was putting into the NIR fighting
with its other lowering passes. Instead of using magic uniform base
numbers in the backend, follow the lead of load_user_clip_plane and just
define system values for them.
v2: Fix unintended change to channel_num, drop unspecified const_index
value on blend_const_color_r_float.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't want to bake the whole array into the FS key, because of the
hashing overhead. But we can keep a set of the arrays seen, and use a
pointer to the copy in as the array's proxy.
Between this and the previous patch, gl-1.0-blend-func now passes on
hardware, where previously it was filling the 256MB CMA area with shaders
and OOMing.
Drops 712 shaders from shader-db.
|
|
|
|
|
|
|
|
|
|
| |
We were baking in the LOD of the source level to each shader. Instead,
pass it in as a uniform -- this requires storing it to a temp register,
but that's better than compiling a ton of separate shaders:
total instructions in shared programs: 115032 -> 115036 (0.00%)
instructions in affected programs: 96 -> 100 (4.17%)
LOST: 572
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To support general GL_TEXTURE_BASE_LEVEL we have to copy to a temporary
miptree. However, if a single level is being selected, we can use the
existing miptree and force all the sampling to be from that particular
level.
This avoids a ton of software fallbacks in glGenerateMipmaps(), which uses
base levels in the blit implementation in gallium. Improves "glmark2 -b
terrain" from 2 fps to 3 (perhaps some more precision would be useful?),
and cuts its CPU usage during the benchmarking from ~30% to ~10% (total
CPU time from 8.8s to 7.6s).
|
|
|
|
|
|
|
|
| |
If a block might be entered from multiple locations, then the uniform
stream will (probably) be at different points, and we need to make sure
that it's pointing where we expect it to be. The kernel also enforces
that any block reading a uniform resets uniforms, to prevent reading
outside of the uniform stream by using looping.
|
|
|
|
| |
For now we don't fill the delay slots, and instead just drop in NOPs.
|
|
|
|
| |
We'll want to schedule them individually, to handle delay slots.
|
| |
|
|
|
|
|
|
|
| |
Previously, there were occasionally NIR registers in our programs, but
they were always actually used SSA-only. Now that we're trying to support
control flow, we need to actually conditionally move to registers based on
whether channels are active or not.
|
|
|
|
|
|
| |
This uses the branch condition code in inst->cond to jump to either
successor[0] (condition matches) or successor[0] (condition doesn't
match).
|
|
|
|
|
| |
Right now our CFG is always a trivial single basic block, but that will
change when enable loops.
|
|
|
|
|
|
|
| |
The optimization passes and scheduling aren't actually ready for multiple
blocks with control flow yet (as seen by the "cur_block" references in
them instead of iterating over blocks), but this creates the structures
necessary for converting them.
|
|
|
|
|
|
|
|
| |
We have the prior list_foreach() all over the code, but I need to move
where instructions live as part of adding support for control flow. Start
by just converting to a helper iterator macro. (The simpler
"qir_for_each_inst()" will be used for the for-each-inst-in-a-block
iterator macro later)
|
|
|
|
|
|
| |
ALU0 didn't have the _dest variant, and ALU2 didn't unset the def the way
ALU1 did. This should make the ALU[012] macros much clearer, by moving
most of their contents to vc4_qir.c
|
|
|
|
|
|
|
|
|
| |
The DCE pass is going to change significantly to handle control flow,
while we don't really need to change it for the SF handling. We also need
to add some more SF peephole optimization for SF updates generated by
control flow support.
No change on shader-db.
|
|
|
|
|
| |
This isn't used since we switched to using the dst.pack field instead of
custom instructions.
|
|
|
|
|
| |
This gets us precompile of vertex shaders at the state tracker level as
well.
|
|
|
|
|
|
|
| |
This will be used for resetting the uniform stream in the presence of
branching, but may also be useful as an optimization to reduce how many
uniforms we have to copy out per draw call (in exchange for increasing
icache pressure).
|
|
|
|
|
| |
This has caught a couple of bugs during loop development so far, and I
should probably have written it long ago.
|
|
|
|
| |
Prevents a bug in the later control-flow support series.
|
|
|
|
|
| |
This should get us the same decode code generated, but with a lot less
custom code in the driver.
|
|
|
|
|
|
| |
It's not doing anything according to shader-db now that we're using NIR.
It would have had to be reworked significantly anyway, to handle control
flow.
|
|
|
|
|
|
| |
We were generating piles of FRAG_W for interpolation, only to CSE them
away immediately. Since this is the only thing that CSE is doing for us
any more, just avoid making the CSE work necessary.
|