| Commit message | Author | Age | Files | Lines |
|
This file will contain optimization passes for both vpm reads
and writes.
Signed-off-by: Varad Gautam <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
This is a rewrite of vc4_opt_qpu_schedule.c to operate on QIR. Texture
fetch can probably take as much as the rest of the cycles of the program,
so it's important to hide our other cycles during it (which is hard to do
after register allocation). Also, we can queue up multiple texture
requests before collecting the resulting samples, so that we keep the
texture unit busy more of the time.
High-settings openarena performance +2.35849% +/- 0.221154% (n=7). Also
about 2-3% on the multiarb demo. 8 piglit tests
(ext_framebuffer_multisample accuracy depthstencil) go from failing in
rendering to failing in register allocation, but hopefully I can fix that
up with some better register pressure handling here.
total instructions in shared programs: 87723 -> 88448 (0.83%)
instructions in affected programs: 78411 -> 79136 (0.92%)
total estimated cycles in shared programs: 276583 -> 246306 (-10.95%)
estimated cycles in affected programs: 265691 -> 235414 (-11.40%)
|
This is the core of ARB_texture_multisample. Most of the piglit tests for
GL_ARB_texture_multisample require GL 3.0, but exposing support for this
lets us use the gallium blitter for multisample resolves. We can
sometimes multisample resolve using just the RCL, but that requires that
the blit is 1:1, unflipped, and aligned to tile boundaries.
|
This massively reduces our dependency on VC4-specific optimization passes.
shader-db:
total uniforms in shared programs: 32077 -> 32067 (-0.03%)
uniforms in affected programs: 149 -> 139 (-6.71%)
total instructions in shared programs: 98208 -> 98182 (-0.03%)
instructions in affected programs: 2154 -> 2128 (-1.21%)
|
For now, this just splits up store_output intrinsics to be scalars, and
drops unused outputs in the coordinate shader. My goal is to be able to
drop a bunch of my VC4-specific optimization by letting NIR handle it.
|
The rest of vc4_program.c is about compiling, while this is about
uniform emit at draw time.
|
There weren't that many variations of RCL generation, and this lets us
skip all the in-kernel validation for what we generated.
|
I want to notice discrepancies when I diff -u between Mesa and the kernel.
|
Just because we put the source in a subdir doesn't mean we need helper
libraries in the build. This will also simplify the Android build setup.
|
There will be other blit code showing up, and it seems like the place
you'd look.
|
I want to be able to have multiple jobs being set up at the same time (for
example, a render job to do a little fixup blit in the course of doing a
render to the main FBO).
|
This cleans up some pointless operations generated by the in-driver mul24
lowering (commonly generated by making a vec4 index for a matrix in a
uniform array).
I could fill in other operations, but pretty much anything else ought to
be getting handled at the NIR level, I think.
total uniforms in shared programs: 13423 -> 13421 (-0.01%)
uniforms in affected programs: 346 -> 344 (-0.58%)
|
This lets us more intelligently decide which uniform values should be put
into temporaries, by choosing the most reused values to push to temps
first.
total uniforms in shared programs: 13457 -> 13433 (-0.18%)
uniforms in affected programs: 1524 -> 1500 (-1.57%)
total instructions in shared programs: 40198 -> 40019 (-0.45%)
instructions in affected programs: 6027 -> 5848 (-2.97%)
I noticed this opportunity because with the NIR work, some programs
happened to make different uniform copy propagation choices that
significantly increased instruction counts.
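
The "most reused first" choice above can be sketched roughly as follows.
This is only an illustration, not the driver's code: struct uniform_use and
pick_uniforms_for_temps() are made-up names, and the rule that a uniform
must be read more than once to be worth a temporary is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: one uniform and how many instructions read it. */
struct uniform_use {
        unsigned index; /* position in the uniform stream */
        unsigned uses;  /* number of reads in the program */
};

/* qsort comparator: most-used first, ties broken by lower index. */
static int
compare_uses(const void *a, const void *b)
{
        const struct uniform_use *ua = a, *ub = b;

        if (ua->uses != ub->uses)
                return ua->uses > ub->uses ? -1 : 1;
        if (ua->index != ub->index)
                return ua->index < ub->index ? -1 : 1;
        return 0;
}

/* Sorts u[] in place and returns how many leading entries should be moved
 * into temporaries: at most max_temps, and (by assumption) only uniforms
 * that are read more than once.
 */
static size_t
pick_uniforms_for_temps(struct uniform_use *u, size_t count, size_t max_temps)
{
        size_t n = 0;

        assert(u != NULL || count == 0);
        qsort(u, count, sizeof(*u), compare_uses);
        while (n < count && n < max_temps && u[n].uses > 1)
                n++;
        return n;
}
```

Sorting once up front means a limited temp budget gets spent on the values
that save the most uniform reads.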
|
total instructions in shared programs: 41168 -> 40976 (-0.47%)
instructions in affected programs: 18156 -> 17964 (-1.06%)
|
Small immediates have the downside of taking over the raddr B field, so
you might have less chance to pack instructions together thanks to raddr B
conflicts. However, it also reduces some register pressure since it lets
you load 2 "uniform" values in one instruction (avoiding a previous load
of the constant value to a register), and increases some pairing for the
same reason.
total uniforms in shared programs: 16231 -> 13374 (-17.60%)
uniforms in affected programs: 10280 -> 7423 (-27.79%)
total instructions in shared programs: 40795 -> 41168 (0.91%)
instructions in affected programs: 25551 -> 25924 (1.46%)
In a previous version of this patch I had a reduction in instruction count
by forcing the other args alongside a SMALL_IMM to be in the A file or
accumulators, but that increases register pressure and had a bug in
handling FRAG_Z. In this patch I just use raddr conflict resolution,
which is more expensive. I think I'd rather tweak allocation to have some
way to slightly prefer good choices for files in general, rather than risk
failing to register allocate by forcing things into register classes.
|
This doesn't reschedule much currently, just tries to fit things into the
regfile A/B write-versus-read slots (the cause of the improvements in
shader-db), and hide texture fetch latency by scheduling setup early and
results collection late (haven't performance tested it). This
infrastructure will be important for doing instruction pairing, though.
shader-db2 results:
total instructions in shared programs: 61874 -> 59583 (-3.70%)
instructions in affected programs: 50677 -> 48386 (-4.52%)
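
A rough illustration of the regfile constraint mentioned above: a value
written to physical regfile A or B can't be read back by the very next
instruction, so the scheduler wants something unrelated in that slot. The
sketch below only counts how many adjacent pairs in a given order would
violate that. struct toy_qpu_inst and toy_count_write_read_stalls() are
made-up names, not the driver's data structures, and the model ignores
accumulators, pairing, and everything else a real scheduler has to track.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_REG (-1)

/* Hypothetical, heavily simplified instruction: one regfile write and up
 * to two regfile reads.
 */
struct toy_qpu_inst {
        int waddr;    /* regfile register written, or TOY_NO_REG */
        int raddr[2]; /* regfile registers read, or TOY_NO_REG */
};

/* Counts adjacent pairs where the second instruction reads the register
 * the first one just wrote. Each such pair would need a NOP (or some other
 * instruction) between them.
 */
static size_t
toy_count_write_read_stalls(const struct toy_qpu_inst *insts, size_t count)
{
        size_t stalls = 0;

        assert(insts != NULL || count == 0);
        for (size_t i = 0; i + 1 < count; i++) {
                int written = insts[i].waddr;

                if (written == TOY_NO_REG)
                        continue;
                if (insts[i + 1].raddr[0] == written ||
                    insts[i + 1].raddr[1] == written)
                        stalls++;
        }
        return stalls;
}
```

Reordering independent instructions so that this count drops is the kind of
win that shows up as fewer instructions in shader-db.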
|
Our submits now return immediately and you have to manually wait for
things to complete if you want to (like a normal driver).
|
The kernel files are built into a separate static library and
all the functions that require it are already wrapped in ifdef
USE_VC4_SIMULATOR. Don't forget the header file :)
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
|
Now this whole setup matches the kernel's file layout much more closely.
|
We have to expose them for GL 2.0, but we just always return a value of 0.
We should be advertising 0 query bits instead of 64, but gallium doesn't
have plumbing for that yet. At least this stops the segfaults.
|
This allows for introducing dead code elimination of uniforms, copy
propagation of uniforms, and instruction rescheduling between instructions
that both read uniforms.
|
I'm going to be rewriting it all, and having it mixed up with the
QIR-to-QPU opcode translation was messy.
|
- include all headers in Makefile.sources
Cc: Eric Anholt <[email protected]>
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Acked-by: Matt Turner <[email protected]>
|
Debugging a regression in discard support was just too full of duplicate
instructions, so I decided to remove them instead of re-analyzing each of
them as I dumped their outputs in simulation.
|
Now that tiling is in place, we can expose the other formats. Depth is
still broken (need to make changes in the shader), but if you don't expose
it, things crash all over. SNORM is dropped, but we could re-add it later
with some shader fixes to handle converting between [0,1] and [-1,1].
|
This still treats everything as RGBA8888 for the most part, same as
before. This is a prerequisite for handling other texture formats, since
only RGBA8888 has a raster-layout mode.
|
This required building a shader parser that would walk the program to find
where the texturing-related uniforms are in the uniforms stream.
Note that as of this commit, a new kernel is required for rendering on
actual VC4 hardware (currently that commit is named "drm/vc4: Introduce
shader validation and better command stream validation.", but is likely to
be squashed as part of an eventual merge of the kernel driver).
|
This ensures that when I'm using the simulator, I get a closer match to
what behavior on real hardware will be. It lets me rapidly iterate on the
kernel validation code (which otherwise has a several-minute turnaround
time), and helps catch buffer overflow bugs in the userspace driver
faster.
|
We put in a bunch of extra MOVs for program outputs, and this can clean
those up. We should do uniforms, too, though.
v2: Fix missing flagging of progress when we actually optimize. Caught by
Aaron Watry.
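
For reference, the shape of a pass like this, including the progress
flagging that v2 fixes, looks roughly like the sketch below. It runs on a
made-up toy IR (struct toy_inst, toy_copy_propagate()), not on QIR, and it
assumes SSA form so that each temporary is written exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_op { TOY_MOV, TOY_ADD };

/* Hypothetical instruction: dst and src[] are temporary numbers. */
struct toy_inst {
        enum toy_op op;
        int dst;
        int src[2]; /* src[1] is unused (-1) for TOY_MOV */
};

/* For each MOV, rewrites later readers of its destination to read the
 * MOV's source directly. Returns true if any operand was rewritten, so the
 * caller knows to run its optimization loop again.
 */
static bool
toy_copy_propagate(struct toy_inst *insts, size_t count)
{
        bool progress = false;

        assert(insts != NULL || count == 0);
        for (size_t i = 0; i < count; i++) {
                if (insts[i].op != TOY_MOV)
                        continue;

                for (size_t j = i + 1; j < count; j++) {
                        for (int s = 0; s < 2; s++) {
                                if (insts[j].src[s] == insts[i].dst) {
                                        insts[j].src[s] = insts[i].src[0];
                                        progress = true;
                                }
                        }
                }
        }
        return progress;
}
```

Forgetting to set progress means the driver's optimization loop can stop
before the now-dead MOVs get cleaned up by later passes.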
|
This cleans up a bunch of noise in the compiled coordinate shaders (since
we don't need the varying outputs), and also from writemasked instructions
with negated src operands.
|
There was a lot of extra noise in my piglit shader dumps because of silly
CMPs.
|
This introduces an IR (QIR, for QPU IR) to do optimization on. It's a
scalar, SSA IR in general. It looks like optimization is pretty easy this
way, though I haven't figured out if it's going to be good for our weird
register allocation or not (or if I want to reduce to basically QPU
instructions first), and I've got some problems with it having some
multi-QPU-instruction opcodes (SEQ and CMP, for example) which I probably
want to break down.
Of course, this commit mostly doesn't work, since many other things are
still hardwired, like the VBO data.
v2: Rewrite to use a bunch of helpers (qir_OPCODE) for emitting QIR
instructions into temporary values, and make qir_inst4 take the 4 args
separately instead of an array (all later callers wanted individual
args).
|
This mostly just takes every draw call and turns it into a sequence of
commands that clear the FBO and draw a single shaded triangle to it,
regardless of the actual input vertices or shaders. I copied the initial
driver skeleton mostly from freedreno, and I've preserved Rob Clark's
copyright for those. I also based my initial hardcoded shaders and
command lists on Scott Mansell (phire)'s "hackdriver" project, though the
bit patterns of the shaders emitted end up being different.
v2: Rebase on gallium megadrivers changes.
v3: Rebase on PIPE_SHADER_CAP_MAX_CONSTS change.
v4: Rely on simpenrose actually being installed when building for
simulation.
v5: Add more header duplicate-include guards.
v6: Apply Emil's review (protection against vc4 sim and ilo at the same
time, and dropping the dricommon drm bits) and fix a copyright header
(thanks, Roland)
|