| |
v2: Actually switch on the other math instructions mentioned in the
comment.
v3: Add timing data for textureSize(), and clean up some long comment
lines.
Testing shader_time of fs16 shaders on a few frames of various apps:
nexuiz improved by 2.9% +/- 1.5% (n=10)
no difference on GLB2.5 (n=36, outliers removed)
no difference on GLB2.7 (n=25)
etqw improved by 2.6% +/- 2.2% (n=25)
no difference on lightsmark (n=25)
Acked-by: Kenneth Graunke <[email protected]>
| |
I've tested this to be true with various ALU ops on gen7 (with the
exception of MADs, which go at either 3 or 4 cycles per dispatch).
Acked-by: Kenneth Graunke <[email protected]>
| |
For gen7 everything changes, and we have actual information on latency.
Acked-by: Kenneth Graunke <[email protected]>
| |
This gives the instruction scheduler a chance to schedule between the
loads, whereas before it was restricted due to the dependencies between
the MRFs for setting them up.
For one shader in gles3conform, it goes from getting stuck in register
allocation for as long as anybody's bothered to leave it running, down
to 23 seconds, thanks to the LIFO scheduling.
Acked-by: Kenneth Graunke <[email protected]>
| |
This came from an idea by Ben Segovia. 16-wide pixel shaders are very
important for latency hiding on i965, so we want to try really hard to
get them. If scheduling an instruction makes some set of instructions
available, those are probably the ones that make the instruction's
result dead. By choosing those first, we'll have a tendency to reduce
the amount of live data as opposed to creating more.
Previously, we were sometimes getting this behavior out of the
scheduler, which was what produced the scheduler's original performance
wins on lightsmark. Unfortunately, that was mostly an accident of the
lame instruction latency information that I had, which made it
impossible to fix the actual scheduling for performance. Now that we've
fixed the scheduling for setup for register allocation, we can safely
update the latency parameters for the final schedule.
In shader-db, we lose 37 16-wide shaders, but gain 90 new ones. 4
shaders that were spilling change how many registers spill, for a
reduction of 70/3899 instructions.
v2: Simplify the new loop.
Acked-by: Kenneth Graunke <[email protected]>
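A minimal sketch of the heuristic, with invented names and data
structures (the real scheduler's representation differs):

    #include <vector>

    struct schedule_node {
       bool just_unblocked; /* became ready when the last pick issued */
       /* ... latency, delay, children ... */
    };

    /* Prefer candidates the previous pick unblocked: they usually read
     * (and thereby kill) its result, shrinking live intervals instead
     * of starting new ones -- which is what keeps 16-wide attainable. */
    static schedule_node *
    choose_next(std::vector<schedule_node *> &ready)
    {
       for (schedule_node *n : ready) {
          if (n->just_unblocked)
             return n;
       }
       return ready.empty() ? nullptr : ready.front();
    }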
| |
Seeing when instructions become available to schedule is really useful.
Acked-by: Kenneth Graunke <[email protected]>
| |
Acked-by: Kenneth Graunke <[email protected]>
| |
Otherwise, you end up with some report from within a second of context
destroy, which is not what you really want for testing the impact of
changes.
| |
Sometimes I've got a patch for a performance optimization that's not
showing a statistically significant performance difference on reported
FPS, but still seems like a good idea because it ought to reduce time
spent in the shader. If I can see the total number of cycles spent in
the shader stage being optimized, it may show that the patch is still
worthwhile (or point out that it's actually broken in some way).
| |
Some shaders experience resets more than others, which skews the numbers
reported. Attempt to correct for this by linearly scaling according to
the number of resets that happen.
Note that this will not be accurate if invocations of shaders have varying
times and longer invocations are more likely to reset. However, this
should at least be better than the previous situation.
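A sketch of the linear correction described above, assuming a simple
scale by the fraction of invocations that survived (names invented):

    #include <stdint.h>

    /* Scale the recorded time back up as if no invocations had been
     * lost, assuming each reset discards roughly one invocation's
     * worth of accumulated time. */
    static uint64_t
    scale_for_resets(uint64_t recorded, uint64_t invocations,
                     uint64_t resets)
    {
       if (resets == 0 || resets >= invocations)
          return recorded;
       return recorded * invocations / (invocations - resets);
    }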
| |
I'm about to emit other kinds of writes besides time deltas, and it
turns out that, given the frequency of resets, we couldn't really use
the old time delta write() function more than once in a shader.
| |
Gen7 stores the JIP/UIP bits in different places.
Reviewed-by: Eric Anholt <[email protected]>
| |
try_rewrite_rhs_to_dst is a quick optimization to avoid generating new
temporaries (and MOVs from those temporaries to the dest) for every
expression tree we visit. By generating better code in simple cases, we
reduce the burden on later optimization passes like register coalescing.
Previously, we compared inst->regs_written() to lhs->vector_elements
to make sure the instruction generating our value wrote the same number
of components as our destination register.
However, this fails in some cases. One example is texturing (which
produces a vec4) into gl_FragData[i]. Technically, gl_FragData[i] is
also a vec4. However, the destination VGRF actually has size 4n (where
n is the size of the array).
split_virtual_grfs() can't split VGRFs that are used by SEND messages
which require contiguous destination registers (like texturing), and
register allocation needs all VGRFs to have sizes between 1 and 4.
Amnesia: The Dark Descent hits this case: a texturing instruction
(4 components) gets rewritten to the gl_FragData output register
(which was 4*3 = 12 components), causing the register allocator to
hit the "we rely on split_virtual_grfs" assertion.
This makes it possible to play Amnesia.
Reviewed-by: Eric Anholt <[email protected]>
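The shape of the tightened condition, as a sketch (the real code
checks against the VGRF's allocated size; these names are stand-ins):

    /* Only rewrite the RHS to the destination if the instruction
     * writes exactly the whole destination VGRF.  Matching the LHS's
     * component count alone is not enough when the VGRF backs an
     * array of size 4n. */
    static bool
    safe_to_rewrite(int regs_written, int lhs_components, int vgrf_size)
    {
       return regs_written == lhs_components &&
              regs_written == vgrf_size;
    }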
| |
I had tried this in the past, but ran into trouble with applications
that sample from undiscarded pixels in the same subspan. To fix that
issue, only jump to the end for an entire subspan at a time.
Improves GLbenchmark 2.7 (1024x768) performance by 7.9 +/- 1.5% (n=8).
v2: Drop the br variable in the jump instruction -- if I ever do jumps
pre-gen6, it'll be a different code block anyway since we don't have
HALT until gen6.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This makes much more sense on gen6+, and will also prove useful for
early exit of shaders on discard.
v2: fix up a stale comment from before converting gen4-5.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
We're going to redo discard handling to track discards in the other flag
subregister, saving instructions in the discard and allowing predicated
jumps out to the end of the shader.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
We're about to start using the f0.1 subregister.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This makes our output more consistent with other disasm tools, and
will be necessary when we start using f0.1.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Reviewed-by: Kenneth Graunke <[email protected]>
| |
We've been calling it a register number, but it's actually the
subregister, and things will get confusing once we start using it if it
Reviewed-by: Kenneth Graunke <[email protected]>
| |
There's a flag subreg nr field in bits2 next to src0.vertstride, but
there shouldn't be anything in bits3 next to src1.vertstride.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This is needed to compute render_to_fbo. It even has the comment.
NOTE: This is a candidate for stable branches.
Reviewed-by: Eric Anholt <[email protected]>
| |
This patch enables support for ETC2 compressed textures on
all Intel hardware. At present, ETC2 texture decoding is not
available on Intel hardware. So, compressed ETC2 texture data
is decoded in software and stored in a suitable uncompressed
MESA_FORMAT at the time of glCompressedTexImage2D. Currently,
ETC2 formats are only exposed in OpenGL ES 3.0.
V2: Use single etc_wraps variable for both etc1 and etc2.
V3: Remove redundant code and use just one intel_miptree_map_etc()
and intel_miptree_unmap_etc() function.
Choose MESA_FORMAT_SIGNED_{R16, GR1616} for ETC2 signed-{r11, rg11}
formats
Signed-off-by: Anuj Phogat <[email protected]>
Tested-by: Matt Turner <[email protected]>
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
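A sketch of the format choice the message describes; the signed-case
MESA_FORMAT names come from the commit itself, while the function and
the default arm are illustrative:

    #include "main/formats.h"  /* gl_format and MESA_FORMAT_* enums */

    /* Pick an uncompressed format the hardware can sample, for storing
     * software-decoded ETC2 data at glCompressedTexImage2D time. */
    static gl_format
    uncompressed_format_for_etc2(gl_format fmt)
    {
       switch (fmt) {
       case MESA_FORMAT_ETC2_SIGNED_R11_EAC:
          return MESA_FORMAT_SIGNED_R16;
       case MESA_FORMAT_ETC2_SIGNED_RG11_EAC:
          return MESA_FORMAT_SIGNED_GR1616;
       /* ... remaining ETC2 variants map to 8888-style formats ... */
       default:
          return fmt;
       }
    }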
| |
This can be used for two purposes: Using hand-coded shaders to determine
per-instruction timings, or figuring out which shader to optimize in a
whole application.
Note that this doesn't cover the instructions that set up the message to
the URB/FB write -- we'd need to convert the MRF usage in these
instructions to GRFs so that our offsets/times don't overwrite our
shader outputs.
Reviewed-by: Kenneth Graunke <[email protected]> (v1)
v2: Check the timestamp reset flag in the VS, which is apparently
getting set fairly regularly in the range we watch, resulting in
negative numbers getting added to our 32-bit counter, and thus large
values added to our uint64_t.
v3: Rebase on reladdr changes, removing a new safety check that proved
impossible to satisfy. Add a comment to the AOP defs from Ken's
review, and put them in a slightly more sensible spot.
v4: Check timestamp reset in the FS as well.
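In spirit, the reset guard amounts to the following (a sketch only;
the real check is emitted as shader instructions, and the exact reset
indication differs):

    #include <stdint.h>

    /* Only accumulate a delta when no timestamp reset happened between
     * the two 32-bit reads; otherwise end < start and the unsigned
     * difference would be a huge bogus value. */
    static void
    accumulate_shader_time(uint64_t *total, uint32_t start, uint32_t end)
    {
       if (end >= start)
          *total += end - start;
       /* else: drop the sample; a reset occurred mid-measurement */
    }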
| |
For getting values from the new timestamp register, the channels we
load have nothing to do with the pixels dispatched.
| |
Serious Sam 3 had a shader hitting this path, but it's used rarely so it
didn't show a significant performance difference (n=7). It does reduce
compile time massively, though -- one shader goes from 14s compile time
and 11723 instructions generated to 0.44s and 499 instructions.
Note that some shaders lose 16-wide mode because we don't support
16-wide together with pull constants at the moment (generally, things
looping over a few-element array where the loop isn't getting
unrolled). Given that
those shaders are being generated with 15-20% fewer instructions, it
probably outweighs the loss of 16-wide.
| |
I wanted to separate this step for easier reviewing when I add the
variable-index case next.
| |
v2: Fix SNB math bug in register_coalesce() where I was looking at the
instruction to be removed, not the instruction to be copy propagated
into.
| |
This gen6 restriction was removed in gen7, as the work to merge the
mathbox in and make math behave like a normal instruction was finished
in the hardware.
| |
The gen7 send-from-GRF path is sufficiently different from the perspective of
IR generation and optimization that I just made it a separate opcode.
v2: fix whitespace, rebase on Ken's recent refactor.
| |
We're going to use another send message for handling loads with a varying
per-fragment array index.
| |
As of gen7, we can skip the header on some messages, and this can make
optimization on those messages much nicer when you've got GRFs instead of MRFs
as the source.
| |
In the VS case, we were missing the entire compile time in the stall
detection!
Reviewed-by: Kenneth Graunke <[email protected]>
| |
After walking our IR instructions (Mesa or GLSL), we don't want to also
mark the start of the FB/URB writes or whatever as being that IR. This
can end up being misleading when the end of the IR visit got copy
propagated out to a later instruction in the URB writes.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
The VP generation doesn't set up the output reg strings, so if you
didn't happen to get these values as 0 on the stack, you'd lose.
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Reviewed-by: Kenneth Graunke <[email protected]>
| |
These helper functions save you from writing nasty expressions like:
    if ((inst->src[1].type == BRW_REGISTER_TYPE_F &&
         inst->src[1].imm.f == 1.0) ||
        ((inst->src[1].type == BRW_REGISTER_TYPE_D ||
          inst->src[1].type == BRW_REGISTER_TYPE_UD) &&
         inst->src[1].imm.u == 1)) {
Instead, you simply get to write inst->src[1].is_one(). Simple.
Also, this makes the FS backend match the VS backend (which has these).
This patch also converts opt_algebraic to use the new helper functions.
As a consequence, it will now also optimize integer-typed expressions.
Reviewed-by: Eric Anholt <[email protected]>
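From memory, the helpers boil down to something like this sketch (the
real methods live on the backend register class and may differ):

    enum reg_file { IMM, GRF };
    enum reg_type { BRW_REGISTER_TYPE_F, BRW_REGISTER_TYPE_D,
                    BRW_REGISTER_TYPE_UD };

    /* Illustrative stand-in for the backend register class. */
    struct reg {
       reg_file file;
       reg_type type;
       union { float f; unsigned u; int i; } imm;

       /* True for an immediate 1 of float or integer type. */
       bool is_one() const {
          if (file != IMM)
             return false;
          return type == BRW_REGISTER_TYPE_F ? imm.f == 1.0f
                                             : imm.u == 1;
       }
    };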
| |
The gen4 simd16 workaround looks at ir->type to determine how much
storage to allocate for the simd16 value. In fragment programs,
texturing only ever returns float vec4s (unlike GLSL, which can also
have scalar floats or vector integers), so this is the right type.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56962
Reviewed-by: Kenneth Graunke <[email protected]>
| |
This should help avoid confusion now that we're using the gl_api enum
to distinguish between core and compatibility APIs. The corresponding
enum value for core APIs is API_OPENGL_CORE.
Acked-by: Eric Anholt <[email protected]>
Acked-by: Matt Turner <[email protected]>
Acked-by: Kenneth Graunke <[email protected]>
| |
The brw_compile structure contains the brw_instruction store and the
brw_eu_emit.c state tracking fields. These are only useful for the
final assembly generation pass; the earlier compilation stages don't
need them.
This also means that the code generator for future hardware won't have
access to the brw_compile structure, which is extremely desirable
because it prevents accidental generation of Gen4-7 code.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
Compiling shaders requires several main steps:
1. Generating VS IR from either GLSL IR or Mesa IR
2. Optimizing the IR
3. Register allocation
4. Generating assembly code
This patch splits out step 4 into a separate class named "vec4_generator."
There are several reasons for doing so:
1. Future hardware has a different instruction encoding. Splitting
this out will allow us to replace vec4_generator (which relies
heavily on the brw_eu_emit.c code and struct brw_instruction) with
a new code generator that writes the new format.
2. It reduces the size of the vec4_visitor monolith. (Arguably, a lot
more should be split out, but that's left for "future work.")
3. Separate namespaces allow us to make helper functions for
generating instructions in both classes: ADD() can exist in
vec4_visitor and create IR, while ADD() in vec4_generator can
create brw_instructions. (Patches for this upcoming; see the sketch
below.)
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
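A hypothetical illustration of point 3 -- one helper name, two
namespaces, two products (all signatures invented for the sketch):

    struct brw_reg { /* hardware register description */ };
    struct dst_reg { /* IR destination */ };
    struct src_reg { /* IR source */ };
    struct vec4_instruction { /* IR node */ };
    struct brw_instruction { /* hardware encoding */ };

    class vec4_visitor {
    public:
       /* Appends an IR instruction to the list being built/optimized. */
       vec4_instruction *ADD(dst_reg dst, src_reg a, src_reg b);
    };

    class vec4_generator {
    public:
       /* Encodes a hardware ADD via the brw_eu_emit.c helpers. */
       brw_instruction *ADD(brw_reg dst, brw_reg a, brw_reg b);
    };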
| |
Final code generation should never fail; if it does, that's a bug, and
there should be no user-triggerable cases where it could occur.
Also, we're not going to have a fail() method after the split.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
The brw_compile structure is closely tied to the Gen4-7 hardware
encoding. However, do_vs_prog is very generic: it just calls out to
get a compiled program and then uploads it.
This isn't ultimately where we want it, but it's a step in the right
direction: it's now closer to the code generator.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
During compilation, we allocate a bunch of things: the IR needs to last
at least until code generation...and then the program store needs to
last until after we upload the program.
For simplicity's sake, just keep it all around until we upload the
program. After that, it can all be freed.
This will also save a lot of headaches during the upcoming refactoring.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
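The lifetime pattern, sketched against Mesa's ralloc API (the
compile/upload calls here are placeholders, not real functions):

    #include "ralloc.h"   /* Mesa's hierarchical allocator */

    extern const unsigned *compile_program(void *mem_ctx); /* placeholder */
    extern void upload_program(const unsigned *program);   /* placeholder */

    void compile_and_upload(void)
    {
       void *mem_ctx = ralloc_context(NULL);

       /* Build IR, optimize, allocate registers, and generate the
        * final program store, parenting every allocation to mem_ctx. */
       const unsigned *program = compile_program(mem_ctx);
       upload_program(program);

       /* Only after the upload is it safe to free anything. */
       ralloc_free(mem_ctx);
    }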
| |
We used to steal it out of the brw_compile struct, but soon that won't
be initialized in time (and is eventually going away).
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
We used to steal it out of the brw_compile struct...but vec4_visitor
isn't going to have one of those in the future.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
This leaves only the final code generation stage in brw_vec4_emit.cpp,
moving the payload setup, run(), and brw_vs_emit functions to brw_vec4.cpp.
The fragment shader backend puts these functions in brw_fs.cpp, so this
patch also helps with consistency.
Reviewed-by: Eric Anholt <[email protected]>
Reviewed-by: Anuj Phogat <[email protected]>
| |
Fixes a ton of piglit regressions since the depthstencil fixes for gen6+.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57309
Reviewed-by: Chad Versace <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
| |
Fixes a crash in http://workshop.chromeexperiments.com/stars/ on i965,
and the new piglit test glsl-fs-clamp-5.
We were trying to emit a saturating move into a uniform, which the code
generator appropriately choked on. This was broken in the change in
32ae8d3b321185a85b73ff703d8fc26bd5f48fa7.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=57166
NOTE: This is a candidate for the 9.0 branch.
Reviewed-by: Kenneth Graunke <[email protected]>