| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In order to set up KILL sets, the dataflow code was walking the entire
array of ACPs for every instruction. If you assume the number of ACPs
increases roughly with the number of instructions, this is O(n^2). As
it turns out, regions_overlap() is not nearly as cheap as one would like
and shows up as a significant chunk on perf traces.
This commit changes things around and instead first builds an array of
exec_lists which it uses like a hash table (keyed off ACP source or
destination) similar to what's done in the rest of the copy-prop code.
By first walking the list of ACPs and populating the table and then
walking instructions and only looking at ACPs which probably have the
same VGRF number, we can reduce the complexity to O(n). This takes the
execution time of the piglit vs-isnan-dvec test from about 56.4 seconds
on an unoptimized debug build (what we run in CI) with NIR_VALIDATE=0 to
about 38.7 seconds.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the destination of an ACP entry exists only within this block, then
there's no need to keep it for dataflow analysis. We can delete it from
the out_acp table and avoid growing the bitsets any bigger than we
absolutely have to. This reduces the maximum number of global ACP
entries in the vs-isnan-dvec with software fp64 on Kaby Lake from 8630
to 3942 and takes the execution time of the piglit vs-isnan-dvec test
from about 1:16.2 on an unoptimized debug build (what we run in CI) with
NIR_VALIDATE=0 to about 56.4 seconds.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
While the number of ACPs is generally not huge compared to the number of
blocks, 16 does seem a bit small. Bumping it to 64 takes the execution
time of the piglit vs-isnan-dvec test from about 1:18.1 on an unoptimized
debug build (what we run in CI) with NIR_VALIDATE=0 to about 1:16.2.
Reviewed-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 40b3abb4d16af4cef0307e1b4904c2ec0924299e.
It is not clear that this commit was entirely correct, and unfortunately
it was pushed by error.
CC: Jason Ekstrand <[email protected]>
Acked-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This function is used in two different scenarios that for 32-bit
instructions are the same, but for 16-bit instructions are not.
One scenario is that in which we are working at a SIMD8 register
level and we need to know if a register is fully defined or written.
This is useful, for example, in the context of liveness analysis or
register allocation, where we work with units of registers.
The other scenario is that in which we want to know if an instruction
is writing a full scalar component or just some subset of it. This is
useful, for example, in the context of some optimization passes
like copy propagation.
For 32-bit instructions (or larger), a SIMD8 dispatch will always write
at least a full SIMD8 register (32B) if the write is not partial. The
function is_partial_write() checks this to determine if we have a partial
write. However, when we deal with 16-bit instructions, that logic disables
some optimizations that should be safe. For example, a SIMD8 16-bit MOV will
only update half of a SIMD register, but it is still a complete write of the
variable for a SIMD8 dispatch, so we should not prevent copy propagation in
this scenario because we don't write all 32 bytes in the SIMD register
or because the write starts at offset 16B (wehere we pack components Y or
W of 16-bit vectors).
This is a problem for SIMD8 executions (VS, TCS, TES, GS) of 16-bit
instructions, which lose a number of optimizations because of this, most
important of which is copy-propagation.
This patch splits is_partial_write() into is_partial_reg_write(), which
represents the current is_partial_write(), useful for things like
liveness analysis, and is_partial_var_write(), which considers
the dispatch size to check if we are writing a full variable (rather
than a full register) to decide if the write is partial or not, which
is what we really want in many optimization passes.
Then the patch goes on and rewrites all uses of is_partial_write() to use
one or the other version. Specifically, we use is_partial_var_write()
in the following places: copy propagation, cmod propagation, common
subexpression elimination, saturate propagation and sel peephole.
Notice that the semantics of is_partial_var_write() exactly match the
current implementation of is_partial_write() for anything that is
32-bit or larger, so no changes are expected for 32-bit instructions.
Tested against ~5000 tests involving 16-bit instructions in CTS produced
the following changes in instruction counts:
Patched | Master | % |
================================================
SIMD8 | 621,900 | 706,721 | -12.00% |
================================================
SIMD16 | 93,252 | 93,252 | 0.00% |
================================================
As expected, the change only affects SIMD8 dispatches.
Reviewed-by: Topi Pohjolainen <[email protected]>
|
|
|
|
|
|
|
|
| |
The scalar back-end uses SHADER_OPCODE_SEND for all surface messages so
we no longer need the non-logical opcodes there. Prefix them VEC4 so
it's clear that they're only used by the vec4 back-end.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
|
|
|
|
|
| |
The unused typed surface read/write support in the vec4 back-end has
been dropped and the fs back-end now uses SHADER_OPCODE_SEND for all
image and buffer ops. There's no reason to keep these opcodes around
anymore.
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the visitor attempts to enforce the regioning restrictions
that apply to double-precision instructions on CHV/BXT at NIR-to-i965
translation time. It is possible though for the copy propagation pass
to violate this restriction if a strided move is propagated into one
of the affected instructions. I've only reproduced this issue on a
future platform but it could affect CHV/BXT too under the right
conditions.
Cc: [email protected]
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
| |
The implementation of these opcodes in the generator assumes that their
arguments are packed, and it generates register regions based on that
assumption.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
dataport messages
v2: Split changes to the message type field to another patch. Suggested
by Caio.
Signed-off-by: Ian Romanick <[email protected]>
Reviewed-by: Caio Marcelo de Oliveira Filho <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the dataflow propagation algorithm would calculate the ACP
live-in and -out sets in a two-pass fixed-point algorithm. The first
pass would update the live-out sets of all basic blocks of the program
based on their live-in sets, while the second pass would update the
live-in sets based on the live-out sets. This is incredibly
inefficient in the typical case where the CFG of the program is
approximately acyclic, because it can take up to 2*n passes for an ACP
entry introduced at the top of the program to reach the bottom (where
n is the number of basic blocks in the program), until which point the
algorithm won't be able to reach a fixed point.
The same effect can be achieved in a single pass by computing the
live-in and -out sets in lock-step, because that makes sure that
processing of any basic block will pick up the updated live-out sets
of the lexically preceding blocks. This gives the dataflow
propagation algorithm effectively O(n) run-time instead of O(n^2) in
the acyclic case.
The time spent in dataflow propagation is reduced by 30x in the
GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
test-case on my CHV system (the improvement is likely to be of the
same order of magnitude on other platforms). This more than reverses
an apparent run-time regression in this test-case from my previous
copy-propagation undefined-value handling patch, which was ultimately
caused by the additional work introduced in that commit to account for
undefined values being multiplied by a huge quadratic factor.
According to Chad this test was failing on CHV due to a 30s time-out
imposed by the Android CTS (this was the case regardless of my
undefined-value handling patch, even though my patch substantially
exacerbated the issue). On my CHV system this patch reduces the
overall run-time of the test by approximately 12x, getting us to
around 13s, well below the time-out.
v2: Initialize live-out set to the universal set to avoid rather
pessimistic dataflow estimation in shaders with cycles (Addresses
performance regression reported by Eero in GpuTest Piano).
Performance numbers given above still apply. No shader-db changes
with respect to master.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
Reported-by: Chad Versace <[email protected]>
Reviewed-by: Ian Romanick <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This makes the dataflow propagation logic of the copy propagation pass
more intelligent in cases where the destination of a copy is known to
be undefined for some incoming CFG edges, building upon the
definedness information provided by the last patch. Helps a few
programs, and avoids a handful shader-db regressions from the next
patch.
shader-db results on ILK:
total instructions in shared programs: 6541547 -> 6541523 (-0.00%)
instructions in affected programs: 360 -> 336 (-6.67%)
helped: 8
HURT: 0
LOST: 0
GAINED: 10
shader-db results on BDW:
total instructions in shared programs: 8174323 -> 8173882 (-0.01%)
instructions in affected programs: 7730 -> 7289 (-5.71%)
helped: 5
HURT: 2
LOST: 0
GAINED: 4
shader-db results on SKL:
total instructions in shared programs: 8185669 -> 8184598 (-0.01%)
instructions in affected programs: 10364 -> 9293 (-10.33%)
helped: 5
HURT: 2
LOST: 0
GAINED: 2
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
v2: Fix alignment style (Topi Pohjolainen)
(Jason Ekstrand)
- Enable bit_size parameter to scattered messages to enable different
bitsizes byte/word/dword.
- Remove use of brw_send_indirect_scattered_message in favor of
brw_send_indirect_surface_message.
- Move scattered messages to surface messages namespace.
- Assert align1 for scattered messages and assume Gen8+.
- Inline brw_set_dp_byte_scattered_read.
v3: (Jason Ekstrand)
- Use renamed brw_byte_scattered_data_element_from_bit_size method
- Assert scattered read for Gen8+ and Haswell.
- Use conditional expresion at components_read.
- Include comment about params for scattered opcodes.
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
v2: (Jason Ekstrand)
- Enable bit_size parameter to scattered messages to enable different
bitsizes byte/word/dword.
- Remove use of brw_send_indirect_scattered_message in favor of
brw_send_indirect_surface_message.
- Move scattered messages to surface messages namespace.
- Assert align1 for scattered messages and assume Gen8+.
- Inline brw_set_dp_byte_scattered_write.
v3: - Remove leftover newline (Topi Pohjolainen)
- Rename brw_data_size to brw_scattered_data_element and use
defines instead of an enum (Jason Ekstrand)
- Assert scattered write for Gen8+ and Haswell (Jason Ekstrand)
Signed-off-by: Jose Maria Casanova Crespo <[email protected]>
Signed-off-by: Alejandro Piñeiro <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|
|
Mostly a dummy git mv with a couple of noticable parts:
- With the earlier header cleanups, nothing in src/intel depends
files from src/mesa/drivers/dri/i965/
- Both Autoconf and Android builds are addressed. Thanks to Mauro and
Tapani for the fixups in the latter
- brw_util.[ch] is not really compiler specific, so it's moved to i965.
v2:
- move brw_eu_defines.h instead of brw_defines.h
- remove no-longer applicable includes
- add missing vulkan/ prefix in the Android build (thanks Tapani)
v3:
- don't list brw_defines.h in src/intel/Makefile.sources (Jason)
- rebase on top of the oa patches
[Emil Velikov: commit message, various small fixes througout]
Signed-off-by: Emil Velikov <[email protected]>
Reviewed-by: Jason Ekstrand <[email protected]>
|