| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dispatch.
The eliminate_find_live_channel optimization eliminates
FIND_LIVE_CHANNEL instructions in cases where control flow is known to
be uniform, and replaces them with 'MOV 0', which in turn unblocks
subsequent elimination of the BROADCAST instruction frequently used on
the result of FIND_LIVE_CHANNEL. This is however not correct in
per-sample fragment shader dispatch because the PSD can dispatch a
fully unlit sample under certain conditions. Disable the optimization
in that case.
Reviewed-by: Jason Ekstrand <[email protected]>
v2: Add devinfo argument to brw_stage_has_packed_dispatch() to
implement hardware generation check.
|
|
|
|
|
|
|
| |
It might be useful to actually handle this once copy propagation
becomes smarter about register-misaligned offsets.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
register coalesce.
Because the pass already checks that the destination offset of each
'scan_inst' that needs to be rewritten matches 'inst->src[0].offset'
exactly, the final offset of the rewritten instruction is just the
original destination offset of the copy. This is in preparation for
adding support for sub-GRF offsets to the VEC4 IR.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
| |
MOV source.
In preparation for adding support for sub-GRF offsets to the VEC4 IR.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
| |
check.
In preparation for adding support for sub-GRF offsets to the VEC4 IR.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
| |
For simplicity just assume that two writes to the same GRF with
different sub-GRF offsets will potentially interfere and break the
dependency control chain. This is in preparation for adding sub-GRF
offset support to the VEC4 IR.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
| |
bytes.
This simplifies things slightly and makes the pass more correct in
presence of sub-GRF offsets.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
| |
This should also have the side effect of fixing convert_to_hw_regs()
to handle sub-GRF register offsets.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
| |
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
| |
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
| |
C.f. 'i965/fs: Print fs_reg::offset field consistently for all
register files.'.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
This makes sure that overlap checks are done correctly throughout the
back-end when the '*this' register starts before the register/size
pair provided as argument, and is actually less annoying to use than
in_range() at this point since regions_overlap() takes its size
arguments in bytes.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
units.
The previous regs_read value can be recovered by rewriting each
reference of regs_read() like 'x = i.regs_read(j)' to 'x =
DIV_ROUND_UP(i.size_read(j), reg_unit)'.
For the same reason as in the previous patches, this doesn't attempt
to be particularly clever about simplifying the result in the interest
of keeping the rather lengthy patch as obvious as possible. I'll come
back later to clean up any ugliness introduced here.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in bytes.
The previous regs_written field can be recovered by rewriting each
rvalue reference of regs_written like 'x = i.regs_written' to 'x =
DIV_ROUND_UP(i.size_written, reg_unit)', and each lvalue reference
like 'i.regs_written = x' to 'i.size_written = x * reg_unit'.
For the same reason as in the previous patches, this doesn't attempt
to be particularly clever about simplifying the result in the interest
of keeping the rather lengthy patch as obvious as possible. I'll come
back later to clean up any ugliness introduced here.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
::regs_written.
This is in preparation for dropping vec4_instruction::regs_read and
::regs_written in favor of more accurate alternatives expressed in
byte units. The main reason these wrappers are useful is that a
number of optimization passes implement dataflow analysis with
register granularity, so these helpers will come in handy once we've
switched register offsets and sizes to the byte representation. The
wrapper functions will also make sure that GRF misalignment (currently
neglected by most of the back-end) is taken into account correctly in
the calculation of regs_read and regs_written.
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
expressed in bytes.
The dst/src_reg::offset field in byte units introduced in the previous
patch is a more straightforward alternative to an offset
representation split between ::reg_offset and ::subreg_offset fields.
The split representation makes it too easy to forget about one of the
offsets while dealing with the other, which has led to multiple FS
back-end bugs in the past. To make the matter worse the unit
reg_offset was expressed in was rather inconsistent, for uniforms it
would be expressed in either 4B or 16B units depending on the
back-end, and for most other things it would be expressed in 32B
units.
This encodes reg_offset as a new offset field expressed consistently
in byte units. Each rvalue reference of reg_offset in existing code
like 'x = r.reg_offset' is rewritten to 'x = r.offset / reg_unit', and
each lvalue reference like 'r.reg_offset = x' is rewritten to
'r.offset = r.offset % reg_unit + x * reg_unit'.
Because the change affects a lot of places and is rather non-trivial
to verify due to the inconsistent value of reg_unit, I've tried to
avoid making any additional changes other than applying the rewrite
rule above in order to keep the patch as simple as possible, sometimes
at the cost of introducing obvious stupidity (e.g. algebraic
expressions that could be simplified given some knowledge of the
context) -- I'll clean those up later on in a second pass.
v2: Fix division by the wrong reg_unit in the UNIFORM case of
convert_to_hw_regs(). (Iago)
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Generated by:
sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.c
sed -i -e 's/brw_device_info/gen_device_info/g' src/intel/**/*.h
sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.c
sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.cpp
sed -i -e 's/brw_device_info/gen_device_info/g' **/i965/*.h
Signed-off-by: Jason Ekstrand <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
var_range_end(v, n) loops over the n components of variable number v and
finds the maximum value, giving the last use of any component of v.
Therefore it expects v to correspond to the variable associated with the
.x channel of the VGRF.
var_from_reg() however returns the variable for the first channel of the
VGRF, post-swizzle.
So, if the last register had a swizzle with y, z, or w in the swizzle
component, we would read out of bounds. For any other register, we would
read liveness information from the next register.
The fix is to convert the src_reg to a dst_reg in order to call the
dst_reg version of var_from_reg() that doesn't consider the swizzle.
Cc: [email protected]
Reviewed-by: Francisco Jerez <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
| |
The pass isn't really control-flow aware and you can get into case where it
tries to combine instructions from different blocks. This can actually
lead to an assertion failure when removing unneeded instructions if part of
the vector is set in one block and part in another. This prevents
regressions in the next commit.
Signed-off-by: Jason Ekstrand <[email protected]>
Cc: "12.0" <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
When computing where the first non-payload GRF starts, we can't rely on
the number of attributes, as each attribute can be using 1 or 2 slots
depending on whether they are a dvec3/4 or other.
Instead, we need to use the number of slots used by the attributes.
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
Do not use total attributes because a dvec3/dvec4 attribute requires two
slots. So rather use total attribute slots.
v2: do not use loop to calculate required attribute slots (Kenneth
Graunke)
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Topi Pohjolainen <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is not strictly required for the following changes because none
of the three-source opcodes we support at the moment in the compiler
back-end has been removed or redefined, but that's likely to change in
the future. In any case having hardware instructions specified as a
pair of hardware device and opcode number explicitly in all cases will
simplify the opcode look-up interface introduced in a subsequent
commit, since the opcode number alone is in general ambiguous.
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
A future series will implement support for an instruction that happens
to have the same opcode number as another instruction we support
already on a disjoint set of hardware generations. In order to
disambiguate which instruction it is brw_instruction_name() will need
some way to find out which device we are generating code for.
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, opt_vector_float() always interpreted MOV sources as
floating point, and always created a MOV with a F-type destination.
This meant that we could mess up sequences of integer loads, such as:
mov vgrf6.0.x:D, 0D
mov vgrf6.0.y:D, 1D
mov vgrf6.0.z:D, 2D
mov vgrf6.0.w:D, 3D
Here, integer 0/1/2/3 become approximately 0.0f, so we generated:
mov vgrf6.0:F, [0F, 0F, 0F, 0F]
which is clearly wrong. We can properly handle this by converting
integer values to float (rather than bitcasting), and emitting a type
converting MOV:
mov vgrf6.0:D, [0F, 1F, 2F, 3F]
To do this, see first see if the integer values (converted to float)
are representable. If so, we use a D-type MOV. If not, we then try
the floating point values and an F-type MOV. We make zero not impose
type restrictions. This is important because 0D would imply a D-type
MOV, but is often used in sequences such as MOV 0D, MOV 0x3f800000D,
where we want to use an F-type MOV.
Fixes about 54 dEQP-GLES2 failures with the vec4 VS backend. This
recently became visible due to changes in opt_vector_float() which
made it optimize more cases, but it was a pre-existing bug.
Apparently it also manages to turn more integer loads into VFs,
producing the following shader-db statistics on Haswell:
total instructions in shared programs: 7084195 -> 7082191 (-0.03%)
instructions in affected programs: 246027 -> 244023 (-0.81%)
helped: 1937
total cycles in shared programs: 65669642 -> 65651968 (-0.03%)
cycles in affected programs: 531064 -> 513390 (-3.33%)
helped: 1177
v2: Handle the type of zero better.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We don't handle this properly - we'd have to perform the type conversion
before trying to convert the value to a VF.
While we could do that, it doesn't seem particularly useful - most
vector loads should be consistently typed (all float or all integer).
As a special case, we do allow type-converting MOVs of integer 0, as
it's represented the same regardless of the type. I believe this case
does actually come up.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
After the previous patch, this helper is only called in one place.
So, just fold it back in - there are a lot of parameters here and
not much code.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reworks opt_vector_float() so that there's only one place that
flushes out any accumulated state and emits a VF.
v2: Don't break the sequence for non-representable numbers - just skip
recording their values. Only break it for non-MOVs or register
changes.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
| |
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
| |
Reviewed-by: Kristian Høgsberg <[email protected]>
Reviewed-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit moves us to an instruction based model rather than a
register-based model for indirects. This is more accurate anyway as we
have to emit instructions to resolve the reladdr. It's also a lot simpler
because it gets rid of the recursive reladdr problem by design.
One side-effect of this is that we need a whole new algorithm in
move_uniform_array_access_to_pull_constants. This new algorithm is much
more straightforward than the old one and is fairly similar to what we're
already doing in the FS backend.
Reviewed-by: Kristian Høgsberg <[email protected]>
Acked-by: Kenneth Graunke <[email protected]>
|
|
|
|
|
|
|
|
|
| |
NIR already has this optimization and it can do much better than the little
peephole in the backend.
No shader-db change on Haswell or Broadwell.
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bug found by the liveness analysis validation pass that will be
introduced in a later commit. The no-op MOV check in
opt_register_coalesce() was removing instructions which makes the
cached liveness analysis calculation inconsistent with the shader IR.
We were failing to set progress to true in that case though, which
means that invalidate_live_intervals() wouldn't necessarily be called
at the end of the function.
Cc: [email protected]
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
opt_vector_float() transforms several scalar MOV operations to a single
vectorial MOV.
This is done when those MOV covers all the components of the destination
register. So something like:
mov vgrf3.0.xy:D, 0D
mov vgrf3.0.w:D, 1065353216D
mov vgrf3.0.z:D, 0D
is transformed in:
mov vgrf3.0:F, [0F, 0F, 0F, 1F]
But there are cases where not all the components are written. For
example, in:
mov vgrf2.0.x:D, 1073741824D
mov vgrf3.0.xy:D, 0D
mov vgrf3.0.w:D, 1065353216D
mov vgrf4.0.xy:D, 1065353216D
mov vgrf4.0.w:D, 0D
mov vgrf6.0:UD, u4.xyzw:UD
Nor vgrf3 nor vgrf4 .z components are written, so the optimization is
not applied.
But it could be applied anyway with the components covered, using a
writemask to select the ones written. So we could transform it in:
mov vgrf2.0.x:D, 1073741824D
mov vgrf3.0.xyw:F, [0F, 0F, 0F, 1F]
mov vgrf4.0.xyw:F, [1F, 1F, 0F, 0F]
mov vgrf6.0:UD, u4.xyzw:UD
This commit does precisely that: opportunistically apply
opt_vector_float() when possible.
total instructions in shared programs: 7124660 -> 7114784 (-0.14%)
instructions in affected programs: 443078 -> 433202 (-2.23%)
helped: 4998
HURT: 0
total cycles in shared programs: 64757760 -> 64728016 (-0.05%)
cycles in affected programs: 1401686 -> 1371942 (-2.12%)
helped: 3243
HURT: 38
v2: change vectorize_mov() signature (Matt).
v3: take in account predicates (Juan).
v4 [mattst88]: Update shader-db numbers. Fix some whitespace issues.
Reviewed-by: Matt Turner <[email protected]>
Signed-off-by: Juan A. Suarez Romero <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Now that each stage is directly calling brw_nir_lower_io(), and we have
per-stage helper functions, it makes sense to just call the relevant one
directly, rather than going through multiple switch statements.
This also eliminates stupid function parameters, such as the two that
only apply to vertex attributes.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Iago Toral Quiroga <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Gen4/5's SEL instruction cannot use conditional modifiers, so min/max
are implemented as CMP + SEL. Handling that after optimization lets us
CSE more.
On Ironlake:
total instructions in shared programs: 6426035 -> 6422753 (-0.05%)
instructions in affected programs: 326604 -> 323322 (-1.00%)
helped: 1411
total cycles in shared programs: 129184700 -> 129101586 (-0.06%)
cycles in affected programs: 18950290 -> 18867176 (-0.44%)
helped: 2419
HURT: 328
Reviewed-by: Francisco Jerez <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
| |
brw_draw_upload.c uploads VertexID/InstanceID first, then DrawID.
So we need to assign the attribute mapping in that order as well.
Fixes the following Pigit tests with the vec4 backend:
- arb_shader_draw_parameters-drawid vertexid
- arb_shader_draw_parameters-drawid-indirect basevertex
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Kristian Høgsberg <[email protected]>
|
|
|
|
|
|
|
|
| |
This allows ls, and scripts to get the file names in the correct order of
optimization.
Signed-off-by: Ben Widawsky <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
| |
This is no longer necessary...and it doesn't make much sense to
have inputs as destinations.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch re-implements the pre-Haswell VS attribute workarounds.
Instead of emitting shader code in the vec4 backend, we now simply
call a NIR pass to emit the necessary code.
This simplifies the vec4 backend. Beyond deleting code, it removes
the primary use of ATTR as a destination. It also eliminates the
requirement that the vec4 VS backend express the ATTR file in terms
of VERT_ATTRIB_* locations, giving us a bit more flexibility.
This approach is a little different: rather than munging the attributes
at the top, we emit code to fix them up when they're accessed. However,
we run the optimizer afterwards, so CSE should eliminate the redundant
math. It may even be able to fuse it with other calculations based on
the input value.
shader-db does not handle non-default NOS settings, so I have no
statistics about this patch.
Note that the scalar backend does not implement VS attribute
workarounds, as they are unnecessary on hardware which allows SIMD8 VS.
v2: Do one multiply for FIXED rescaling and select components from
either the original or scaled copy, rather than multiplying each
component separately (suggested by Matt Turner).
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
Reviewed-by: Chris Forbes <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Bit 0 of the Patch Header is "TR DS Cache Disable". Setting that bit
disables the DS Cache for tessellator-output topologies resulting in
stitch-transition regions (but leaves it enabled for other cases).
We probably shouldn't leave this to chance - the URB could contain
garbage - which could result in the cache randomly being turned on
or off.
This patch makes the final EOT write 0 to the first DWord (which
only contains this one bit). This ensures the cache is always on.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
| |
Cc: "11.1" <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
| |
Shorter than compiler->scalar_stage[MESA_SHADER_VERTEX], which can
help with line-wrapping.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While most align16 instructions only support a SubRegNum of 0 or 4
(using swizzling to control the other channels), 3-src instructions
actually support arbitrary SubRegNums. When the RepCtrl bit is set,
we believe it ignores the swizzle and uses the equivalent of a <0,1,0>
region from the subnr.
In the past, we adopted a vec4-centric approach of specifying subnr of
0 or 4 and a swizzle, then having brw_eu_emit.c convert that to a proper
SubRegNum. This isn't a great fit for the scalar backend, where we
don't set swizzles at all, and happily set subnrs in the range [0, 7].
This patch changes brw_eu_emit.c to use subnr and swizzle directly,
relying on the higher levels to set them sensibly.
This should fix problems where scalar sources get copy propagated into
3-src instructions in the FS backend. I've only observed this with
TES push model inputs, but I suppose it could happen in other cases.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Matt Turner <[email protected]>
|
|
|
|
|
|
|
|
|
|
| |
We have to break open a new vec4 for gl_DrawIDARB. We've used up all
space in the vec4 we use for SGVS and gl_DrawIDARB has to come from its
own separate vertex buffer anyway. This is because we point the vb for
base vertex and base instance into the draw parameter BO for indirect
draw calls, but the draw id is generated by mesa in a different buffer.
Reviewed-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
| |
We already have gl_BaseVertexARB in the .x component of the SGVS vec4
and plug gl_BaseInstanceARB into the last free component (.y).
Reviewed-by: Anuj Phogat <[email protected]>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Setting interleave on the TCS EOT message causes Ivybridge hardware to
GPU hang like crazy. Individual tests would pass, but running even a
simple test like nop.shader_test in a loop would hang within 1-3 runs.
Adding sleep delays worked around the problem, somehow.
Interleave doesn't make much sense given that we only have one patch
URB handle, not two. Complete doesn't seem useful either.
There's no reason to actually set those bits. We were just being lazy.
Signed-off-by: Kenneth Graunke <[email protected]>
Reviewed-by: Edward O'Callaghan <[email protected]>
Reviewed-by: Jordan Justen <[email protected]>
|